6600-style out-of-order scoreboard designs (ariane)


lk...@lkcl.net

Apr 16, 2019, 3:26:04 AM
to RISC-V HW Dev, Florian Zaruba, MitchAlsup, Dr Jonathan Kimmitt
hiya florian,

i'm ccing hw-dev as below may prove useful to other people implementing out-of-order designs.  i found the PDF overview for ariane which shows that you're implementing (incompletely) a CDC 6600-style out-of-order architecture [1]

if implemented in full you will achieve precise exceptions, need no "rollback" mechanism or "physical-architectural-register-file" nonsense, and yet be extremely power-efficient.  this being a known goal of ariane i thought you might appreciate some insights, below.

from the CDC 6600 patent (and the academic literature) most people understand that you need a Q-Table, which can be done either as a 1D binary array of register indices (of length equal to the number of FUs) *or* as a unary matrix of bits, N=FUs, M=NumRegs, where in the M dimension only one bit at a time is set [2]
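The two Q-Table encodings described above can be sketched in plain Python. This is only an illustration: the sizes and the names `claim`, `producers_binary` and `producers_unary` are made up for the example and do not come from ariane or the book chapters.

```python
NUM_FUS = 4
NUM_REGS = 8

# 1D binary form: one register index per Function Unit (None = FU idle).
q_binary = [None] * NUM_FUS

# 2D unary form: q_unary[fu][reg] is a single bit, at most one bit set per FU row.
q_unary = [[0] * NUM_REGS for _ in range(NUM_FUS)]

def claim(fu, reg):
    """Record that `fu` will produce the result destined for register `reg`."""
    q_binary[fu] = reg
    q_unary[fu] = [1 if r == reg else 0 for r in range(NUM_REGS)]

def producers_binary(reg):
    # binary form: every lookup compares a stored multi-bit index against `reg`
    return [fu for fu in range(NUM_FUS) if q_binary[fu] == reg]

def producers_unary(reg):
    # unary form: the index is already decoded, so the lookup is a single-bit read
    return [fu for fu in range(NUM_FUS) if q_unary[fu][reg]]

claim(2, 5)
claim(0, 3)
print(producers_binary(5), producers_unary(5))
```

Both lookups return the same answer; the difference is only in how much comparison logic each one implies in hardware.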

what most people do *not* understand is that you also need an N x N (N=FUs) "dependency" matrix as well, and that in each cell of that N x N matrix there are *multiple* logic blocks handling the commit-blocking signals (write-hazards), those signals usually designated as coming in from the "top", and the usually-expected read hazards coming in from the side, to meet in each cell in an easy-to-lay-out fashion.

no disrespect intended to chris' team (i love the BOOM branch prediction algorithm, chris), however listening to chris celio's design advice here will *not help you*, because his team have implemented the TOMASULO algorithm, which involves a Reorder Buffer, which in turn means they need a CAM, and that is severely power-hungry.

although the Tomasulo algorithm is topologically equivalent to a (properly implemented) 6600-style algorithm, the topological morphing required [3] leaves very little that either design may use - or learn - from the other (without a full comprehensive study of both)

the 6600-style algorithm is extremely power-efficient, requires far fewer gates, and *does not need CAMs*.  instead it uses unary or binary array encoding as a DIRECT substitute for a CAM, providing exactly the same end-result, and only needing a single AND gate to indicate "active detection".

a CAM clearly requires *multiple* AND gates... *PER ENTRY* in the array.

thus it is clearly far more power-efficient to use 1D-binary-array or 2D-unary encoding.
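A rough software model of that comparison, assuming 32 registers and 5-bit tags (both numbers are assumptions for the example). `cam_match` models one CAM entry as a per-bit compare followed by an AND across all tag bits; `unary_match` models the unary form, where each cell is one AND of two already-decoded bits.

```python
NUM_REGS = 32
TAG_BITS = 5  # log2(NUM_REGS)

def cam_match(entry_tag, search_tag):
    # CAM entry: one XNOR per tag bit, then an AND across all TAG_BITS results
    per_bit = [((entry_tag >> i) & 1) == ((search_tag >> i) & 1)
               for i in range(TAG_BITS)]
    return all(per_bit)

def one_hot(reg):
    # unary (one-hot) encoding of a register number
    return [1 if r == reg else 0 for r in range(NUM_REGS)]

def unary_match(entry_row, search_row):
    # unary form: each cell is a single AND of two single-bit signals
    return any(e & s for e, s in zip(entry_row, search_row))

print(cam_match(7, 7), unary_match(one_hot(7), one_hot(7)))
print(cam_match(7, 9), unary_match(one_hot(7), one_hot(9)))
```

The two functions always agree on the result. The unary form spends more storage bits per entry in exchange for dropping the per-entry comparator.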

each cell in the N x N Dependency Matrix basically combines in an OR fashion its write hazard lines (commit-blockers).  those commit-blockers may be:

* exception blockers.  these also handle interrupts.
* branch speculation commit-blockers
* LD/ST blockers (LD/ST management is best done as its own OxO matrix that then feeds its write-hazards to the NxN one)
* "the usual" register-based result blockers (write hazards) which everyone thinks, from the academic literature (and the expired patents) on the 6600, is the only thing that scoreboards can be used for [hint: it's not].
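As a minimal sketch of the OR-combining just listed: each blocker source is modelled as a single bit, and commit is held off while any of them is raised. The `Blockers` class and its field names are hypothetical, chosen only to mirror the four bullet points.

```python
from dataclasses import dataclass

@dataclass
class Blockers:
    exception: bool = False   # an earlier instruction might still trap (or an interrupt is pending)
    branch: bool = False      # still inside an unresolved branch shadow
    ldst: bool = False        # fed in from the separate LD/ST matrix
    reg_write: bool = False   # the usual register-based write hazard

def commit_blocked(b: Blockers) -> bool:
    # every source is one bit, so combining them is just an OR
    return b.exception or b.branch or b.ldst or b.reg_write

print(commit_blocked(Blockers()))
print(commit_blocked(Blockers(branch=True)))
```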

* exception blockers basically stop all down-stream instructions from committing.  once the instruction that *MIGHT* have to throw an exception *KNOWS* that it does not need to throw an exception, it drops its write-hazard line.  if it does, it throws the "Go-Die" switch on down-stream instructions.  in this way you get PRECISE exceptions.  ta-daaaa.

THIS PRECISE EXCEPTION CAPABILITY IS NOT ACKNOWLEDGED BY THE ACADEMIC LITERATURE ON THE 6600, and it is leading to designers creating extremely power-inefficient designs [or a design suffering unnecessarily from imprecise exceptions]

* interrupts *as* an exception means that you *do not* need to do global masking.  commit-blocking on exceptions (interrupts) *is* in effect "masking" [selectively].  it's directly functionally equivalent, yet the "masking" idea that you described last week, Florian, is the "Nuclear Option", blasting away all and any possibility of *any* interrupts.

* branch speculation commit-blockers basically also hook into the "go-die" (instruction cancellation) capability that is needed for clearing out the Function Units if an exception occurs.  i nick-named it the "Schroedinger wire" :)  at the branch-speculation point, you very simply hold a commit-block on all down-stream instructions (in *both* paths if you choose to do that), and when the branch outcome is known, you either drop the write-hazard (for the correct path) or you call the "go-die" (instruction cancel) wire (for the wrong one).

it's surprisingly very simple!
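A small behavioural sketch of the shadow mechanism described above, covering both the exception case and the branch case. The class and function names are invented for illustration and it models only the hold / drop / go-die decision, nothing else.

```python
class Shadowed:
    """An in-flight instruction issued behind an unresolved branch or a possible exception."""
    def __init__(self, name):
        self.name = name
        self.commit_blocked = True   # write-hazard held by the shadowing instruction
        self.cancelled = False

def resolve(shadowed, speculation_ok):
    """Called once the shadowing instruction knows its outcome."""
    for ins in shadowed:
        if speculation_ok:
            ins.commit_blocked = False   # drop the write-hazard: free to commit
        else:
            ins.cancelled = True         # "go-die": the result never reaches the regfile

def committable(instrs):
    return [i.name for i in instrs if not i.commit_blocked and not i.cancelled]

good_path = [Shadowed("add"), Shadowed("mul")]
bad_path = [Shadowed("sub")]
print(committable(good_path + bad_path))   # nothing may commit while the shadow is held
resolve(good_path, speculation_ok=True)
resolve(bad_path, speculation_ok=False)
print(committable(good_path + bad_path))
```

Because nothing in the shadow has written to the register file before `resolve` is called, cancelling needs no rollback of architectural state.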

* LD/ST blockers, they require their own separate OxO (very-sparse) matrix, with their own sparse-array of hazards (down the main diagonal): LDs block stores, and STs block LDs.  *both* the LD and ST write-commit-blocker signals drop down from above, into the *NxN* Dependency Matrix.

* you have the "usual" register-based write-blocking, and also (which i really like), you have operand (result) forwarding automatically built-in.  however, you have added logic that detects whether an exception has occurred into the operand forwarding block... where, actually, as you can see above, exceptions *need* their own write-hazards, and once cleared, it will be SAFE to forward the operands.

in scoreboard.sv i am not seeing any evidence of the combining of those signals, meaning that you are running into the very difficulties with exceptions (interrupts) that you outlined last week on the list, and you will also be running into difficulties with LD and ST.
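For the LD/ST blockers mentioned a few paragraphs up, here is a deliberately simplified sketch of "LDs block stores, and STs block LDs". It assumes memory operations are tracked in program order and ignores address comparison entirely, which is an assumption of this example rather than something stated above.

```python
def ldst_blocked(ops):
    """
    ops: list of (kind, done) in program order, kind is "LD" or "ST".
    Returns one flag per op: True if an earlier, unfinished op of the
    *other* kind is still holding a hazard against it.
    """
    blocked = []
    for i, (kind, _) in enumerate(ops):
        earlier = ops[:i]
        blocked.append(any(k != kind and not done for k, done in earlier))
    return blocked

ops = [("LD", False), ("LD", False), ("ST", False), ("LD", False)]
print(ldst_blocked(ops))
```

In the scheme described above, each of these flags would then feed into the NxN Dependency Matrix as one more commit-blocker.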

i did note that you have a per-FU instruction-in-flight counter, which is excellent.  the reason why is because these counters can be used to turn this into a multi-issue design very very easily.  you very simply:

* count up the number of FUs ready to "commit" (that have no hazards remaining) - using a popcount
* set a threshold of the number that are *ALLOWED* to commit
* if the per-FU instruction-in-flight counter is less than this threshold, allow commit!
* count up the number that were *actually* committed...
* subtract that global count from EVERY one of the per-FU instruction-in-flight counters.
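The steps above can be sketched as follows. Two assumptions are made that are not stated in the list: each counter is taken to hold the instruction's distance from the oldest one in flight (0 = oldest), and only a contiguous oldest-first run of ready instructions is committed, so that the subtraction never drives a counter negative. All names are hypothetical.

```python
def commit_cycle(age, ready, max_commits):
    """
    age:         dict fu -> in-flight counter (0 = oldest instruction in flight)
    ready:       set of FUs with no hazards remaining
    max_commits: threshold of how many results may commit this cycle
    Returns (list of committed FUs, updated counter dict).
    """
    committed = []
    for fu in sorted(age, key=age.get):
        # only counters below the threshold may commit; stop at the first
        # instruction that is not ready (assumed contiguity rule)
        if age[fu] >= max_commits or fu not in ready:
            break
        committed.append(fu)
    n = len(committed)   # the number *actually* committed
    # subtract that global count from every remaining counter
    new_age = {fu: a - n for fu, a in age.items() if fu not in committed}
    return committed, new_age

age = {"alu0": 0, "mul": 1, "fpu": 2, "alu1": 3}
print(commit_cycle(age, ready={"alu0", "mul", "alu1"}, max_commits=2))
```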

then all you need do is extend the instruction decode and issue phase to drop more than one result into the system: ta-daaa, now you have turned a single-issue design into a multi-issue one :)

of course the multi-porting on the operand forwarding will go up, and the multi-porting on the register file will go up as well: in the Libre RISC-V SoC we stratify the register file (and double/quadruple the number of FUs as well, to match) to avoid this.

this approach has the rather weird side-effect that in any given cycle at most one result per bank (register number modulo 4) may be multi-issue committed.  by that i mean that if there are operations which can commit to r4, r8, r12 and r16, these MUST be done sequentially (in 4 cycles), however if we have operations which can commit to r1, r6, r3 and r8 (modulo 4 those are 1, 2, 3 and 0), that's okay because of the 4-bank stratification.

it's a little weird however it means that we can use 4x 32-deep banks of 3R1W SRAM instead of requiring a *COMPLETELY INSANE* 256-entry 64-bit 12R4W ported SRAM.
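The bank arithmetic can be checked with a few lines of Python. The sketch assumes one write port per bank (the 1W in 3R1W) and that a register's bank is simply its number modulo 4; `commit_cycles` is a made-up helper name.

```python
from collections import Counter

def commit_cycles(dest_regs, banks=4):
    """Cycles needed to commit results to dest_regs with one write port per bank."""
    per_bank = Counter(r % banks for r in dest_regs)
    return max(per_bank.values(), default=0)

print(commit_cycles([4, 8, 12, 16]))   # all four land in bank 0
print(commit_cycles([1, 6, 3, 8]))     # banks 1, 2, 3, 0: one each
```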

anyway, above are some insights that will help you to avoid a *lot* of design pain, and, if implemented, will give you a stonkingly-good power-performance ratio, for very little effort, and without having the kinds of compromises *normally* expected of a Tomasulo (ROB) algorithm *or* the limitations that a 6600-style architecture is *believed* (incorrectly) to have.

btw if you're happy to listen, i'm happy to describe an algorithm that provides reversible "Q-Table history", allowing full and precise restoration of register names when operand-forwarding opportunities are detected and an exception occurs.

the "normal" way to deal with this situation is to trash the ENTIRE scoreboard (obliterating every single in-flight instruction), then place the system into a treacle-like "single-issue single-FU-execution" mode and to walk forward at a snail's pace until the exception hurdle has been cleared.

the scheme that i came up with in december (thanks to mitch alsup for the very inspiring discussions) called "Q-Table history" allows full detection of in-flight operand-forwarding opportunities AND reversibility and restoration on branch speculation and exceptions.

this would result in a significant reduction in the number of writes to the register file: depending on the history depth (properly allocated) all *and any* operand-forwarding opportunities will be detected and eliminated.

lastly, if anyone would like to receive copies of mitch alsup's book chapters that augment the book "Design of a Computer", written by James Thornton and Seymour Cray, i have permission to send copies, you will need to give credit and acknowledgement to mitch.  these book chapters go into much better detail than the overview above, and include the gate-level diagrams needed to do a proper implementation of the capabilities mentioned briefly above.


[2] most implementors choose the 1D binary array because of traditional register-file designs as SRAMs.  a unary matrix is actually much more efficient because the unary array *is* the address, meaning that the SRAM's "traditional" address-decode block may be REMOVED (is completely redundant)

[3] if you limit the number of rows in each Tomasulo Reservation Station to one and only one, it effectively becomes the direct equivalent of the "operand latches" present at the front of a 6600-style Function Unit.  this in turn allows the CAM of the ROB to be replaced with a unary matrix, as it is (was) only the *multi-entry* capability of Reservation Stations that required the introduction of a CAM in the first place.

lk...@lkcl.net

Apr 19, 2019, 9:22:10 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.

register-renaming (and built-in operand forwarding) is automatically achieved with the 6600-style scoreboard design by way of the Q-Table working along-side the NxN Function Unit Dependency Matrix.  operand forwarding in the 6600 was achieved by using same-cycle (actually, falling-edge) write-through capability on the Register File.  in modern processors, "write-through" of SRAMs can be used for this exact same purpose. this *can* be augmented by a separate additional operand-forwarding bus, very similar to an SRAM except... without the SRAM.

also, pipelining in the 6600 was sort-of achieved by a "revolving door" created by the three-way interaction between Go_Read (register file read), Go_Write (Function Unit latch write) and Go_Commit [write to regfile, i think: mitch can you confirm?].  only two of these signals were permitted to be raised at any one time, and only when one of them was HI could the next-in-the-chain go HI, at which point the previous one would drop, hilariously a bit like a very short caterpillar going round in circles, forever.  operand-forwarding was achieved when two of these lines (one of them Go_Read) were HI for example.

the important thing is that actual pipelining was not introduced until the 7600.  not many details are known about the 7600, however the Function Unit input-latches and corresponding associated output-latch had this "revolving door" 3-way MUTEX on each of them *even in the 7600*.

in the creation of the ariane scoreboard.sv, however, there is a problem, that is masked / hidden by the fact that the integer ALU operations are all single-cycle.

* on the div unit, you *do* have a form of MUTEX that blocks the Function Unit input operands from being used whilst the DIV operation is in progress.  you will therefore not experience any problems with DIV results.

* on the integer FUs, these are single-cycle, so the problem is *HIDDEN*.  note however there will also be a performance ceiling (64-bit MUL in a single cycle) due to gate latency, that, if fixed by creating a multi-stage integer-mul, *WILL* result in problems.

* on the LD/ST FU, you *may* see problems.  i haven't investigated in depth, because the design is deviating from the 6600 by not splitting out LD/ST into its own separate sparse-array matrix (see section 10.7)

* on the FPU FUs, on which i see no evidence of a MUTEX, and which i assume are multi-stage, you *WILL* experience problems.

the problem is: that without MUTEXes blocking the Function Unit from re-use until the output is generated, the Q-Table will become corrupted.  this will show up *ONLY* during exceptions (and possibly branch speculation cancelling), because it is exceptions where rollback is initiated.

when a destination (result) register number is transferred through the Q-Table to a Function Unit, it does so by assuming that there is a commit-block on that register which "preserves" the name of that destination register.  this is sort-of done by preventing its destruction using the write-hazard infrastructure until such time as its committing can "make it disappear" safely.

thus: only when that result is *actually available* is it SAFE to DESTROY (retire) that result, because at the point at which the result is stored, all "commits" - all write hazards - have been cleared.

by allowing the Q-Table to proceed to a new entry without allowing the write-hazards to be cleared, you are DESTROYING absolutely CRITICAL information.

i repeat:

by allowing the Q-Table to tell the Function Unit that the write-hazards do not matter, rollback is NO LONGER SAFELY POSSIBLE.
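A toy model of the MUTEX being argued for, under the assumption that one set of src1-src2-result latches carries one destination register name from issue until commit. `FunctionUnitLatches` is a hypothetical name; this is not how scoreboard.sv is structured.

```python
class FunctionUnitLatches:
    """One set of operand/result latches guarded by a busy MUTEX."""
    def __init__(self):
        self.busy = False
        self.dest = None

    def issue(self, dest_reg):
        if self.busy:
            # MUTEX: refuse re-use, so the in-flight destination name survives
            return False
        self.busy, self.dest = True, dest_reg
        return True

    def commit(self):
        # only at commit (all write hazards cleared) is the name released
        dest = self.dest
        self.busy, self.dest = False, None
        return dest

fu = FunctionUnitLatches()
print(fu.issue(5), fu.issue(7), fu.dest)
print(fu.commit(), fu.issue(7))
```

Without the `busy` check the second `issue` would overwrite `dest`, which is the loss of information that makes a later cancel or rollback unsafe.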

the solution is outlined in Section 11.4.9.2 of mitch's book chapters.  reproduced with kind permission from mitch alsup, an image for people who may not have these chapters:


you can see there that there are *four* "apparent" Function Units, all with src1, src2 operand latches, and associated corresponding result latches, so it APPEARS as far as the NxN Function Unit Dependency Matrix that there are FOUR adders (or four FPUs).

there are NOT four FPUs.

there is only ONE (pipelined) FPU.

that FPU however has *FOUR* sets of src1-src2-result latches.

the absolutely critical insight here is to note that the number of FU latch-sets *must* exceed or be equal to the pipeline depth.

* if it is less, the consequence will be that the pipeline will be underutilised.
* if it is greater, there exists the increased ability of the design to undergo register renaming, however bear in mind that the FU NxN Dependency Matrix is, clearly, O(N^2).
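The under-utilisation case can be seen in a tiny cycle-count model. It assumes one issue per cycle into a single pipeline and that a latch-set stays occupied for exactly `depth` cycles after issue; both are simplifications made for this example.

```python
def issued_ops(latch_sets, depth, cycles):
    """Count operations issued into one pipeline over `cycles` cycles."""
    free_at = [0] * latch_sets   # cycle at which each latch-set becomes free again
    issued = 0
    for now in range(cycles):
        for i, t in enumerate(free_at):
            if t <= now:
                free_at[i] = now + depth   # latch-set held until the result emerges
                issued += 1
                break                      # at most one issue per cycle
    return issued

print(issued_ops(latch_sets=2, depth=4, cycles=8))   # pipeline half empty
print(issued_ops(latch_sets=4, depth=4, cycles=8))   # pipeline kept full
```

Going beyond `depth` latch-sets does not raise throughput in this model; as noted above, what it buys is more renaming opportunity at O(N^2) matrix cost.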

one solution in common usage is to merge multiple functions into the Computation Unit, funnily enough exactly as has already been done in both the ariane FPU Function Unit and the ariane integer ALU.

you will be able to check that this is the case by temporarily creating a global "de-pipelining" mutex that only permits a single operation to be carried out at any one time.  only one of an FPU operation, LD/ST operation, Branch operation or Integer operation may be permitted at any one time, *NO PIPELINING  PERMITTED AT ALL*...

... and at that point the problems that i anticipate you to be experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".

we have an implementation of a multi-in, multi-out "fan" system in nmigen (if you're able and happy to read python HDL), at least the comments are useful:


the "mid" - multiplexer id - is what is passed in down the pipeline, just as ordinary boring unmodified data, only to be used on *exit* from the pipe to identify which associated fan-out latch the result is to go to.

that mux-id is used here for example:

you can see later at lines 202, 203, 204 and 208, the mid indexes which of the "next stages" to route the incoming data to.  i really like nmigen :)


in the diagram in Mitch Alsup's book you can see that this is replaced with a FIFO (just to the side of the Concurrent Unit aka pipeline).  however that particular design strategy works for a fixed-length pipeline, i.e. it *prevents* early-out and it *prevents* the amalgamation of multiple pipelines (with different lengths) behind a common "ALU" API.

by passing the multiplexer id down through the data, early-out, reordering pipeline layouts, *and* FSMs can all be combined and the multi-in multi-out Concurrent Unit doesn't give a damn :)
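A plain-Python sketch (not nmigen) of carrying the mid alongside the data. The doubling "computation", the latency rule and the function names are all placeholders invented for the example; the point is only that results can leave the pipe in a different order from entry and still reach the right fan-out latch.

```python
def run_pipe(inputs, latency_of):
    """
    inputs:     list of (mid, value), one presented per cycle
    latency_of: function value -> cycles spent in the pipe (models early-out)
    Returns (out_latches, completion_order).
    """
    in_flight = []
    for cycle, (mid, value) in enumerate(inputs):
        # the mid travels with the data, untouched by the computation
        in_flight.append((cycle + latency_of(value), mid, value * 2))
    out_latches, order = {}, []
    for _, mid, result in sorted(in_flight):   # earliest finisher exits first
        out_latches[mid] = result              # the mid selects the fan-out latch
        order.append(mid)
    return out_latches, order

latches, order = run_pipe([(0, 10), (1, 0), (2, 7)],
                          latency_of=lambda v: 1 if v == 0 else 4)
print(latches, order)
```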

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.

two fan-outs will still be required (one for the early-out path on the FPU pipeline, one for the "normal" path on the FPU pipeline), it's just that the fanned-out outputs from each may be safely ORed together, given the *guarantee* that there will be only one mid in use at any one time.
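The OR-instead-of-MUX point can be illustrated like this, assuming each path presents a valid bit plus data and that at most one path is valid at a time (the guarantee described above). Function names are made up.

```python
def or_combine(paths):
    """paths: list of (valid, data). Relies on at most one path being valid."""
    out = 0
    for valid, data in paths:
        out |= data if valid else 0   # gate each path with its valid bit, then OR
    return out

def mux_combine(paths):
    """Reference behaviour: select the data of the valid path."""
    for valid, data in paths:
        if valid:
            return data
    return 0

early_out = (True, 0x7FC00000)   # e.g. a NaN produced by the early-out path
normal = (False, 0x12345678)     # normal path idle this cycle
print(hex(or_combine([early_out, normal])), hex(mux_combine([early_out, normal])))
```

If the one-at-a-time guarantee were ever violated, the OR would silently merge two results, so the guarantee has to come from the mid allocation itself.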

so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.

l.



Florian Zaruba

Apr 19, 2019, 9:49:22 AM
to Luke Kenneth Casson Leighton, RISC-V HW Dev, zarubaf, mitch...@aol.com, Dr Jonathan Kimmitt
Dear Luke,

thanks for all the suggestions and the book chapter.

On Fri, Apr 19, 2019 at 3:22 PM <lk...@lkcl.net> wrote:
hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.
Unfortunately, this is not the reason as I still have to pursue a PhD degree (and you won't get that solely with engineering an in-order core) ;-). I appreciate your input and I have the two book chapters as my easter lecture. I hope they will be insightful. 

register-renaming (and built-in operand forwarding) is automatically achieved with the 6600-style scoreboard design by way of the Q-Table working along-side the NxN Function Unit Dependency Matrix.  operand forwarding in the 6600 was achieved by using same-cycle (actually, falling-edge) write-through capability on the Register File.  in modern processors, "write-through" of SRAMs can be used for this exact same purpose. this *can* be augmented by a separate additional operand-forwarding bus, very similar to an SRAM except... without the SRAM.

also, pipelining in the 6600 was sort-of achieved by a "revolving door" created by the three-way interaction between Go_Read (register file read), Go_Write (Function Unit latch write) and Go_Commit [write to regfile, i think: mitch can you confirm?].  only two of these signals were permitted to be raised at any one time, and only when one of them was HI could the next-in-the-chain go HI, at which point the previous one would drop, hilariously a bit like a very short caterpillar going round in circles, forever.  operand-forwarding was achieved when two of these lines (one of them Go_Read) were HI for example.

the important thing is that actual pipelining was not introduced until the 7600, not many details are known about the 7600, however the Function Units input-latches and corresponding associated output-latch had this "revolving door" 3-way MUTEX on each of them *even in the 7600*.

in the creation of the ariane scoreboard.sv, however, there is a problem, that is masked / hidden by the fact that the integer ALU operations are all single-cycle.

* on the div unit, you *do* have a form of MUTEX that blocks the Function Unit input operands from being used whilst the DIV operation is in progress.  you will therefore not experience any problems with DIV results.

* on the integer FUs, these are single-cycle, so the problem is *HIDDEN*.  note however there will also be a performance ceiling (64-bit MUL in a single cycle) due to gate latency, that, if fixed by creating a multi-stage integer-mul, *WILL* result in problems.
The multiplier is pipelined in my design. 

* on the LD/ST FU, you *may* see problems.  i haven't investigated in depth, because the design is deviating from the 6600 by not splitting out LD/ST into its own separate sparse-array matrix (see section 10.7)

* on the FPU FUs, which i see no evidence of a MUTEX, and i assume that they're multi-stage, you *WILL* experience problems. 

the problem is: that without MUTEXes blocking the Function Unit from re-use until the output is generated, the Q-Table will become corrupted.  this will show up *ONLY* during exceptions (and possibly branch speculation cancelling), because it is exceptions where rollback is initiated.
In an in-order, single-issue processor, with a single cycle ALU you don't have much branch shadow (zero). So that should not be a problem for the moment. This can be improved once going super-scalar.  

when a destination (result) register number is transferred through the Q-Table to a Function Unit, it does so by assuming that there is a commit-block on that register which "preserves" the name of that destination register.  this is sort-of done by preventing its destruction using the write-hazard infrastructure until such time as its committing can "make it disappear" safely.

thus: only when that result is *actually available* is it SAFE to DESTROY (retire) that result, because at the point at which the result is stored, all "commits" - all write hazards - have been cleared.

by allowing the Q-Table to proceed to a new entry without allowing the write-hazards to be cleared, you are DESTROYING absolutely CRITICAL information.

i repeat:

by allowing the Q-Table to tell the Function Unit that the write-hazards do not matter, rollback is NO LONGER SAFELY POSSIBLE.

the solution is outlined in Section 11.4.9.2 of mitch's book chapters.  reproduced with kind permission from mitch alsup, an image for people who may not have these chapters:


you can see there that there are *four* "apparent" Function Units, all with src1, src2 operand latches, and associated corresponding result latches, so it APPEARS as far as the NxN Function Unit Dependency Matrix that there are FOUR adders (or four FPUs).

there are NOT four FPUs.

there is only ONE (pipelined) FPU.

that FPU however has *FOUR* sets of src1-src2-result latches.

the absolutely critical insight here is to note that the number of FU latch-sets *must* exceed or be equal to the pipeline depth.

* if it is less, the consequence will be that the pipeline will be underutilised.
* if it is greater, there exists the increased ability of the design to undergo register renaming, however bear in mind that the FU NxN Dependency Matrix is, clearly, O(N^2).

one solution in common usage is to merge multiple functions into the Computation Unit, funnily enough exactly as has already been done in both the ariane FPU Function Unit and the ariane integer ALU.

you will be able to check that this is the case by temporarily creating a global "de-pipelining" mutex that only permits a single operation to be carried out at any one time.  only one of an FPU operation, LD/ST operation, Branch operation or Integer operation may be permitted at any one time, *NO PIPELINING  PERMITTED AT ALL*...

... and at that point the problems that i anticipate you to be experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".
I am actually not experiencing any more problem: The "design flaw" which produced the buggy behavior of non-idempotent reads was associating interrupts during commit. The observation to do that during decode helped to eliminate that problem. We are happily booting Debian Linux on the FPGA and multi-core SMP Linux on the OpenPiton platform (with PLIC).

we have an implementation of a multi-in, multi-out "fan" system in nmigen (if you're able and happy to read python HDL), at least the comments are useful:


the "mid" - multiplexer id - is what is passed in down the pipeline, just as ordinary boring unmodified data, only to be used on *exit* from the pipe to identify which associated fan-out latch the result is to go to.

that mux-id is used here for example:

you can see later at lines 202, 203, 204 and 208, the mid indexes which of the "next stages" to route the incoming data to.  i really like nmigen :)


in the diagram in Mitch Alsup's book you can see that this is replaced with a FIFO (just to the side of the Concurrent Unit aka pipeline).  however that particular design strategy works for a fixed-length pipeline, i.e. it *prevents* early-out and it *prevents* the amalgamation of multiple pipelines (with different lengths) behind a common "ALU" API.
Our FPU has early out.  

by passing the multiplexer id down through the data, early-out, reordering pipeline layouts, *and* FSMs can all be combined and the multi-in multi-out Concurrent Unit doesn't give a damn :)

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.
Our FPU manages all that. 

two fan-outs will still be required (one for the early-out path on the FPU pipeline, one for the "normal" path on the FPU pipeline), it's just that the fanned-out outputs from each may be safely ORed together, given the *guarantee* that there will be only one mid in use at any one time.

so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.
One "problem" which I have, is this merged scoreboard/rob structure (as it is quite area and timing critical). I have the possibility to roll back the entire speculative state so no data corruption can occur ;-) But I am certain that the current design point can be further improved. I am hoping that the lecture you sent me will give further insights.

Best,
Florian 

l.





--
Florian Zaruba
PhD Student
Integrated Systems Laboratory, ETH Zurich
Skype: florianzaruba

lk...@lkcl.net

Apr 20, 2019, 3:33:47 AM
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk


On Friday, April 19, 2019 at 2:49:22 PM UTC+1, Florian Zaruba wrote:
Dear Luke,

thanks for all the suggestions and the book chapter.

On Fri, Apr 19, 2019 at 3:22 PM <lk...@lkcl.net> wrote:
hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.
Unfortunately, this is not the reason as I still have to pursue a PhD degree (and you won't get that solely with engineering an in-order core) ;-).

nice!  okok will try to keep it short
 
I appreciate your input and I have the two book chapters as my easter lecture. I hope they will be insightful.

the shakti team implemented the same approach, independently: Professor Kamakoti re-derived the 6600-style Q-Table concept.  so there is working source code to examine, for parallels.
 
... and at that point the problems that i anticipate you to be experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".
I am actually not experiencing any more problem: The "design flaw" which produced the buggy behavior of non-idempotent reads was associating interrupts during commit. The observation to do that during decode helped to eliminate that problem. We are happily booting Debian Linux on the FPGA and multi-core SMP Linux on the OpenPiton platform (with PLIC).

 fantastic!

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised), however it would be anticipated that this would require dual-porting on the result stage (into the multiplexer).  luckily however we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed as opposed to requiring higher gate count MUXes.
Our FPU manages all that. 

that's very cool, i'll take a look.
 
so... i think that covers it.  summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback.  fix that, and you'll have an absolutely fantastic design.
One "problem" which I have, is this merged scoreboard/rob structure (as it is quite area and timing critical).

it's very very important to remember that only the combination of separate Q-Table plus correctly-implemented Dependency Matrix, with associated single-row Reservation Stations aka Function Units with those MUTEXes on them is directly functionally one-for-one equivalent to everything that Tomasulo and a ROB provides.

and that it's extremely efficient in terms of power and gates.

* on the RS/FU src1-src2-result latches, D-Latches (3 gates) can replace flip-flops (10 gates each)
* there's no CAMs (single-bit AND-gate testing of unary matrices replaces the CAM)
* combining all of the write-hazards (and read-hazards) that block commit is just... an OR-gate of single-bit inputs.

phrases such as "surely it has to be more complex than that" and "surely it can't possibly replace the need for separate PRF-ARFs" and "surely it can't obviate the need for a ROB" tend to spring to mind quite often.


I have the possibility to roll back the entire speculative state so no data corruption can occur ;-)


yes, rollback is the "absolutely terrible" solution, aka "history / snapshots" - it requires detection of the problem, rollback of the full state (CSRs, regfile, everything), a full reset / destruction of *all* in-flight data, then putting the processor into "Seriously Slow Crawl" Mode, and guessing at how long the processor can run like that before being allowed to be de-throttled.

and it's totally unnecessary!

But I am certain that the current design point can be further improved. I am hoping that the lecture you send me is giving further insights.


let's hope it benefits other people as well.

l.

k...@dspia.com

Apr 25, 2019, 2:43:24 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
Can someone share the book chapters being discussed in these posts?

Thanks

-K

lk...@lkcl.net

Apr 26, 2019, 9:36:11 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
(with thanks to Florian for kindly agreeing to publish what was formerly a private discussion, this may be of benefit to others in the future.  i'm also attaching two key screenshots, reproduced with kind permission from Mitch Alsup.  discussion edited slightly).

On Tuesday, April 23, 2019, Florian Zaruba <zar...@iis.ee.ethz.ch> wrote:
Hi Luke,

so I've taken the time and read the pages you have sent me, thanks again. It contains some neat tricks.

Seymour Cray was a genius, well ahead of his time. Mitch learned from that. He is one of the only other people I know who thinks at the gate level and is no longer constrained by NDAs.
 
He explained to me that it is only Intel and AMD that still do massive gate level designs.

Everyone else has moved to HDLs, and it is causing significant design inefficiencies, and causing Foundries to drop support for certain kinds of cells ( t gates for example ).


I am still a bit confused on a couple of things:

1. The paper mentions an 8r/4w register file. That seems unnecessarily big for a single issue core.


That's because the context is not a single issue core. The chapters discuss *modernising* 6600.

Also remember it can be used for vector processing: single-issue instructions can still hit the regfile massively (e.g. SIMD).

Mitch was responsible for the design of AMD's 5ghz K9 architecture.

He was also a key architect behind AMD's GPU.

IIRC that section is explaining how to allocate resources correctly.

Analyse the workload (number of FMACs, ratio of LDs), then make sure the pipelines match it, then make sure the regfile matches that.
 
 You won't retire much more than one instruction and definitely not read more than two operands (let us concentrate on the integer part). Even for a dual issue approach, you read approximately 1.5 registers per instruction. So three ports should be sufficient. 

Not for a multi issue design it's not. That's the missing context.
 

Not a big deal I think that can be circumvented by some read/write port handshaking on the regfile

Yes. Remember to use write-thru SRAM, otherwise you get a clock's worth of unnecessary delay.

Then the regfile effectively becomes an "operand forwarding bus" as well!
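
To make the write-through point concrete, here is a small behavioural sketch (Python, not RTL). The class and method names are hypothetical. It only models the property described above: a read in the same cycle as a write to the same register sees the new value, so the regfile itself forwards operands.

```python
class WriteThroughRegfile:
    """Behavioural model only: names and structure are illustrative assumptions."""

    def __init__(self, nregs=32):
        self.regs = [0] * nregs
        self.pending = {}          # writes presented during the current cycle

    def write(self, regnum, value):
        self.pending[regnum] = value

    def read(self, regnum):
        # write-through: a same-cycle read sees the value being written
        if regnum in self.pending:
            return self.pending[regnum]
        return self.regs[regnum]

    def clock(self):
        # end of cycle: the pending writes become the stored contents
        for regnum, value in self.pending.items():
            self.regs[regnum] = value
        self.pending.clear()
```

Without the `pending` check in `read`, a consumer would have to wait for `clock()` before seeing the result, which is the extra clock of delay mentioned above.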
 
2. Unfortunately, the drawings are quite hard to read as they are blurry pixel graphics

Moo? Hm perhaps you have a poor quality PDF viewer. Try xpdf. Old, boring, and effective. I get v hi res images on my 3000x1800 laptop LCD out of this PDF, with xpdf.

Debian: apt-get install xpdf
 

 and the style is not very consistent or self explanatory I am sure I missed a couple of points there.

The context is that the chapters are an extension of the original book by J Thornton, "Design of a Computer". Google it, the PDF is online. James' wife says he gave permission for the scans to be put online, it was quite touching, J Thornton was clearly very old at the time the correspondence seeking permission took place.

It also helped me to be on comp.arch for nearly 3 months solid, talking with Mitch direct, to fill in the missing gaps.

Happy to do the same for you, do prefer it to be public discussion though, so others can benefit.

Can we fwd to hw-dev or libre-riscv-dev? Would like my team to be able to see info below, as well as students in future, studying our design.


 Also, there seem to be transmission gates in the drawings which I think are meant to be some kind of storage (flip-flops).. 

Quite possibly.... although in talking with Mitch he explained that only D latches are needed (3 gates). Flip flops (10 gates) are usually used by non-gate-level designers because they're "safe" (resettable).

And have a high cost.
 
3. The text talks about continuous scoreboard and dependency matrix scoreboard. 


Yes. They are definitely separate.
 
The former seems to be a distributed version which takes all the information from the reservation stations/FU and generates global signaling. Does it require extra storage somewhere centralized or can everything be computed combinatorial?

It looks that way! It really is much simpler than we have been led to believe.
 
You can confirm this by examining the original circuit diagrams from Thornton's book.

Yes, the full gate level design is in that book. It's drawn using ECL as they literally hand built the entire machine using PCBs stuffed with 3 leg transistors you can buy from RSOnline, today!

 Although not entirely clear from the text it seems that the continuous scoreboard is preferable.

Yes.
 
4. What information exactly does the reservation station/FU need to contain?

Nothing! It's a D Latch bank! That's it! Latches for the src ops, latch for the result, and ... errr... that's it.
 
Mitch and I had a debate about this (details), happy to relate when you have time.

 I assume from the drawings that it also needs to capture the write data from the computation unit. 

Yes.
 
5. I am desperately missing a high-level diagram on how the different things fall in place. 

Believe it or not, it is so much simpler than you may have been led to believe, the diagram on page 32 chap 10 *really is everything that is needed* 

Or p28 or p30. Different highlighting on different sections.


6. The text talks a lot about latches. I am not sure whether latches in the circuit sense are meant. 


Yes D latches. Really D latches. Not flip flops.

If so I would recommend you to take a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.

Mitch has been doing gate level design for over 45 years. His design input is why AMD has had such high performance designs at significantly lower clock rates than Intel.

Remember in the 90s and early 2000s how AMD had to quote "fake dash equivalent" clock rates compared to Intel cores?

Now you know why. The designs used 6600 where Intel used Tomasulo, therefore Intel was an inferior less power efficient design, end of discussion.

He did have one design failure however, said it was a bitch to debug, chip went into random state after reset. I forget how he said he learned from that. It is on comp.arch somewhere.

Basically due to his serious amounts of experience at gate level design, if Mitch says only latches are needed, I trust him.

 
7. One part of the text (11.1.1) says that instructions can enter the FUs even if they have WAW hazards; it just needs to make sure an instruction doesn't write its result back until the WAW hazard has been cleared. 


Yes. Commit phase definitely cannot proceed until all hazards - read and write - are cleared.

Instructions can be ALLOCATED to FUs, they just sit there doing nothing, just like Tomasulo RS's.

That allocation is IMPORTANT. it preserves temporal relationship, pending availability of resources basically.


The rest of the text talks about issue stalling if a WAW hazard has been detected. 

Yes. This is equivalent to when the Tomasulo algorithm drops a ROB# into a Reservation Station instead of an actual register value.

So yes it really is the instruction issue.

8. What does "issue" actually mean?

Exactly what it says.

Instruction, from decode phase, goes through Q Table processing and simultaneous FU allocation, all in 1 clock.


 Does it mean putting it in the reservation/station FU? 


Yes. And doing Q Table update which MUST BE Simultaneous.
 
Q Table values are latched btw.  Not combinatorial.
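
A minimal sketch of the "simultaneous" part, under stated assumptions: the Q-Table is modelled as one latched register number per FU (as described elsewhere in this thread), hazard checks are deliberately left out, and `IssueStage` / `clock_issue` are names invented for this example.

```python
class IssueStage:
    def __init__(self, num_fus):
        self.busy = [False] * num_fus
        self.q_table = [None] * num_fus   # one latched register number per FU

    def clock_issue(self, fu, dest_reg):
        """One 'clock' of issue: returns True if the instruction was issued."""
        if self.busy[fu]:
            return False                  # FU occupied: nothing changes at all
        # FU allocation and Q-Table update are made together,
        # never one without the other.
        self.busy[fu], self.q_table[fu] = True, dest_reg
        return True
```

The point of the sketch is only that there is no state in which the FU is reserved but the Q-Table has not yet been updated (or the other way round).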


Or does it mean bringing it to the next stage aka read operands?

 No not exactly.  Read is controlled by READ hazard lines.

This MAY occur on the same cycle even for the current issued instruction (hence why everything is a combinatorial block)


 The image in 11.3.1 seems to indicate the former.  Do you really have to keep a separate queue of unissued instructions?

Yes, this is the FIFO that is discussed on hw-dev occasionally. The buffer that allows 16/32 instr length to be dealt with.

 Bear in mind though that the 6600 is like the 68000 in that it has separate address registers, so you kinda have to remember that not everything described will fit RV exactly.


 What prevents you from putting them in the reservation station and marking them as "not issued"?

It would require an extremely comprehensive analysis to work out the consequences. Beyond my time and ability, however I am pretty confident that it would be counterproductive or turn out to be unnecessary.

My only advice then would be, "don't go there" :)


9. I am not really getting the placement of read reservations. If an instruction is being issued it places read and write reservations. 


I think of them as "blockers". Most people call them hazards.

Read hazards (blockers) are like the OTHER SIDE of the regfile which had the WRITE hazard.

So... in effect... read hazards are blockers on instructions BEFORE the current one. Basically the required reg value hasn't been written yet, plus maybe, yes, there are other FUs reading the regfile (not enough ports) so you can't read yet.

And write hazards are blockers which say, "if you go ahead with this write, it will be impossible to undo, or there is another instruction that has to write to the regfile BEFORE you do (in the same reg#), so don't do it! ..... Yet".

If the instruction unconditionally places a read reservation you are throwing away the temporal relationship? 


If the Q table is not respected: yes.

This took me a LOT of watching online videos to understand what the hell the Q Table does.

 It is quite amazing how simply keeping track of the last written register is sufficient to do register renaming.  However it is ONLY possible by respecting the read and write hazards and doing those FU Reservations.

Basically once the FU is reserved, it becomes a marker that represents a virtual (future) result. Just like the ROB# of Tomasulo.

Mess with that at peril.
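
One way to picture the "FU as a marker for a future result" idea is the lookup below. It is a sketch under assumptions: the Q-Table is a per-FU list of pending destination register numbers, at most one FU holds any given register at a time, and `resolve_source` is a hypothetical helper name.

```python
def resolve_source(q_table, src_reg):
    """Return ('fu', n) if FU n still owes the write to src_reg,
    otherwise ('regfile', src_reg)."""
    for fu, dest in enumerate(q_table):
        if dest == src_reg:
            return ("fu", fu)             # wait for that FU's (future) result
    return ("regfile", src_reg)           # value is already in the regfile

# hypothetical state: FU1 has been reserved to produce r7
q_table = [None, 7, None]
```

Here `("fu", 1)` plays the role that a ROB# plays in a Tomasulo reservation station.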

How do you maintain order on the read operands if you issue out of order (e.g. if you defer issuance of one instruction which would be vital for the correct dependency)? In general how do you keep a "temporal order"? 

Through those read and write hazards. Through the Q table.

That's it.

So by respecting the read and write hazards, this creates a linked list using a bitmatrix, out of the FUs.

A ROB is therefore no longer needed because the write dependencies - as bits in the FU Dep matrix - these ALREADY PRESERVE INSTRUCTION ORDER!
 
This is what proponents of Tomasulo fail to understand. They see no ROB Queue and think, "omg, the instruction order has been destroyed".

This is blatantly false.  Each instruction has a write dependency to block all future instructions from proceeding to the commit phase.  AT NO TIME can an instruction get COMMITTED out of order.

Done.

Simple.

Instr order preserved.

So this is really crucial to understanding.  The results can be GENERATED out of order, however the write hazards make ABSOLUTELY certain that they are COMMITTED in order.

Once that is understood, the beauty, simplicity and elegance of this design starts to fall into place.
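
The following sketch shows the simplest reading of the statement above: every newly issued instruction takes a write dependency on every FU that is still in flight, so results may complete in any order but commit strictly in issue order. It is an assumption-laden model, not the circuit from the chapters; `CommitOrder` and its methods are illustrative names.

```python
class CommitOrder:
    def __init__(self, num_fus):
        self.n = num_fus
        self.busy = [False] * num_fus
        self.done = [False] * num_fus
        # waits_on[i] is a bit-row: bit j set means FU i must not commit before FU j
        self.waits_on = [0] * num_fus

    def issue(self, fu):
        older = sum(1 << j for j in range(self.n) if self.busy[j])
        self.waits_on[fu] = older
        self.busy[fu] = True
        self.done[fu] = False

    def complete(self, fu):
        self.done[fu] = True              # result GENERATED, possibly out of order

    def can_commit(self, fu):
        return self.busy[fu] and self.done[fu] and self.waits_on[fu] == 0

    def commit(self, fu):
        assert self.can_commit(fu)
        self.busy[fu] = False
        for i in range(self.n):           # clear column `fu`: release younger FUs
            self.waits_on[i] &= ~(1 << fu)
```

If FU0, FU1 and FU2 are issued in that order and FU2 finishes first, `can_commit(2)` stays false until FU0 and then FU1 have committed. No queue is involved: the order lives entirely in the bits of `waits_on`.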


Maybe if I'd know the exact fields of the scoreboard that would help me understand it.

Q table. That is all.

This aspect *is* well understood from the patent and the academic literature. Almost any online video on youtube about scoreboards will do.

What the online material does NOT do is explain how write hazards can be successfully applied to LD/ST, branch speculation and exceptions.

The academic literature also fails to highlight the significance of the Dependency Matrix, focussing on the Q Table and glossing over the FU NxN bitlevel combinatorial block because Q table was what was in the 6600's primary patent.

l.

2019-04-26_14-28.png
2019-04-26_14-29.png

高野茂幸

Apr 26, 2019, 10:37:34 AM4/26/19
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Hi,

J Thornton-san’s approach is also known as a scoreboard; that term can be used for googling.
Also, the publications of Gurindar Sohi-san and J Smith-san at the University of Wisconsin at Madison are good material, especially for superscalar study.

Best,
S.Takano


On Fri, Apr 26, 2019 at 22:36 <lk...@lkcl.net> wrote:
--
You received this message because you are subscribed to the Google Groups "RISC-V HW Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hw-dev+un...@groups.riscv.org.
To post to this group, send email to hw-...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/hw-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/hw-dev/ce05d1d6-67f3-45fb-a852-8174424dbb83%40groups.riscv.org.

Dr Jonathan Kimmitt

Apr 26, 2019, 11:04:20 AM4/26/19
to lk...@lkcl.net, RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com

See comments inline Luke and best of luck with your implementation. I would recommend that you turn those schematics into a Verilog simulation model (it supports T-gates, which are called tranif0/1 devices. Depending on the usage it may result in so-called registered nets that have an impact on simulation speed).

so the reason why T-gates are not much used in UDSM designs is that it is difficult to represent their behaviour as a digital timing library due to the bi-directional nature and the lack of any gain in the pass transistors. It's not a problem if you are happy to verify your design at transistor level complete with layout parasitics. D-latches are widely used and in fact Ariane has a D-latch version of its register file. But they do compromise the performance of ATPG which is another automatic process which is tedious to perform manually.

If so I would recommend you to take a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.
So with the D-latches we have to conduct further analysis to check for pulses which could occur due to clock skew in the transparent period. The D-flipflop which is just two D-latches in series on opposite clocks makes that analysis easier. Again there is an impact on testability because you cannot capture the signal state in the transparent phase, so traditional fault simulation is needed to check testability.

Samuel Falvo II

Apr 26, 2019, 11:26:50 AM4/26/19
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Florian Zaruba, mitch...@aol.com, Dr Jonathan Kimmitt
On Fri, Apr 26, 2019 at 6:36 AM <lk...@lkcl.net> wrote:
(with thanks to Florian for kindly agreeing to publish what was formerly a private discussion, this may be of benefit to others in the future.  i'm also attaching two key screenshots, reproduced with kind permission from Mitch Alsup.  discussion edited slightly).

This whole thread is over my head at the moment, but thanks to everyone for making it available for eventual absorption.

--
Samuel A. Falvo II

Samuel Falvo II

Apr 26, 2019, 12:40:19 PM4/26/19
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Florian Zaruba, mitch...@aol.com, Dr Jonathan Kimmitt
On Fri, Apr 26, 2019 at 9:32 AM Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
... sounds all very obvious so far, right? :) now the tricky bit:

Unfortunately, no.  :)  Remember, my processor design experience starts with Forth/stack CPUs [1] and ends with 6502-style PLA instruction decoders[2].  I have a lot to learn.

[1]: https://github.com/KestrelComputer/kestrel/tree/master/cores/S16X4A

lk...@lkcl.net

Apr 27, 2019, 7:11:00 PM4/27/19
to RISC-V HW Dev
that's odd... neither my message nor mitch's made it through to hw-dev, even after 2 days.




Luke Kenneth Casson Leighton <lk...@lkcl.net>
Apr 26, 2019, 5:32 PM (2 days ago)
to Samuel, RISC-V, Florian, mitch...@aol.com, Dr




On Friday, April 26, 2019, Samuel Falvo II <sam....@gmail.com> wrote:

This whole thread is over my head at the moment, but thanks to everyone for making it available for eventual absorption.

There is I believe both a mental and a practical way to turn a single issue inorder design into a degenerate case of a scoreboard style OoO one:

* split out the pipeline and add a bank of incoming src latches and outgoing dest latches. The number of entries in the array/bank equals the depth of the pipeline
* add muxids to the src/dest latches so that results can be associated with the src ops properly on exit from the pipeline

* if there is operand forwarding built-in to the pipeline infrastructure, REMOVE it (and use a writethru regfile instead)

* if there is PRF ARF (physical / architectural regfiles), remove that too.

... sounds all very obvious so far, right? :) now the tricky bit:

* where previously there is "stalling due to LD/ST conflicts", split out the "read blocking" and "write blocking" and "commit now" wires, and route them instead to a matrix.

* where previously there is "stalling because the INT pipeline has to wait for a register result because the next instruction needs it", do the same thing.

* where the DIV FSM (if there was one) complicates the design and causes all sorts of stall messes all over the place, do the same thing.

... you see where that's going? Same thing for exceptions / interrupts, same thing for branches.

Now all you do is, add a Q Table, which is nothing more complex than a 1D array of latches containing a Register Number, and... errr... you're done.

Congratulations, an in-order design has been turned into a degenerate OoO one.

And, interestingly, much of the awkwardness associated with having to propagate pipeline "stall" mechanisms throughout the design, all those are gone, and are now managed cleanly in one place.

[Note, there was, is, and shall be NO MENTION of "speculation" in the above. Speculation is NOT a hard absolute design requirement in an OoO design, it is just an optimisation that happens to be really easy to add *to* an OoO design. This appears to be a very common misconception because the performance gains are so high, and it is so easy to add, that nobody leaves it out. Still, correlation != causation and all...]

---

Now all you do if you want some parallelism, is, crank up the repeat button on the number of pipelines, extend the size of the FU Matrix to match, and add some more ports on the regfile to cope.

Making it multi issue and keeping it a precise engine is a little trickier, basis: keep a count of the number of in-flight instructions not yet committed.

For each FU not yet committed, subtract the total number committed in one cycle from ALL FUs not yet committed.

On next cycle, if any FU has a count less than the number PERMITTED to commit, it MAY commit.

It is not quite that simple, some FUs may try to write to the same regnum, so there is a little bit of futzing to sort out there, which, again, that "commit count" can be used to work out who has priority.
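
The counting scheme can be sketched as below. This is one interpretation only: `counts` holds, per FU, the number of older in-flight instructions not yet committed; `width` is the number permitted to commit per cycle; and the rule of stopping at the first FU that is not ready is an added assumption to keep the sketch precise. Same-regnum write conflicts (the "futzing" mentioned above) are not modelled, and all names are hypothetical.

```python
def commit_cycle(counts, ready, width):
    """Commit up to `width` instructions, oldest first.

    counts: dict fu -> number of older in-flight, uncommitted instructions
    ready:  set of FUs whose results are available
    Returns the list of FUs committed; updates `counts` in place.
    """
    committed = []
    for fu in sorted(counts, key=counts.get):
        # assumption: stop at the first FU that is out of range or not ready
        if counts[fu] >= width or fu not in ready:
            break
        committed.append(fu)
    for fu in committed:
        del counts[fu]
    for fu in counts:
        # subtract the number committed this cycle from ALL remaining FUs
        counts[fu] -= len(committed)
    return committed

counts = {"alu0": 0, "mul": 1, "alu1": 2, "div": 3}
ready = {"alu0", "mul", "div"}
```

With `width=2`, the first call commits `alu0` and `mul` and leaves `alu1` at count 0 and `div` at count 1; `div` then has to wait for `alu1` even though its own result is ready.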

---

So, summary: keeping single issue and yet using Dependency Matrices actually simplifies an inorder design, Q Table provides reg renaming "for free", and as long as speculation and multi issue are not attempted to be added as well, it *remains* simple.

lk...@lkcl.net

Apr 27, 2019, 7:13:31 PM4/27/19
to RISC-V HW Dev
again, mitch's response, which can clearly be seen to be sent to hw-dev, is nowhere to be seen on groups.google.com hw-dev archives.  forwarding.

Mitchalsup

Apr 26, 2019, 10:33 PM (2 days ago)
to jrrk2, lkcl, hw-dev, zarubaf


Mitch Alsup


There are two points to be made here:

1) when using an array of latches, one can build the scan path at the boundary of the array using the std read-write gates. This is effectively how one tests SRAMs, and how one should test latch based RFs.

2) when using latches as data capture points in a pipeline one needs to verify 2 points.
2.a) the latch goes closed before the inbound data crosses out of its hold time stable points
2.b) nobody (NOBODY) is "looking" at any data from any latch that is transparent.

Do this and the latch based pipeline is race free.

The SB does this inherently.

If so I would recommend you to take a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.
So with the D-latches we have to conduct further analysis to check for pulses which could occur due to clock skew in the transparent period. The D-flipflop which is just two D-latches in series on opposite clocks makes that analysis easier. Again there is an impact on testability because you cannot capture the signal state in the transparent phase, so traditional fault simulation is needed to check testability.

The latch timing is created in the SB, timed by the pickers in the SB, broadcast from the SB, and results in another latch somewhere transitioning from open->closed. Given property 2 above one can prove the timing. (or one can eat the area and use flip-flops.)

lk...@lkcl.net

May 1, 2019, 9:27:17 AM5/1/19
to RISC-V HW Dev

The latch timing is created in the SB, timed by the pickers in the SB, broadcast from the SB, and results in another latch somewhere transitioning from open->closed. Given property 2 above one can prove the timing. (or one can eat the area and use flip-flops.)

chomp.

an idea occurred to me: if D-latches are serving the same purpose as write-through SRAMs, and SRAMs are a well-understood "thing" that has an accepted verification process, then would it be reasonable to substitute single-address SRAMs *for* D-latches?

l.

Dr Jonathan Kimmitt

May 1, 2019, 9:47:19 AM5/1/19
to lk...@lkcl.net, RISC-V HW Dev

A write-through RAM is an optimised bit-cell (equivalent to a D-latch) surrounded by sense amplifiers. Logically it is equivalent to a latch, but it would be in no way competitive with a latch in area or power consumption. Also in UDSM technologies RAMs may require a threshold margin optimisation control as well as optional BIST circuitry, which I guess is the part that is interesting for you.


Dan Petrisko

May 1, 2019, 10:29:10 AM5/1/19
to lk...@lkcl.net, RISC-V HW Dev
Hi Luke --

There are a few reasons you would not want to use a single-address SRAM, mostly for physical design.

There are two types of SRAMs: asynchronous SRAMs, which output their result in the same cycle as requests, and synchronous SRAMs which output their result in the cycle after the request.  In order to insert a 'hardened', or highly circuit-optimized, SRAM into your design, one uses a memory compiler provided by the foundry of choice. Asynchronous SRAMs are generally synthesized out of flip-flops (or latches if they are write-through). A typical memory compiler will just refuse to give you a single-address SRAM, which is a non-starter. If they do support it, it will most likely be generated as a latch surrounded by SRAM-y addressing overhead (and possibly mandatory BIST circuitry).

So, you might think to convert your design to use synchronous SRAMs.  SRAMs become more and more area efficient the larger they are; conversely, tiny SRAMs are very area-inefficient.  The control logic starts to overtake the data storage at (rule of thumb) 1 kB.  You might think to combine your "SRAM D-latches" into a larger SRAM with multiple simultaneous reads/writes.  However, the area of an SRAM scales with the number of ports squared.  So anything more than a few ports becomes untenable.
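
To put a rough number on the port-count rule of thumb, the sketch below simply squares the total port count. The function name and the single-port baseline are assumptions made for this example; real memory-compiler output will differ.

```python
def relative_area(read_ports, write_ports, baseline_ports=1):
    """Area relative to a `baseline_ports`-port array of the same capacity,
    using only the 'area scales with ports squared' rule of thumb."""
    ports = read_ports + write_ports
    return (ports / baseline_ports) ** 2

small = relative_area(3, 1)    # a 3R1W array
large = relative_area(12, 4)   # a 12R4W array
```

Under this rule a 12R4W array costs about 16 times the area of a 3R1W array holding the same number of bits (`large / small`).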

For very small, simple structures, logic elements are vastly more efficient than SRAMs.

Best,
Dan Petrisko



lk...@lkcl.net

May 1, 2019, 7:40:27 PM5/1/19
to RISC-V HW Dev, lk...@lkcl.net


On Wednesday, May 1, 2019 at 3:29:10 PM UTC+1, Dan Petrisko wrote:

For very small, simple structures, logic elements are vastly more efficient than SRAMs.

dan (and jonathan), thank you for the insights.  i was wondering about the address-less (1-element) SRAMs, whether to use them (at all) given that a 6600-like design has the register "address" already decoded into mutually-exclusive 1-bit (unary) N-long arrays (N=num(regs)).

adding a binary re-encoder just to get back to "standard" SRAM addressing when the SRAM has to break that binary encoding *back out to unary* seems... well... redundant.

initially i thought it would be ok to put down a suite of address-less SRAMs, however from what you're saying, that's inadvisable.

l.


Dr Jonathan Kimmitt

May 2, 2019, 4:18:36 AM5/2/19
to lk...@lkcl.net, RISC-V HW Dev

SRAMs will have a word line which goes to all relevant content (i.e. the corresponding bits in the word), but if N is less than some magic, but quite large threshold then using an optimised bit cell is so not worth it. Now the more interesting question is what size of register file would be needed to justify a dedicated register file IP. This will of course depend on the number of simultaneous ports in use. A large number of ports will degrade timing because of the number of word lines hanging off each cell. For more details refer to (for example) Chapter 9 of "Fundamentals of Modern VLSI Devices" by Yuan Taur and Tak H Ning. My understanding is that you are planning to have a register file with a large number of ports.


lk...@lkcl.net

May 2, 2019, 7:17:24 AM5/2/19
to RISC-V HW Dev, lk...@lkcl.net



ok, so digression from the topic at hand, this becomes specific to the Libre RISC-V CPU/VPU/GPU, not the 6600 scoreboard in general.

we're planning:

* 4x 64-entry stratified banks of
* 32-bit-wide 3R1W write-thru with
* byte-level write-enable lines and
* separate operand-forwarding bypass with "Q-Table history / name-restoring" (nothing to do with the regfile itself)

yes, really, 32-bit-wide register files on an RV64GC/SV system, yes, really, byte-level write-enable.

this gives a total of 128 64-bit registers, where *PAIRS* of banks are required to be read/written in order to provide 64-bit read/write.

due to the individual byte-level write-enable lines, 8 and 16-bit SIMD may be carried out *WITHOUT* needing an extra read cycle.
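
A behavioural sketch of the banking described above, with loud caveats: the mapping of a 64-bit register number onto a bank pair (even registers on banks 0/1, odd registers on banks 2/3) is purely a guess made for illustration, ports are not modelled, and `BankedRegfile` is a hypothetical name. It does show how byte-level write-enables let an 8 or 16-bit result be written without first reading the rest of the register.

```python
class BankedRegfile:
    BANKS, ENTRIES = 4, 64                # 4 banks x 64 entries x 32 bits

    def __init__(self):
        self.banks = [[0] * self.ENTRIES for _ in range(self.BANKS)]

    @staticmethod
    def _locate(reg):
        # hypothetical mapping: even regs use banks 0/1, odd regs use banks 2/3
        lo_bank = 2 * (reg % 2)
        return lo_bank, lo_bank + 1, reg // 2

    def read64(self, reg):
        lo, hi, entry = self._locate(reg)
        return self.banks[lo][entry] | (self.banks[hi][entry] << 32)

    def write64(self, reg, value, byte_en=0xFF):
        """Write only the bytes whose bit is set in the 8-bit `byte_en` mask."""
        lo, hi, entry = self._locate(reg)
        for bank, shift, en in ((lo, 0, byte_en & 0xF), (hi, 32, byte_en >> 4)):
            word = (value >> shift) & 0xFFFFFFFF
            mask = 0
            for b in range(4):
                if en & (1 << b):
                    mask |= 0xFF << (8 * b)
            old = self.banks[bank][entry]
            self.banks[bank][entry] = (old & ~mask) | (word & mask)
```

For example, `write64(5, 0xAA00, byte_en=0b00000010)` replaces only byte 1 of r5 and leaves the other seven bytes as they were.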

it is extremely weird.  it means that 64-bit operations may be (up to) dual-issue, yet 32-bit vectorised operations may be (up to) quad-issue.

it does however mean that we do not have insane 10R3W or 12R4W register file porting, yet have really quite decent theoretical maximum performance, and solve one of the most intractable problems of computer science: shared SIMD regfile problems.

the dynamic "name restoring" - precise name-restoring provided by the Q-Table "history" innovation - will allow us to detect operand forwarding opportunities that are normally only done by extremely advanced processors, making the operand forwarding bus much more important than it normally would be, reducing the *need* for significant register porting.

other innovations include having separate Function Units (crucially, with their own Reservation Stations) for 8 and 16-bit operations of non-power-of-two length, that raise *HIERARCHICAL* hazards on the register(s) that the 8/16-bit operation is embedded in, in an upstream cascade.

this is *only* possible by augmenting the 6600 scoreboard system and it represents a significant innovation in its own right, solving what has otherwise been a completely intractable problem associated with SIMD (SIMD on top of standard regfiles, that is), for many many decades.

the hierarchical cascading generation of hazards basically says, "if you need to reserve byte 3 of register r4, please ensure that you also reserve the *WORD* that is used *AND* reserve the *DWORD* of register r4 as well".

it's really that simple [and only possible with a scoreboard].

clearly the read-reservation (read hazard) is on the whole of the register, even if only a part of it is required, whereas the write hazard can go down to 32-bit, 16-bit and 8-bit fragments.

with that structure in mind, the subdivision into 4 *32-bit* banks, plus byte-level write-enable lines, starts to make a bit more sense.
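the upstream cascade described above can be sketched in a few lines of python.  this is an illustration only: the function name `cascade`, the tuple layout, and the inclusion of a 16-bit level between byte and word are assumptions, not the actual scoreboard logic.

```python
# Fragment sizes in bytes: byte, halfword, word, dword (assumed levels).
LEVELS = (1, 2, 4, 8)

def cascade(reg, byte_offset, size=1):
    """Return every (reg, size_in_bytes, fragment_index) to reserve.

    Reserving a fragment also reserves each larger fragment that
    contains it, up to the whole 64-bit register.
    """
    assert size in LEVELS
    assert byte_offset % size == 0 and byte_offset + size <= 8
    return [(reg, s, byte_offset // s) for s in LEVELS if s >= size]

# "byte 3 of register r4": the byte itself, then the enclosing
# halfword, word and dword of r4.
hazards = cascade(4, 3)
```

a 32-bit write to the upper half of r4, `cascade(4, 4, 4)`, would reserve only that word and the dword above it.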

l.

lk...@lkcl.net

May 3, 2019, 8:52:17 PM5/3/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
florian, hi,

re-reading mitch's book chapters (again...) i recall that you asked: what is the "Issue" signal? is it "really instruction issue" or... what?  the answer is: there is a *global* "issue" flag that is only ASSERTed when there is no Write-after-Write Hazard and the relevant FU for the current instruction is not in the "busy" state.  that global flag then creates a suite of "Issue_{insert_FU_name}" signals, each one AND-enabled from the global "issue" flag with pattern-matching from the instruction decode on a per-opcode basis.

in subsequent chapters where you then see Function Unit Dependency Cells with the word "Issue" associated with them, i *believe* that the word "Issue" is (unfortunately) an abbreviation for the appropriate "Issue_{Insert_FU_name}" signal associated with that particular FU-related Cell.

i didn't previously spot the name-reuse, apologies.
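as a rough illustration of the global-versus-per-FU distinction, here is a small python sketch.  it assumes the decoder has already picked exactly one target FU; `issue_signals`, `decoded_fu` and the FU names are invented for the example.

```python
def issue_signals(decoded_fu, busy, waw_hazard):
    """Model the global "issue" flag and the per-FU Issue_{FU} signals.

    decoded_fu : name of the FU selected by instruction decode
    busy       : dict mapping FU name -> True if that FU is busy
    waw_hazard : True if a Write-after-Write hazard was detected
    """
    # Global issue: no WaW hazard and the targeted FU is not busy.
    global_issue = (not waw_hazard) and (not busy[decoded_fu])
    # Per-FU issue: global issue ANDed with "decode matched this FU".
    per_fu = {fu: global_issue and (fu == decoded_fu) for fu in busy}
    return global_issue, per_fu

busy = {"ALU": False, "DIV": True, "LD": False}
ok, per_fu = issue_signals("ALU", busy, waw_hazard=False)
```

with the values above only Issue_ALU is raised; targeting the busy DIV unit, or flagging a WaW hazard, leaves every per-FU signal low.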

l.

2019-05-04_01-38.png

lk...@lkcl.net

May 3, 2019, 9:30:40 PM5/3/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
just got some additional insights from mitch. btw "Design of a Computer, Thornton", is here:
http://ygdes.com/CDC/DesignOfAComputer_CDC6600.pdf
https://archive.org/details/cdc.6600.thornton.design_of_a_computer_the_control_data_6600.1970.102630394_201802
------------------


In the CDC 6600:: The Global issue signal causes the instruction fetch/decode pipeline to advance.
The Global issue signal ANDED with the <empty> FU decode (image) causes the
issued instruction to be latched in the Function Unit and in the Computation Unit.


Also note: in the CDC 6600, due to the use of latches, if the instruction is not issued,
the fetch/decode pipeline advances but recognizes that the instruction was not issued.
The advanced instruction remains undecoded, and a mux redecodes the advanced
instruction until issue is successful.
See Thornton page 122.

lk...@lkcl.net

May 4, 2019, 12:57:52 AM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
On Sat, May 4, 2019 at 2:41 AM Mitchalsup <mitch...@aol.com> wrote:
> Note: I was describing how CDC 6600 did it.

noticed.  in the 2nd book chapter, where you describe Concurrent Pipeline Units (an array of latches designated "Function Units" that happen to feed to the same pipeline / ALU), "busy" does not exactly have the same meaning.

however, i believe i am correct in thinking that for a FSM-based div unit (for example), "busy" *would* have the same meaning as in the 6600, i.e. once the FSM was active, it's really important not to "issue" more instructions to that "FU".

whereas with a Concurrent Unit, there's (for example) 4 "Function Units" (say) on the same pipeline, and as long as it never stalls and as long as it is 4 stages or less, *at least one* of those 4 Function Units (entry-points to the same pipeline) is never going to give a "busy" signal.

so "busy" has the same meaning... only an ordinary (ALU) pipeline never *will* be busy... all quite odd :)
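a toy python model of that claim, under the assumption that one operation is issued per cycle and that a Function Unit latch stays busy for exactly `stages` cycles after issue (the function name and the timing model are mine, not from the book chapters):

```python
def min_free_fus(num_fus=4, stages=4, cycles=32):
    """Return the fewest free FUs seen in any cycle.

    Models `num_fus` Function Unit latches feeding one non-stalling
    pipeline; an FU issued at cycle t is free again at cycle t + stages.
    """
    release = [0] * num_fus          # cycle at which each FU frees up
    fewest = num_fus
    for t in range(cycles):
        free = [i for i in range(num_fus) if release[i] <= t]
        fewest = min(fewest, len(free))
        if free:                     # issue to the first free FU
            release[free[0]] = t + stages
    return fewest
```

with 4 FUs and a 4-stage pipeline the model always finds at least one free FU; make the pipeline 5 stages deep and a cycle appears in which all 4 are busy.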


> There may (hint: MAY) be better ways with today's technology (clocked flip-flops,

finding it very awkward to create an SRLatch, almost certainly going to have to do a flip-flop.

> Verilog design)

nmigen!  python!  modern OO programming!  classes n random stuff that's actually human-readable!

i'll be updating the document strings (and comments etc.) if/as appropriate.

l.

lk...@lkcl.net

May 4, 2019, 12:49:45 PM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
On Sat, May 4, 2019 at 2:48 PM Mitchalsup <mitch...@aol.com> wrote:

> > nmigen!  python!  modern OO programming!  classes n random stuff that's actually human-readable!
> Some of us find gate level schematics readable.

(attached) - yep, it's one of the reasons i'm happy with nmigen.  one of its options is to generate yosys "intermediate language" (ilang) files, which are barely above gate-level (cell "block" level to be more precise).

running various optimisation and transformation commands will get you even further down the rabbit-hole, getting closer to an actual ASIC netlist whilst losing information such as human-readable net names as a side-effect.

attached "graphviz" screenshots (yosys "show" command) are at the "still-readable-phase", for the FU-Reg Dependency Cell and also the FU-FU Cell, showing that issue input (issue_i) is gating op1, op2 and dest, as well as wr_pending and rd_pending

go_read and go_write are connected to the "reset" side of the sr-latches.  clock is *not* included in any of these... because it's used synchronously inside the sr-latch cell instead.  so... not exactly the original, yet close enough.

-----

unfortunately, it's not perfect: there's no control over how graphviz automatically lays out the connections.  it is however the closest and most convenient way to get to an approximation of a full gate-level design without having to write any software to do so, whilst also being able to stick to an actual HDL.

the general development technique of writing code, saving, running a tool and viewing its output visually is one that is, i feel, significantly underappreciated, yet anyone familiar with texstudio and openscad uses it without realising quite how useful the technique is.

currently i am training the team "if graph not readable or otherwise understandable, it's a bug".  that way the whole design gets split down to manageable, reviewable chunks.

l.

2019-05-04_17-13.png
2019-05-04_17-14.png

高野茂幸

May 4, 2019, 8:53:15 PM5/4/19
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Hi Luke-san,

Off topic:
I had doubted whether open source tools can offer maintainability and serviceability.
But, wait: if a company shuts down its work, who maintains its tools? Users face the same problem.
A competitor will carry on, but prices will rise because there are fewer competitors.
Compared with a company keeping a closed code set, an open software tool keeps the opportunity for someone to take over after the owner leaves, which is healthy.

I use chisel3.0 for my project. It generates so many intermediate wires that trace-checking the generated code takes an unnecessary amount of time. I will try using nmigen; thank you for the info.

Even now, using commercial tools is considered common sense. That is fine while the vendor is operating, because it has a responsibility to its users under the license. Ordinary users do not think about earthquakes as they go about each day. This is why it is difficult to explain a shift to open source, especially to companies that are potential users.

Best,
S.Takano

On Sun, May 5, 2019 at 1:49, <lk...@lkcl.net> wrote:
--
You received this message because you are subscribed to the Google Groups "RISC-V HW Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hw-dev+un...@groups.riscv.org.
To post to this group, send email to hw-...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/hw-dev/.

lk...@lkcl.net

May 4, 2019, 10:24:06 PM5/4/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Sunday, May 5, 2019 at 1:53:15 AM UTC+1, adaptiveprocessor wrote:
> Hi Luke-san,
>
> Off topic:
> Compared with a company keeping a closed code set, an open software tool keeps the opportunity for someone to take over after the owner leaves, which is healthy.


indeed.
 
> I use chisel3.0 for my project. It generates so many intermediate wires that trace-checking the generated code takes an unnecessary amount of time.

nmigen is not perfect: some auto-generated intermediary wires do exist, however as long as you keep the module to a reasonable size, it is possible to read the auto-generated code and still follow it.

however... i honestly found that outputting to yosys "ilang" format, then using yosys "read_ilang {filename.il}; show top" was far more productive and useful than reading the actual (auto-generated) verilog.
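for anyone wanting to script that step, a small python sketch that only *builds* the yosys command line.  the yosys commands are the ones quoted above (`read_ilang`, `show`); the helper name and the example filename are hypothetical, and the command names may differ between yosys versions.

```python
import shlex

def yosys_show_command(il_file, top="top"):
    """Build the argv for: yosys -p "read_ilang <file>; show <top>"."""
    script = f"read_ilang {il_file}; show {top}"
    return ["yosys", "-p", script]

cmd = yosys_show_command("dependence_cell.il")   # hypothetical filename
# Print a copy-and-paste-able shell line; pass `cmd` to subprocess.run()
# instead if yosys is installed and you want to launch it directly.
print(" ".join(shlex.quote(part) for part in cmd))
```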

 
> I will try using nmigen; thank you for the info.

please do not think that it is a perfect solution by any means: our team chose it "on balance".

* code readability was considered extremely important [i *genuinely* cannot understand scala code, and i have been programming for 40 years]

* python, by contrast, is now in the top 3 world-wide programming languages.

* using python to *generate* Verilog has huge advantages: it brings the OO capabilities of the entire python world to hardware design.

* MyHDL is great... except it actually transforms python *syntax* into verilog.  hence, if there are concepts that do not exist in verilog, they are *NOT* possible to do in MyHDL.  class-based OO for example is an absolute nuisance in MyHDL.

* pyrtl (another python-based HDL) just does not have the same (large) community.  litex (by enjoy-digital.fr), minerva (an rv32 core in nmigen), these are all not small code bases.


the caveats:

* if you do not know what you are doing in python, you can get yourself into an awful, awful mess.

* you absolutely MUST follow best standard SOFTWARE development practices for python.  pep8, docstrings, unit tests, and more.

* unit tests are ABSOLUTELY ESSENTIAL.  if you cannot accept this (think you know better, "i'm a good programmer, i don't need to write unit tests"), you will take ten times as long as anyone else... that's if you succeed at all.

* i have to be diplomatic here: the lead developer of nmigen is brilliant and conscientious in the design of nmigen, focussing on maintainability and stability of the API.  the regression test suite for example is exemplary and a *really* good example of how a really, really good python project should be developed.  at the same time... it has to be pointed out that, sadly, they are fundamentally disrespectful of standard python development practices that were long-established well before nmigen existed... *and blame python for it*

this latter has made it extremely difficult for our team to fully deploy python OO techniques in the development of our codebase, to the point where we may actually have to fork nmigen in order to deal with some of the issues.

the majority of nmigen users are *hardware* engineers, *NOT* experienced python *SOFTWARE* engineers, and, as a result, those engineers are being taught *extremely* bad python development practices because this is often literally their first major python project (massive predominant use of python wildcard imports, for example, has been a "No" in the python world for 2 decades)

perhaps if there was a larger adoption of nmigen by experienced python software engineers, there would be more people speaking up.  whitequark _does_ listen to reason [in some instances].


> Even now, using commercial tools is considered common sense. That is fine while the vendor is operating, because it has a responsibility to its users under the license. Ordinary users do not think about earthquakes as they go about each day. This is why it is difficult to explain a shift to open source, especially to companies that are potential users.


yes.  this has been a common recurring theme for a good couple of decades in the "business" world.  i first heard words like this back in 1999.  they are an excuse.  there is nothing that can be done, you just have to let those businesses succeed / fail based on market competition.

l.

高野茂幸

May 5, 2019, 2:33:17 AM5/5/19
to lk...@lkcl.net, RISC-V HW Dev, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
Dear Luke-san,

No problem because my recent languages are Japanese, English, Verilog, and Python :)
And my background is processor development from specification decision to back annotation.

Thank you for your detailed advice.

Anyway, is there a homepage listing high-level-language-to-Verilog translation tools, both closed and open source?

Thanks and Regards,
S.Takano


On Sun, May 5, 2019 at 11:24, <lk...@lkcl.net> wrote:

Samuel Falvo II

May 5, 2019, 11:44:27 AM5/5/19
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Dr Jonathan Kimmitt, mitch...@aol.com, Florian Zaruba
On Sat, May 4, 2019 at 7:24 PM <lk...@lkcl.net> wrote:
> however... i honestly found that outputting to yosys "ilang" format, then using yosys "read_ilang {filename.il}; show top" was far more productive and useful than reading the actual (auto-generated) verilog.

I actually had to do this just last night to discover what I consider
to be a bug, but I'm sure is just a misunderstanding on my part.

If I have nmigen code like this:

ctr = Signal(COUNTER_WIDTH, reset=DEFAULT_VALUE)
m.d.sync += output_port.eq(ctr)

without further qualification or use of `ctr`, then it seems that
nmigen will yield a proper, constant-valued register which simulates
properly in Verilator, but which will be treated as an *anonymous
module input* when used with formal verification. This latter
distinction is important, because the formal prover software will
freely fuzz inputs during k-induction proofs unless somehow
constrained with Assume statements. But, in this particular context,
assumptions would be the wrong thing; I'm *specifying explicitly* what
I want in the HDL itself. So, how to fix this?

The work-around for this, I've found, is easy to apply: you just force
the register to equal itself, like so:

ctr = Signal(COUNTER_WIDTH, reset=DEFAULT_VALUE)
m.d.sync += [
    output_port.eq(ctr),
    ctr.eq(ctr),    # self-assignment: marks ctr as driven, so it stays a register
]

This is enough to trick the ilang backend into realizing that ctr is,
in fact, driving *something*, and so is a real register, and not an
input port.

However, I *never* would have found this were it not for looking at
the generated schematic. Because, in the schematic, if ctr appears in
an octagon, you *know* it's treated as an I/O port.

DISCLAIMER: You probably won't run into this edge case because most
designs don't instantiate fixed-value registers like this. I,
obviously, intend on replacing this with more complex logic later on.
But, since I'm very early in my development cycle, I don't yet have
everything I need to do so, I "stub" out a portion of my circuit with
a hard-wired register, like this.

Still, it meant the difference between formal properties never being
met versus formal properties which properly match my happy-path unit
test behavior.

> * unit tests are ABSOLUTELY ESSENTIAL. if you cannot accept this (think you know better, "i'm a good programmer, i don't need to write unit tests"), you will take ten times as long as anyone else... that's if you succeed at all.

If I can add something here:

Not only are unit tests essential, the whole test-driven development
process would, in my opinion, be essential. It's far too easy to
write code which doesn't have adequate coverage otherwise. *More* of
my time yesterday was invested into researching why my formal
properties *didn't* fail in the expected way than I spent in making
them pass. I found at least as many bugs, if not more, this way than
just making the bare minimum amount of tests pass.

Also, if your tools support it, I'd advocate using formal methods in
conjunction with unit tests. Contrary to what many think, I find they
*complement* each other. Formal is great for proving fundamental
properties of a circuit, while unit testing is great at proving more
sophisticated or stateful interactions. The same unit test framework
is also used for integration testing. I approach this by
following this crude procedure:

1. Write a unit test for some feature I desire. Make sure this unit
test FAILS. If it doesn't fail, you either duplicated work somewhere,
or some other assumption about your code is wrong. Re-evaluate your
code and requirements to make sure they agree with each other.

2. Write just enough nmigen code to make the test pass.

At this point, I have a circuit that fulfills my requirements so far.
However, this doesn't mean that the circuit is *correct*. It just
means that it (perhaps by accident) fulfills my documented
requirements.

3. Write formal properties which cover the new behavior you just
tested. These will be a subset of the formal properties of the
circuit. And, in some sense, you're duplicating work. But it's
essential this be done, because it also helps constrain the formal
prover's search space.

4. Write additional formal properties that cover various edge cases.
For example, in my TileLink code, I always separate address from data
phases in time, so a_valid and d_ready are never asserted at the same
time. I also have properties to ensure proper sequencing as well, so
that d_ready always asserts *after* a_valid negates, or that
a_valid and d_ready are both negated after reset. Etc. Here's the
biggest pay-off from using formal verification in conjunction with
unit tests. It not only saves you from having to write a ton of
negative unit tests, it does a far better job of finding the edge
cases for you than you ever could (think of k-induction proofs as
guided input fuzz testing). This leaves your unit tests to (more or
less) happy-path only code, which is always much easier to read and
use as examples (in the sense of, "How do I do ...?") for
documentation later on.

5. Commit your code with full confidence that it meets your
understanding of the requirements at that time, then repeat as
necessary. Each iteration of this loop will result in a design being
refined over time until it finally meets the product requirements as a
whole. NOTE: You might even find cases where the requirements are
inconsistent, and will need to alter them before proceeding. But,
then, that's software engineering in a nutshell, and probably all of
engineering in general anyway.
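To make the split between steps 1-2 and steps 3-4 concrete, here is a small pure-Python sketch. It is only a stand-in: the three-state FSM is hypothetical (it is not the TileLink code mentioned above), and the brute-force loop merely imitates what a bounded formal check does by trying every input sequence up to a fixed depth.

```python
from itertools import product

IDLE, ADDR, DATA = range(3)

def step(state, start, ack):
    """Hypothetical bus-master FSM: address phase, then data phase."""
    if state == IDLE:
        return ADDR if start else IDLE
    if state == ADDR:
        return DATA
    return IDLE if ack else DATA          # state == DATA

def outputs(state):
    return {"a_valid": state == ADDR, "d_ready": state == DATA}

def happy_path():
    """Unit-test style: one hand-picked stimulus sequence."""
    state, trace = IDLE, []
    for start, ack in [(1, 0), (0, 0), (0, 1)]:
        state = step(state, start, ack)
        trace.append(state)
    return trace

def check_invariant(depth=6):
    """Bounded brute force: a_valid and d_ready never high together."""
    for seq in product(product((0, 1), repeat=2), repeat=depth):
        state = IDLE
        for start, ack in seq:
            state = step(state, start, ack)
            out = outputs(state)
            if out["a_valid"] and out["d_ready"]:
                return False
    return True
```

The happy-path function documents the intended use in three lines, while the invariant check covers all 4096 input sequences of length six without a single hand-written negative test.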

lk...@lkcl.net

May 6, 2019, 2:00:46 AM5/6/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Sunday, May 5, 2019 at 4:44:27 PM UTC+1, Samuel Falvo II wrote:
On Sat, May 4, 2019 at 7:24 PM <lk...@lkcl.net> wrote:
> however... i honestly found that outputting to yosys "ilang" format, then using yosys "read_ilang {filename.il}; show top" was far more productive and useful than reading the actual (auto-generated) verilog.

> I actually had to do this just last night to discover what I consider
> to be a bug, but I'm sure is just a misunderstanding on my part.
>
> If I have nmigen code like this:
>
>     ctr = Signal(COUNTER_WIDTH, reset=DEFAULT_VALUE)
>     m.d.sync += output_port.eq(ctr)

it kiinda makes sense if you're aware that nmigen auto-detects which signals are inputs, which are outputs, and which are registers.  i agree that an explicit marking would be helpful.

looking at the code (grep -r register nmigen/*) it's clearer what's going on: the code detects if the signal was ever used in a sync.  if "yes", it's a register.  it seems to be that simple.

however... really, this should be raised as a bug, for a full investigation.
 
> However, I *never* would have found this were it not for looking at
> the generated schematic.  Because, in the schematic, if ctr appears in
> an octagon, you *know* it's treated as an I/O port.

really? oh :)  i guess i sort-of inferred that (from looking at a *lot* of diagrams), good to have some words.
 

> a_valid and d_ready are both negated after reset.  Etc.  Here's the
> biggest pay-off from using formal verification in conjunction with
> unit tests.  It not only saves you from having to write a ton of
> negative unit tests, it does a far better job of finding the edge
> cases for you than you ever could (think of k-induction proofs as
> guided input fuzz testing).  This leaves your unit tests to (more or
> less) happy-path only code, which is always much easier to read and
> use as examples (in the sense of, "How do I do ...?") for
> documentation later on.


that's very interesting and insightful, and something that hadn't occurred to me before, and that's from 25 years of software engineering.

i *thought* i was applying full software engineering practices to this [hardware] project, having originally detested unit tests i now know from a lot of [bad] experience that they're essential.  what had *never occurred to me* is:

(1) quite how much formal proofs could make unit tests that much less of a pain in the neck and

(2) forget hardware for a minute: why the _hell_ is the significance of formal proofs not more widely known in the *software* engineering unit test world??

oink [1]

so, thank you samuel.

l.

[1] "oink".  a word i encountered from the comedy author Tom Holt.  it's a good word.  it's a "wtf, does not compute" moment.

lk...@lkcl.net

May 6, 2019, 2:08:20 AM5/6/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
mitch, hi,

just a clarification, see attached modified version of the augmented Load Function Unit (11.4.8, page 32).  the original contains an AND gate on the "Set" to an SR-Latch, so it's:

* Set: INT/FP Regfile Select AND Issue
* Reset: Issue

that means that there's the possibility of both Set and Reset being asserted simultaneously (give-or-take a nanosecond, through the AND gate).  in random online (highly reliable of course) searches describing SR Latches, i came across mention that NOR-based SR Latches reeally don't like to have both set and reset asserted, they go unstable.

would it be better to have two SR Latches, one for FP and one for INT, and, given that the selection of the regfile to use is mutually exclusive, use a single wire i.e. have a NOT gate to select one or other of the INT/FP?

l.
2019-05-06_06-37.png

lk...@lkcl.net

May 6, 2019, 9:13:41 PM5/6/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
mitch, hi,

very useful to have this clarification.  unfortunately i made the mistake of not saying exactly which diagram i was referring to, which caused you to (very usefully!) describe the INT Function unit (page 31), whereas the (part-)diagram that i posted was of the Load Function Unit (page 33).

the LOAD Function unit *does* have (and selects from) FP and INT (only one at a time, obviously), where the diagram assumes (reasonably, this being RISC) that the FP regnums and INT regnums from the instruction are in the same bitspace.

so from the diagram as-is, my concern was that the regfile selector (marked "FILE" in the attached) would cause SR-Latch instability when "File" is ASSERTED.  another potential way to alleviate those concerns i believe would be with a NOT gate and a 2nd AND gate? (attached small similarly badly-drawn diagram).

apologies for the confusion.

l.

On Mon, May 6, 2019 at 6:08 PM Mitchalsup <mitch...@aol.com> wrote:

> Let me clarify:: What you see in the drawing is one function unit and in this case we can see on the
> right hand side that this is the INT function unit. So the Issue, GO_Read and Go_Write should really
> be Issue_INT, Go_Read_INT, and Go_Write_INT. The FPU will have a similar one covering all the
> FP registers. {And of course: Busy_INT.} You don't commingle register data-flow INT<->FP.
>
> Secondly: ISSUE_INT, Go_Read_INT, and Go_Write_INT are unary assertions.
>
> So: Go_Write_INT clears the "I have an Instruction" SR-flop enabling a ISSUE_INT in some future
> cycle.
>
> Issue_INT: sets the "I have an Instruction" SR-flop which drives the Busy_INT signal.
>
> Go_Read_INT: sets the "Operands have arrived" SR-Flop and signals Writeable
>
>  Go_Write_INT clears the "I have an Instruction" SR-flop enabling a ISSUE_INT in some future
> cycle. And the cycle repeats.
>
> Note: This is "the" degenerate Function unit that knows each of its instructions are 1 cycle.
>
> The Go_Read_X and Go_Write_X are asserted by Pickers (FF1s) on clock boundaries.
>


2019-05-07_02-03.png
2019-05-07_02-09.png

lk...@lkcl.net

May 7, 2019, 4:31:07 AM5/7/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Tuesday, May 7, 2019 at 2:13:41 AM UTC+1, lk...@lkcl.net wrote:
 
> the LOAD Function unit *does* have (and selects from) FP and INT (only one at a time, obviously), where the diagram assumes (reasonably, this being RISC) that the FP regnums and INT regnums from the instruction are in the same bitspace.
>
> so from the diagram as-is, my concern was that the regfile selector (marked "FILE" in the attached) would cause SR-Latch instability when "File" is ASSERTED.  another potential way to alleviate those concerns i believe would be with a NOT gate and a 2nd AND gate? (attached small similarly badly-drawn diagram).

accidentally encountered the answer (page 20, section 10.4.8), from examining the section on the 6600 register file:

"Those skilled in the art of logic design..."

[...clearly not me...]

"... will notice that this is not a typical D-Type
flip-flop (as is the norm in modern times). This flip-flop is being reset and set
simultaneously, driving both NOR gate outputs low. The signal on the set gate persists
for two gate delays longer than it persists on the reset input, so if there is a 1 to be written
into the cell, it will be captured by the set-clear flip-flop. It persists for two gate delays
because the AND gate is a NAND gate followed by an inverter."

soOoo... the instability that would normally occur by driving S and R simultaneously does not occur.... because there is a 2-gate latency in the AND gate.  with the "Set" wire simultaneously going into *both* the Set *and* through the 2 gates of AND, it is a physical impossibility for Set and Reset to be asserted at precisely and exactly the same time.

that's.... horrendously obscure, efficient and elegant, all at the same time.
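the behaviour in the quoted passage can be reproduced with a crude unit-delay model in python.  assumptions: every NOR gate has exactly one unit of delay, the waveforms are sampled once per gate delay, and the function names are invented.  a real latch released on both inputs at once would race or go metastable rather than oscillate cleanly as this model does.

```python
def nor(a, b):
    return int(not (a or b))

def sr_latch_trace(s_wave, r_wave, q=0, qb=1):
    """Unit-delay model of a cross-coupled NOR SR latch.

    s_wave / r_wave: samples of Set and Reset, one per gate delay.
    Each NOR output at step t+1 depends on its inputs at step t.
    Returns the list of (q, qb) after every step.
    """
    trace = []
    for s, r in zip(s_wave, r_wave):
        q, qb = nor(r, qb), nor(s, q)
        trace.append((q, qb))
    return trace

# Set outlasts Reset by two gate delays (the case in the quoted text).
skewed = sr_latch_trace(s_wave=[1, 1, 1, 1, 0, 0, 0, 0],
                        r_wave=[1, 1, 0, 0, 0, 0, 0, 0])

# Set and Reset released in the same step (the case to avoid).
together = sr_latch_trace(s_wave=[1, 1, 0, 0, 0, 0],
                          r_wave=[1, 1, 0, 0, 0, 0])
```

in `skewed` both outputs are low while S and R are asserted together, and the latch then settles with q=1; in `together` the model never settles.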

l.

Jacob Lifshay

May 7, 2019, 4:33:49 AM5/7/19
to Luke Kenneth Casson Leighton, RISC-V HW Dev, Dr Jonathan Kimmitt, mitch...@aol.com, zar...@iis.ee.ethz.ch
Sounds exactly like a nightmare for a modern HDL toolchain

lk...@lkcl.net

May 7, 2019, 5:23:21 AM5/7/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
On Tue, May 7, 2019 at 9:40 AM Dr Jonathan Kimmitt <jr...@cam.ac.uk> wrote:
>
> This design style is deprecated nowadays, along with valves and uni-junction 
> transistors. 

 [you heard the story about why the Russian Army still uses valves in their tanks, right?  everyone thinks it's because they're resistant to EM spikes from a nuclear attack.  the real answer is much more mundane and hilarious: supply logistics.  it would be cheaper for Russia to design an entirely new tank than it would be to maintain the bureaucracy required to track the conversion of the entire Russian tank fleet... :) ]

> The reason being that such techniques are not scalable to smaller technologies, 
> which is/was the surest way to get a performance boost until the 4GHz wall was hit.
> However if any timing dependence can be isolated within a standard cell, 
> it is relatively easy to characterise whether the pulse width or other criteria
>  is met within a post route timing analysis engine. But if someone forgets to 
> check the obscure requirement in the next cost-reduction iteration, you are in trouble. 
> Because 75% or more of delays are in routing, it is highly desirable to have an easy
> way to scale gate sizes to meet design rules without impacting the overall timing flow.

i understand a tiiny bit about this, now, after talking with Jean-Paul from lip6.fr (alliance / coriolis2 maintainer).  he explained that the cell libraries are (to a reasonable approximation of "scalable")... scalable.  design one cell, run a script, it creates near-as-damnit the same thing, independent of geometry.

what surprised me was that the scalability of cells (given a good enough editor) is even possible *at all*.  i honestly thought that there would be something that remained a fixed size.  JP confirmed that, no, everything (near-as-damnit) scales linearly.

it does have to be pointed out that the size of the 6600 was... well... it filled a room, and the gates were hand-soldered 3-leg transistors on PCBs around (i think...) the 6-8 inch range? and plugged into backplane buses where a ring bus travelled round the entire room.  tricks like using the gate delay latency were.... ok in their context.

l.

lk...@lkcl.net

May 7, 2019, 5:25:06 AM5/7/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
let's not use this, eh? :)

does anyone (dr kimmitt?) have a recommendation for a modern replacement for this AND-gate plus SR Latch?

l.

lk...@lkcl.net

May 7, 2019, 6:56:15 AM5/7/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
On Tue, May 7, 2019 at 10:34 AM Dr Jonathan Kimmitt <jr...@cam.ac.uk> wrote:
>
> The reasons why things were done in a certain way in the
> past to minimise gate count are mostly invalid now. 
> But if you can capture the essence of the best ideas in a formal
> executable specification, this can be maintained forever as it is purely
> based on ideal mathematics and not on pragmatic gate count or
> electrical performance considerations. 
> Translating that specification into a micro-architecture is where the
> arguments begin about the best methodology, and the best performance.

 i would find it hard to justify, to our sponsors (mainly NLnet) to use such an indirect development technique... however we have a compromise that is reasonably acceptable and happens to actually generate useable HDL.

 the approach that is currently underway is:
 
* to use nmigen to write the code in an easily-reviewable and understandable format (python being the top 30%+ most used programming language, now)

* to write *formal mathematical proofs* - again in nmigen - on every single one of the components, not just the overall engine

* to use yosys (symbiyosys) to run those formal mathematical proofs through BMC as well as k-induction engines

we are just negotiating with a second sponsor (a Chinese ODM, believe it or not) for sufficient funds to pay an extremely experienced mathematician and programmer from Cambridge, who by sheer luck and happy coincidence happens to have patents in the area of... formal mathematical proofs!

based on what samuel described a few days ago [an epiphany moment for me], his time and cost can easily be justified (the Chinese ODM would like us to accelerate our timescales somewhat), on the basis that formal proofs catch fundamental errors and save vast amounts of time by covering in short order what would otherwise require dozens of potentially haphazard and incomplete unit tests that *don't actually do the full job*.

this kind of thing - a formal proof of an out-of-order parallel execution engine - being precisely the kind of really exciting challenge that someone with a brain the size of a planet would really really enjoy tackling.


> For example I would always maintain that an FPGA should be used to 
> prove the micro-architecture, because of the huge number of cycles 
> that can be executed compared to a simulator.

 and you never know, in a simulation, if it properly emulates gate-level timings...

> But somebody else might say SPICE should be used as it gets 
> the best performance out of every individual transistor.

if this design is relying on something that needs transistor-level simulation, i'm doing something wrong.

l.

lkcl

May 7, 2019, 6:50:00 PM5/7/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
On Tue, May 7, 2019 at 4:50 PM Mitchalsup <mitch...@aol.com> wrote:

> What we know is that we have a Load or a Store being issued, and that there is a signal (FILE)
> that is going to tell us if it is FP (FILE=0) or INT (FILE=1). The memory reference has to target
> one file or the other. The logic I show will assert INT after 2 gates of delay or FP after 1 gate of
> delay (in the SR-flop). Your added logic does not harm, but I don't see it adding good, either.
>
> Also Note: FILE is only used if ISSUE is asserted.

 yes, it's asserting the selector AND gates through the other SRLatch
(the one with Go_Write in it).


On Tuesday, May 7, 2019 at 9:31:07 AM UTC+1, lkcl wrote:
soOoo... the instability that would normally occur by driving S and R simultaneously does not occur.... because there is a 2-gate latency in the AND gate.  with the "Set" wire simultaneously going into *both* the Set *and* through the 2 gates of AND, it is a physical impossibility for Set and Reset to be asserted at precisely and exactly the same time.

that's.... horrendously obscure, efficient and elegant, all at the same time.


from Mitch:

That is how gate logic works, and back in the day we had to be efficient with them, 
we invented and learned all these tricks. Today, you can slather gates around as if 
they have low cost, we did not have that luxury.

The NOR-NOR SR-flops have the property that if both inputs are asserted, both outputs are low.
Prior to ISSUE, and after Go_Write, Busy is low and thus the register decoder is 32'0, so there
are no assertions.
No assertions prior, no assertions during, one assertion afterwards seems pretty easy to verify.
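
The NOR-NOR property Mitch describes can be checked with a small gate-level model. This is a teaching sketch only: the two gates are evaluated one after the other as a crude stand-in for gate delay, and the function names are made up here.

```python
def nor(a, b):
    return int(not (a or b))

def settle_sr_nor(s, r, q, qn, max_iters=8):
    """Iterate a cross-coupled NOR pair until the outputs stop changing.

    q  = NOR(r, qn)
    qn = NOR(s, q)
    Returns the settled (q, qn) pair."""
    for _ in range(max_iters):
        new_q = nor(r, qn)
        new_qn = nor(s, new_q)
        if (new_q, new_qn) == (q, qn):
            break
        q, qn = new_q, new_qn
    return q, qn
```

With S and R both asserted, `settle_sr_nor(1, 1, q, qn)` settles to `(0, 0)` from either starting state, which is the "both outputs are low" behaviour described above.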

lkcl

May 8, 2019, 10:56:31 PM5/8/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch

Hi Mitch,

Ok so I have implemented the gate diagrams to what I believe is a reasonably accurate degree, connected them up with a simple 2 op combinatorial ALU, and some preliminary instructions actually execute and do something (something right, that is)

What I found is as follows:

* the combinatorial nature of the continuous scoreboard causes the simulator to freak out and lock up under certain circumstances.

* this turns out to be related to how SR Latches can't correctly be simulated by nmigen (can't be identified as register-like, nmigen relying on combinatorial blocks being directed acyclic graphs connected only through - separated by - synchronous blocks using DFFs)

* the removal of write-through capability on the register file stops one of the loops (and interestingly still allows ALU operations to proceed combinatorially on the same clock)

* a second loop was tracked down to the "readable" vectors that go into the Issue Unit. this detected when the destination register is also one of the source registers. confusion arose initially as the write thru was *also* causing lockup.

* allowing the "readable" vector to be sync-delayed by 1 clock cuts one of the cyclic points in the combinatorial loop.

* a third point has yet to be tracked down, shown up by throwing random instructions at the engine. i believe it is when a destination reg is used as both a src and dest in a subsequent instruction, this still has to be confirmed and tracked down.
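
The loops listed above are cycles in the graph of combinatorial connections. A minimal sketch of how such a loop could be hunted down outside the simulator, using a depth-first search; the signal names are hypothetical and only loosely follow the write-through loop described in the list.

```python
def find_comb_loop(edges):
    """Return one combinatorial cycle as a list of node names, or None.

    `edges` maps each signal to the signals it drives combinatorially."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in edges.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:                      # back edge: loop found
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in list(edges):
        if colour.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None

# Hypothetical signal names for illustration.
looped = {
    "regfile_read": ["alu"],
    "alu": ["regfile_write"],
    "regfile_write": ["regfile_read"],   # write-through closes the loop
}
# Removing write-through (or registering that path) drops the back edge.
cut = {
    "regfile_read": ["alu"],
    "alu": ["regfile_write"],
    "regfile_write": [],
}
```

`find_comb_loop(looped)` reports the cycle through the write-through path, while `find_comb_loop(cut)` returns None, mirroring the effect of removing write-through or adding a one-clock sync point.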

The exercise has allowed me to identify that I had not noticed in all prior readings that the latter stage augmentations, the Continuous Function Units which produce RD/WR vectors, and associated Issue Unit, actually *replace* the 6600 FU - FU Matrix and the secondary FU - Regs Matrix (aka an unary variant of the 1D 6600 Q Table).

This being justifiable by a significant reduction in gate count (16k gates down from 48k).

The issue that I have is, in the continuous (lower gate count) version, I can't identify what constitutes the Q Table (or a variant of it), and that is a crucial thing that provided register renaming, which as you know I plan to extend with historical entries and thus implement precise register renaming.

Can't do that if I can't find it :)

What's the scoop, here?

L.

lkcl

May 9, 2019, 10:21:04 PM5/9/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Thu, May 9, 2019 at 5:05 PM Mitchalsup <mitch...@aol.com> wrote:

> > * this turns out to be related to how SR Latches can't correctly be simulated
> > by nmigen (can't be identified as register-like, nmigen relying on combinatorial
> > blocks being directed acyclic graphs connected only through - separated by -
> > synchronous blocks using DFFs)
>
> You can replace the SR-flops with JK-flops; or D-flops since the logic is so simple.

 i quite like the SR-flops, enough to have pseudo-implemented them as a hybrid sync/combinatorial... something.  if S, an internal register is set and a combinatorial-output set.  else if R, the internal register is cleared and the output also cleared.  else, the register is combinatorially output.  it seems to work well, and yosys, fascinatingly, turns it into a DFF with 2 MUX gates.
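
The hybrid behaviour just described can be modelled in plain Python (this is not the nmigen source; the class and method names are assumptions for illustration): the output follows S/R combinatorially in the same cycle, while the stored bit only changes on the clock.

```python
class HybridSRLatch:
    """Plain-Python model of a hybrid sync/combinatorial SR latch.

    `output()` is the combinatorial path, `tick()` is the clock edge."""

    def __init__(self):
        self.reg = 0

    def output(self, s, r):
        if s:                 # S wins: output set immediately
            return 1
        if r:                 # else R: output cleared immediately
            return 0
        return self.reg       # else: pass the stored bit through

    def tick(self, s, r):
        if s:
            self.reg = 1
        elif r:
            self.reg = 0
```

Read this way, the structure is a stored bit plus two selections on the output, which fits the observation that yosys reduces it to a DFF with two MUXes.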

 i'll see how it goes with a JK.

 i've managed to get actual repeatable reliable output for any arbitrary input, by putting in strategic sync's.  however there's a couple of caveats:

 * instruction issue has to wait, and can only occur every 3rd clock cycle (whoops), this presumably because the sync-points (which were mostly on the transfer of the read-write vectors) take time to propagate

 * i've not yet added in an ALU that requires more than 1 cycle to complete.  i'll do that today (a pseudo-MUL with deliberate extra delay), to see what happens.


> > * the removal of write-through capability on the register file stops one of the
> > loops (and interestingly still allows ALU operations to proceed combinatorially 
> > on the same clock)

> A layer of forwarding should alleviate this.

 will see how that works out

> > The issue that I have is, in the continuous (lower gate count) version, I can't identify
> > what constitutes the Q Table (or a variant of it), 

> XBA is the logic that connects "I read this register" with "I write this register" in a RAW sense.
> The write enables that FU to assert Readable.
>
> Q is the logic that connects "I write this register" with "I read this register" such that all prior
> reads can occur before the write (WAR). 

ok, so, checking i got that right: this is the *alternative* version (named Continuous Scoreboard, so not the original 6600):

attached annotated diagram (p31, section 11.4.8), the global vector(s) come in (these are the ORed input from all other Function Units to combine to create a global "someone Reads" and global "someone Writes"), and are ANDed with the *local* opposite.

 * local FU read ANDs with *global* FU write vector to create RAW for *THIS* FU, resulting in "Readable" signal for *THIS* FU
 * local FU write ANDs with *global* FU read vector to create WAR for *THIS* FU ditto Writeable
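
The two bullet points above amount to a pair of bit-vector ANDs. A minimal sketch, assuming one bit per register and assuming the global vectors do not include this FU's own contribution; the polarity of the final Readable/Writable signals is deliberately left out, and all names are made up here.

```python
def unary(reg):
    """Binary register number -> one-hot ("unary") bit vector."""
    return 1 << reg

def hazards(src_regs, dest_reg, global_rd_pend, global_wr_pend):
    """Return (raw_bits, war_bits) for one Function Unit.

    raw_bits: registers this FU reads that some other FU will write.
    war_bits: the register this FU writes that some other FU will read."""
    local_rd = 0
    for r in src_regs:
        local_rd |= unary(r)
    local_wr = unary(dest_reg)
    raw_bits = local_rd & global_wr_pend     # local read AND global write
    war_bits = local_wr & global_rd_pend     # local write AND global read
    return raw_bits, war_bits
```

For example, an FU reading r1 and r2 and writing r3, while another FU has a write pending on r2 and a read pending on r3, sees a RAW bit on r2 and a WAR bit on r3.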

so... Q's not in the FUs themselves...

> Thus, "I Read this register" is used to deny the
> <attempting> writer of the newer value to that register. So Q is where you find the Readables
> NANDing the writables.

... the signals *from* the FUs go elsewhere to make up Q.

ok, so let's trace it through.  the Readables and Writables all go to the "Priority Picker", to create one and only one Go_Read and Go_Write [this being the way to ensure that, at the Register File, there's no contention for Register File ports.  want more instructions to be executed simultaneously? add more ports and make sure that a corresponding "Priority Picker" is also added.. *and* more Function Units to feed it].
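
A sketch of the "one and only one" selection: given a vector of requests (one bit per Function Unit), grant a single bit. Lowest-numbered-first is an assumption made here for simplicity; the thread does not say which priority order the actual picker uses.

```python
def priority_pick(requests):
    """Grant exactly one request: the lowest-numbered set bit.

    The result is one-hot, or zero if nobody is asking."""
    return requests & -requests
```

Feeding the Readable vector through this gives a one-hot Go_Read, and likewise Writable gives Go_Write, so a single register-file port is never claimed by two units at once.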

so, (see 2nd diagram, p34) the FUs (after routing through Priority Picker(s)) have the Go_Read and Go_Write vectors connected directly to the Register File, which is *no longer an SRAM*, it's an array of DFFs with an array of individual REn/WEn lines... that happen to be exactly and precisely... the Go_Read / Go_Write vectors.

in this 2nd diagram, what was formerly two separate matrices, one FU-FU and one FU-Reg, they're now *combined*, replaced with an array of FUs that create FU RD/Write "vectors" (so it is still effectively an FU-Reg Matrix).

this would *tend* to suggest that Q is inside the (new) FU... except that it's not.

what i believe i am looking for is the vertical unary array of flip-flops (one per register), which are illustrated in section 10.5 p24, those being the equivalent of Q table entries (see 3rd attachment, "To a great first order this arrangement is equivalent to the [FU-Reg] Dependency Matrix")

this unary array per FU (or, in the case of the original 6600, binary-representations aka "just the Destination Reg #") is i believe *missing* from the Continuous Scoreboard Function Units described in section 11.4.8.

annotated in red is a latch on the incoming "Dest Reg #" which i *believe*, if added in (along with associated circuitry for detecting when it should be set / reset), would give the register-renaming capability.

honestly though i am slightly lost, and may have to go back to the dual Matrices version (independent FU-FU and FU-Reg), which would be quite annoying as it would mean having to work out how to add the "Shadowing" and a few other things besides.

ngggh! :)

l.
2019-05-10_01-59.png
2019-05-10_02-32.png
2019-05-10_03-05.png
2019-05-10_03-09.png

lkcl

May 11, 2019, 3:27:36 AM5/11/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch


On Fri, May 10, 2019 at 10:52 PM Mitchalsup <mitch...@aol.com> wrote:
 
ok, so let's trace it through.  the Readables <0:31> and Writables  <0:31> all go to the multiple "Priority Pickers", to create one and only one Go_Read and Go_Write per picker (there are 8 pickers) 

which in the original 6600 doesn't mean 16R8W, because there's 3 separate regfiles. found the page... p70 Figure 49 in Thornton, "Data Trunks".  hmmm, one of the banks was 2R2W, the other two were 2R1W.  not my primary focus, however.

> > [this being the way to ensure that, at the Register File, there's no contention for Register File ports.
> yes
> > want more instructions to be executed simultaneously? add more ports and make sure that a corresponding "Priority Picker" is also added.. *and* more Function Units to feed it].
> right

this will be the key behind how vectorisation will work in the Libre RISCV SoC.  the issue engine will be multi-issue and will throw several elements per clock at as many Function Units (with associated regfile ports) as are required to get the desired performance.

to ensure that the number of Function Units does not get completely out of hand (with associated port-proliferation), we will transparently group sequentially-numbered elements into SIMD batches.

SIMD at the back-end, Vectorisation at the front-end.
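
As a very rough sketch of the grouping idea (the real batching rules are not described in this thread, so the function and its parameters are assumptions), sequentially-numbered elements can be chunked into fixed-width batches, each of which would go to one SIMD-capable Function Unit:

```python
def simd_batches(first_reg, vector_length, simd_width):
    """Split a run of sequentially-numbered elements into SIMD batches.

    Returns a list of lists of element (register) numbers; the last
    batch may be shorter than `simd_width`."""
    elements = list(range(first_reg, first_reg + vector_length))
    return [elements[i:i + simd_width]
            for i in range(0, len(elements), simd_width)]
```

A 6-element vector starting at r8 with 4-wide SIMD would then need two Function Units rather than six.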


> > so, (see 2nd diagram, p34) the FUs (after routing through Priority Picker(s)) have the Go_Read and Go_Write vectors connected directly to the Register File, which is *no longer an SRAM*, it's an array of DFFs with an array of individual REn/WEn lines... that happen to be exactly and precisely... the Go_Read / Go_Write vectors.
> Yep, everything is timed from the pickers.

> > in this 2nd diagram, what was formerly two separate matrices, one FU-FU and one FU-Reg, they're now *combined*, replaced with an array of FUs that create FU RD/Write "vectors" (so it is still effectively an FU-Reg Matrix).
> probably, I haven't given this any thought for over a decade.

:)
 

> > this would *tend* to suggest that Q is inside the (new) FU... except that it's not.

> > what i believe i am looking for is the vertical unary array of flip-flops (one per register), which are illustrated in section 10.5 p24, those being the equivalent of Q table entries (see 3rd attachment, "To a great first order this arrangement is equivalent to the [FU-Reg] Dependency Matrix")
> This should be the middle figure in chapter 10 on page 25 it is amalgamating the write waits of the figure above.

for benefit of other readers: that's simply a couple of big OR gates (per row), taking in (each of) the read and write dependencies (which are on registers), and outputting a per-function-unit "FU Read Dependencies" signal and a per-function-unit "FU Write Dependencies" signal.

thus, logically, the Function Unit may *start* when the read dependencies are gone (the output of its big read OR gate goes low), and it may *commit* (to the regfile) only when the write dependencies are gone (the output of its big write OR gate goes low).

[correction: thanks to the priority-pickers, be given the *opportunity* to start and the *opportunity* to commit]
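
In software terms, the "big OR gate per row" is an any-bit-set test on each Function Unit's row of dependency bits. A small sketch with hypothetical values (one row per FU, one column per register):

```python
def fu_may_start(read_dep_row):
    """True when the big read-OR for this FU outputs low (no read deps)."""
    return not any(read_dep_row)

def fu_may_commit(write_dep_row):
    """True when the big write-OR for this FU outputs low (no write deps)."""
    return not any(write_dep_row)

# Hypothetical dependency bits for two Function Units.
read_deps = [
    [0, 0, 0, 0],   # FU0: nothing outstanding, may be offered a start
    [0, 1, 0, 0],   # FU1: still waiting on register 1
]
```

As the correction above notes, a True result only makes the FU eligible; the priority picker still decides which eligible FU actually goes.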


> > this unary array per FU (or, in the case of the original 6600, binary-representations aka "just the Destination Reg #") is i believe *missing* from the Continuous Scoreboard Function Units described in section 11.4.8.
> When I did this (a decade ago) I argued this point until I realized it was already being covered by existing logic. I can't put my finger on exactly where. But you WILL stumble across it.

i have a vague feeling that i initially concurred with this assessment.  the similarity between the Function Unit's Go_Read and Go_Write latches, and those of the FU-Regs, is compelling:

* they both have the same (binary-to-unary) register decoding
* they both have ANDing on the (same) register vectors
* they both have (identical) big OR gates on their respective output.

the key differences being:

* the FU Go_Read and Go_Write latches do not take in the clock, where the FU-Reg Dependency cell does
* the Function Unit uses only the one Go_Read latch for both src operands, ORing the results together
* where the Function Unit has the decoded instruction (register #s) come in combinatorially, the FU-Reg Dependency cell *latches* the input (on the clock-plus-issue plus src1/src2/dest respectively), where the Function Unit *LOSES* that information once the instruction (or the "issue" signal) goes away.



> > annotated in red is a latch on the incoming "Dest Reg #" which i *believe*, if added in (along with associated circuitry for detecting when it should be set / reset), would give the register-renaming capability.
> Yes.


then i'll go through it.
 
> > honestly though i am slightly lost, and may have to go back to the dual Matrices version (independent FU-FU and FU-Reg), which would be quite annoying as it would mean having to work out how to add the "Shadowing" and a few other things besides.
> Don't give up, you are almost there.


appreciated :)

it occurred to me overnight that the whole thing works because of the three-way revolving door between issue, read and write [those three being impossible to assert all at the same time].

any one Function Unit may respond to "issue" on one cycle, then it may respond to "Go_Read" on another subsequent cycle, then "Go_Write" on another.  there *may* be clock delays in between.

however... i believe this critically relies on there being clock-synchronised latches in the FU-FU and/or FU-Reg Dep-Matrix to capture the src/dest Reg#s (in unary or binary form, it doesn't matter which), at instruction issue time.

that these are missing from the control machine on p38 sect 11.4.8 is prooobably why i had to do one issue every 3 clock cycles (and add some sync-latches to pass the information through).  one clock for issue, one for read, one for write.
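
The issue / read / write "revolving door" can be pictured as a three-state machine per Function Unit. This is a toy model with invented names, not the gate-level design: it only shows that a single FU, even with every input held high, needs three clocks before it is free again.

```python
IDLE, ISSUED, READ_DONE = "idle", "issued", "read_done"

class FunctionUnitFSM:
    """Toy model: only one of issue / go_read / go_write is acted on per clock."""

    def __init__(self):
        self.state = IDLE

    @property
    def busy(self):
        return self.state != IDLE

    def clock(self, issue=False, go_read=False, go_write=False):
        if self.state == IDLE and issue:
            self.state = ISSUED
        elif self.state == ISSUED and go_read:
            self.state = READ_DONE
        elif self.state == READ_DONE and go_write:
            self.state = IDLE

def cycles_until_free(fu):
    """Clock the FU with every input held high and count cycles to idle."""
    cycles = 0
    while True:
        fu.clock(issue=True, go_read=True, go_write=True)
        cycles += 1
        if not fu.busy:
            return cycles
```

`cycles_until_free(FunctionUnitFSM())` returns 3: one clock for issue, one for read, one for write.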


with apologies for the large resolution/size of the attached, it shows the FU-Reg Dependency Cell side-by-side with the Function Unit.

* latched-registers are in red on the Function Unit dest/oper1/oper2 Register #s.  these are latched in on "Issue" (assume registers are clock-latched).

* given that FU-Reg Dependency Cells are a horizontal row (each column having a set of Latches per unary-decoded register), the addition of latched-registers to each Function Unit is directly equivalent to a *row* of FU-Reg Dependency Cells (just with unary latches rather than binary registers).

* red arrows show where the corresponding AND gates (assuming a *row* of FU-Reg Dep Cells) result in the direct-equivalent of that (new) latch-register, for each of dest/oper1/oper2.

* the green arrows show the corresponding AND gates which take the (now latched) dest/oper1/oper2 *after* each (latched) register has been decoded from binary to unary.  thus, where the (black-highlighted) FU-RegDep Cell shows one AND gate (one for each register), the Function Unit shows *all* those AND gates.  multiply the FU-Reg Dep cell into a row, and equivalence with the Function Unit diagram is achieved with respect to the green-arrowed gates

[caveat below]

* the cyan arrow is the OR of the dest1/dest2, prior to going into big-OR-thing previously mentioned.


the caveats are:

* the green and cyan arrows (going back to the FU-Reg Dep Cell) ANDs the *UNLATCHED* dest/oper1/oper2 to determine the Readable, Writable flags and the Read_Pending and Write_Pending register vectors.

* thus, the Readable flag and Int_Read_Pending vectors generated by this Function Unit are supposed to DROP (de-assert) when the Go_Read flag is RAISED (asserted), likewise something similar for Writable etc., all of which starts to get a bit hairy and melts my brain as far as trying to keep the original functionality *and* add in merged Q-Table capability.

all of which is prrrooobably why Cray and Thornton originally kept the Q-Table separate.

on balance, then, my feeling is that the safest strategy would be to use the FU-FU and FU-Dep Matrices from section 11.4.7, in combination with the *augmented* Function Unit (with branch shadows added, from p55).

i still haven't yet determined if the FU *and* the *two* Dep Matrices are needed: given that the (augmented) FU from 11.4.8 is near-identical to the 6600 variant (10.4.5), i'm inclined to gravitate towards "yes" for now.

will see how it all goes.

l.
2019-05-11_06-00.png

lkcl

May 13, 2019, 5:09:22 AM5/13/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
On Sat, May 11, 2019 at 10:48 PM Mitchalsup <mitch...@aol.com> wrote:
> > it occurred to me overnight that the whole thing works because of the three-way revolving door between issue, read and write [those three being impossible to assert all at the same time].
> This is the key, three latches in a row to prevent a run-ahead situation.

and, any given FU doesn't care if it's "slow" (a FSM) because there are *multiple* of them.  because there are multiple of them, the issue unit can throw out one instruction per clock to each of these FSM-like-FUs.

bearing in mind also that *multiple* FUs may be the front-end for *one* pipelined ALU, such that the instructions issued on each clock actually end up in the *same* ALU's pipeline.

in effect, an FU serves the exact same purpose as a single-row'd Tomasulo Reservation Station.


> > any one Function Unit may respond to "issue" on one cycle, then it may respond to "Go_Read" on another subsequent cycle, then "Go_Write" on another.  there *may* be clock delays in between.
> Right, notice the reuse of a FU has a minimum of a 3 cycle delay.

(1) thus my idea of creating a simplified example which only has 2 FUs, and trying to issue them with instructions without checking whether it's safe to do so is a non-starter.

whoops :)

(2) those 3 clock delays are equivalent to a standard single-issue pipeline's "operation decode/fetch, operand fetch, result store" phases (leaving out the actual ALU computation in the middle) so it's all ok, no performance penalties involved.

 
> > that these are missing from the control machine on p38 sect 11.4.8 is prooobably why i had to do one issue every 3 clock cycles (and add some sync-latches to pass the information through).  one clock for issue, one for read, one for write.
> If you are trying to re-use a single FU then you are correct.

i'm not trying to re-use a given FU.  i've missed something.

> If you are not trying to re-use a single FU there should not be a problem in issuing on back to back decode cycles.

yes.  i think i'm going to change the example, add 3 FUs (or even 4) and see if it works then.
 
 

> > with apologies for the large resolution/size of the attached, it shows the FU-Reg Dependency Cell side-by-side with the Function Unit.
> The only request I have WRT size is to do them in *.jpg

ok, cando.

> You may be missing a piece of perspective. Imagine that the FUs are close to the issue section of
> the DECODER, and imagine that the CUs are a bus distance away from the issue section. The FUs are clocked at one edge, the CUs are clocked at the successive edge allowing the BUS to be 1
> clock long.

ahh this explains why in Thornton, "major" and "minor" cycle are mentioned.  i'm noticing, there's a *lot* of loops, here.  part of the difficulty is clearly down to synchronisation.  i read a bit about this in Thornton: time-synchronisation is mentioned as being crucial.
 

> So the FUs are driving SB logic while the CU are driving computational logic. The FUs always running 1 clock ahead of the corresponding CU.

okaay.  and the FUs issue "busy" signals that loop back to the issue unit, to tell it not to try to allocate that FU a second time.
  

> As long as you don't get a WAW and you don't run out of FUs, you should be able to issue to your heart's content.


i ran out of FUs :)
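
Running out of FUs is easy to reproduce in a toy model. The sketch below assumes each FU is unavailable for 3 clocks from the cycle it is issued (per the minimum re-use delay mentioned above) and ignores register hazards entirely; the function name and parameters are invented for illustration.

```python
def simulate_issue(num_fus, num_instructions, busy_cycles=3):
    """Return the number of stall cycles for a simple in-order issue unit.

    Each FU reports busy for `busy_cycles` clocks from the cycle it is
    issued, after which it can be allocated again."""
    free_at = [0] * num_fus      # first cycle each FU can accept an issue
    cycle = 0
    stalls = 0
    issued = 0
    while issued < num_instructions:
        for fu in range(num_fus):
            if free_at[fu] <= cycle:
                free_at[fu] = cycle + busy_cycles
                issued += 1
                break
        else:
            stalls += 1          # every FU reported busy: issue must wait
        cycle += 1
    return stalls
```

With 2 FUs the issue unit stalls on every third clock, while with 3 or more FUs it can issue back to back, which is consistent with the advice to move the example to 3 or 4 FUs.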
 
> > * latched-registers are in red on the Function Unit dest/oper1/oper2 Register #s.  these are latched in on "Issue" (assume registers are clock-latched).
> Maybe I am missing something::
> I was using the decoder enables to enable/disable the assertions onto the local XXX_Pending busses which get ORed onto the Global versions.

i've found that there's a loop, there, as well, which you illustrate on p8/9 of chap10 (images attached for benefit of other readers).  because of the loop, that's definitely going to require a clock-sync.
 
> Once a Go_Read_FU is asserted, the read reservation are removed, locally first, then transitively to global.

so are you saying that the FUs can *generate* read/write pending combinatorially, that the ORing can take place combinatorially, however that they need to be sent *back* into the FUs (and also global-write-pending into the Issue Unit) on a one-clock delay?

trying that out, here, it seems to work

> Once Go_Write is asserted, the same happens with the Write_Pending signals.

noted.
 
> These decoder enables were driven directly from the SR-flops directly from the GO_XXX pickers.
> So as long as the FU is not re-used faster than 3 cycles, there is time for all this to transpire.

i still have some debugging to do before i can confirm this.
 

> So, are you seeing that Global_INT_Read_Pending is longer than 2 gate delays after INT_Read_Pending? And the same for *Write*?

one.  see attached gtkwave diagram, int_rd_pend_o / int_wr_pend_o are the integer unit read/write pending outputs respectively; g_int_rd_pend_i / g_int_wr_pend_i are the global read/write inputs respectively, and they're just the one clock apart.


> Let me ask you a few questions::
> 1) How long after Issue_INT does INT_Read_Pending get asserted?

immediately (same cycle)
 
> 2) How long after Issue_INT does INT_Write_Pending get asserted?

again, immediately.
 
> 3) How long after Global_INT_Read_Pending does Readable get asserted?

never.  which doesn't seem right, at all.
 
> 4) How long after Go_Read does INT_Read_Pending get deasserted?

go_rd seems to be stuck HI indefinitely....
 
> 5) How long after Go_Read does Request_Release get asserted?

2 cycles.
 

> 6) And are the pickers at the end of a clock cycle so that the output of the pickers drives the whole next clock from a read Flip-Flop?


the intpicker output appears to be nonsense, and the 2nd FU appears to have both readable and writable asserted.  that'll be a bug i need to track down.
 
> It might be appropriate at this time for you to draw your own schematic.....

i'm using yosys "show" to generate graphviz output, that i then (try to) simplify.  it's messy (auto-generated) yet i am counting on it to be accurate, where my drawings might not be.  and would take time.

attached (laaarge) jpg, i've identified several of the loops.

i also found the section describing what the gate delays should be (10.4.4, p14).

more later when i've investigated, thanks for the hints.

l.
2019-05-13_07-40.jpg
2019-05-13_08-33.jpg
2019-05-13_08-33_1.jpg
2019-05-13_08-58.jpg

lkcl

May 13, 2019, 8:33:50 AM5/13/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
hang on hang on... p14, section 10.4.4, there's some words saying this about the X-Function Unit:
"Thus a typical Function Unit contains but 59 gates plus 9-bits of instruction storage."

so what i suspected (for about a week) that some latches/storage on the src and dest register numbers would be needed, well, um, it would appear that they're there in the diagram all along.

can i check:

* the src/dest-latches are enabled by the "Busy" signal from the SR-Latch (i.e. Qn)
* that's definitely the Function-Unit "Issue" signal (e.g. Issue_Add, Issue_Bool, Issue_Shift)
* the src/dest-latches are definitely *per Function Unit* rather than part of the Issue Unit.

this would be the "Q" values i thought were missing.

the reason i ask is because the corresponding diagrams from Thornton (Figure 76) are a little unclear, and appear to include both the (global) instruction issue flag to set the registers as well as the Issue_XXX flag.  Fi, Fj, Fk - these are i believe binary-form (Function Unit Numbers saying what unit will have a src/dest dependency on result and operand respectively), these i believe you translated to unary, which is why we see the arrays of AND gates.

am i along the right lines?

l.

2019-05-13_13-05.jpg
2019-05-13_13-11.jpg

lkcl

May 15, 2019, 12:48:04 PM5/15/19
to RISC-V HW Dev, lk...@lkcl.net, jr...@cam.ac.uk, mitch...@aol.com, zar...@iis.ee.ethz.ch
after getting into a mess with the FU-FU and FU-Reg Matrices version, i went back to the 2nd chapter "Function Unit Only" version, the one that simply has the great-big-OR-gate for creating the global write and global read pending vectors, and have made some progress *after* making some modifications.  whether they're correct or not remains to be seen.

*as-is*, i noticed the following:

* due to a weird sequence of how Go_Write follows Issue, the dest, src1 and src2 latches (top of attached) correctly capture the incoming register #s

* Issue causes both latches to go "Set", which triggers the (already-latched) register #s to be decoded into Unary, for both the Write_Pending and Read_Pending vectors for *this* unit.

* the global vectors (incoming) are on a ONE CLOCK DELAY, so that this unit can reserve its own registers.  if this were not the case, the Function Unit would NEVER be able to reserve anything!

* let us assume that the global write INT pending vector is zero.

* therefore ANDing the src read pending vector with the global write INT pending vector produces... a zero vector

* ORing all bits of that together produces.... zero

therefore, at start-up, the Readable signal *NEVER* gets asserted, and we're hosed.  no Function unit can *ever* start.  the Readable signals from all FUs (which remain at zero) can go through the Picker, looking for one HIGH, which will never happen.
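
The start-up deadlock follows directly from the AND-then-OR structure as read from the diagram. A minimal sketch (function names are made up here; this models the logic as interpreted in this message, not necessarily as originally intended):

```python
def or_reduce(vector):
    """1 if any bit of the vector is set, else 0."""
    return int(vector != 0)

def readable_as_drawn(src_pending, global_wr_pending):
    """AND the FU's source vector with the global write vector, then OR-reduce."""
    return or_reduce(src_pending & global_wr_pending)
```

With the global write-pending vector at zero, `readable_as_drawn(src, 0)` is 0 for every possible `src`, so no FU ever presents a Readable request to the picker.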

clearly this was not the intent!

looking at how the "Writable" signal is generated, i went, "well, what if we mirrored how that works?"  i then experimented with *inverting* the Readable signal (turning it into a NOR gate just like it is with Writable) and that made some progress.

however there were circumstances still that resulted in data corruption, so i decided to experiment by ANDing in the Q signal from the Go_Write latch (see attached diagram).

amazingly, this seems to work.

so, walking this (new) logic through:

* src1/src2/dest reg #s are captured

* when the two latches for RD/WR are low, the Writable *AND* Readable outputs, now both being ANDed, are BOTH off.  not doing this was the cause of problems due to absolutely every single FU asserting Readable (permanently), thus confusing the Picker by having one (and only one) FU permanently selected.

* Issue goes HI which triggers the binary-to-Unary decoders and sustains the Function Unit's RD/WR Pending vectors (until otherwise dropped by Go_Wr / Go_Rd respectively)

* with both latches now HI, Readable and Writable are now "activate-able"

* src1/src2 unary vectors can now AND (correctly) with the Global Write_Pending vector, and if no bits from that are set, the NOR gate will output "HI".  this creates the RaW condition for this Function Unit

* thus, Readable is set if the GO_Write latch is set *AND* there is no RaW hazard.

* with Readable being set only if the Function Unit needs to go active, the Priority Picker isn't receiving multiple spurious Readable signals.

simply turning the Readable into a NOR gate was not adequate, because the Priority Picker generates the Go_Read signals, and, with the Priority Picker trapped by permanently-HI Readable signals, the Go_Read stayed hi, causing the Readable to stay HI, causing.... endless loop basically.
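
The difference between the plain NOR and the NOR gated by the Go_Write latch can be sketched as follows. This is a model of the modification described in this message only; whether it matches the intended design is exactly the open question, and the function names are invented.

```python
def readable_nor_only(src_pending, global_wr_pending):
    """NOR-reduce the RaW overlap, with no gating."""
    return int((src_pending & global_wr_pending) == 0)

def readable_modified(src_pending, global_wr_pending, go_write_latch_q):
    """NOR-reduce the RaW overlap, gated by the Go_Write latch's Q output."""
    no_raw_hazard = int((src_pending & global_wr_pending) == 0)
    return no_raw_hazard & go_write_latch_q
```

An idle FU has an all-zero source vector, so `readable_nor_only` reports 1 for it (the spurious, permanently-HI request that trapped the picker), whereas `readable_modified` reports 0 until Issue has set the latch.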

the question is, then: is this the right thing to do, and why was it happening in the first place?

l.
2019-05-15_16-25.jpg

lkcl

May 22, 2019, 5:54:49 AM5/22/19
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
hi folks, as this was getting a little long, the discussion has migrated here, for anyone who is interested in out-of-order precise designs (that do not require CAMs):

as of this afternoon, a demo test 4x4 suite of ALUs works successfully with random instruction allocations avoiding both RaW and WaR hazards, as long as a src register is not used as the destination in any one given instruction.  this creates a dependency loop (a RaW *and* WaR hazard) that clearly cannot ever be resolved, and is the current high priority active investigation.

a second bug under investigation is WaW hazards, which, in the 6600 design, are avoided by stalling issue: this is planned to be done through the same mechanism as precise exceptions and branch speculation: prevent write commit until the WaW hazard danger is cleared.

the actual source code is here:

l.

lkcl

May 22, 2019, 8:59:54 AM5/22/19
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, mitch...@aol.com, jr...@cam.ac.uk
On Wednesday, May 22, 2019 at 10:54:49 AM UTC+1, lkcl wrote:
> hi folks, as this was getting a little long, the discussion has migrated here, for anyone who is interested in out-of-order precise designs (that do not require CAMs):
>
> as of this afternoon, a demo test 4x4 suite of ALUs works successfully with random instruction allocations avoiding both RaW and WaR hazards, as long as a src register is not used as the destination in any one given instruction.  this creates a dependency loop (a RaW *and* WaR hazard) that clearly cannot ever be resolved, and is the current high priority active investigation.

fixed.  it required ignoring both the read and write dependencies down the middle of the FunctionUnit to FunctionUnit Matrix (representing the FU to itself)
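
Ignoring the entries "down the middle" means clearing the diagonal of the N x N FU-FU matrix, so that an FU never waits on itself. A sketch using a list-of-lists as a stand-in for the matrix (the function name is an assumption, not taken from the source code):

```python
def mask_self_dependencies(matrix):
    """Return a copy of an N x N FU-FU dependency matrix with the diagonal cleared."""
    return [[0 if row == col else bit
             for col, bit in enumerate(bits)]
            for row, bits in enumerate(matrix)]
```

This would be applied to both the read-dependency and the write-dependency matrices, so that an instruction such as `add r1, r1, r2` no longer records a hazard against its own Function Unit.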
 
> a second bug under investigation is WaW hazards, which, in the 6600 design, are avoided by stalling issue: this is planned to be done through the same mechanism as precise exceptions and branch speculation: prevent write commit until the WaW hazard danger is cleared.

also fixed.  this bug was much simpler: attempting to OR a single bit flag with a vector.

in a preliminary test, 5,000 random instructions have been run, with four ALUs (of varying completion times) and 8 registers, with random unrestricted arbitrary selection of src, dest and operand.  no failures.  that will include RaW, WaR, WaW and self-dependent instructions (src reg == dest reg).
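
as a rough sketch of what "random unrestricted" means for hazard coverage, the snippet below generates random three-register instructions and classifies the hazards between adjacent pairs.  this is not the actual test harness: `Instr`, `hazards`, `self_dependent` and `random_program` are hypothetical names, the seed is arbitrary, and only neighbouring instructions are compared.

```python
import random
from typing import List, NamedTuple, Set


class Instr(NamedTuple):
    dest: int
    src1: int
    src2: int


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Classify register hazards that `later` has against `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RaW")  # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WaR")  # later overwrites what earlier reads
    if later.dest == earlier.dest:
        found.add("WaW")  # both write the same register
    return found


def self_dependent(instr: Instr) -> bool:
    # src reg == dest reg within a single instruction
    return instr.dest in (instr.src1, instr.src2)


def random_program(count: int, num_regs: int = 8, seed: int = 0) -> List[Instr]:
    rng = random.Random(seed)
    return [Instr(rng.randrange(num_regs), rng.randrange(num_regs),
                  rng.randrange(num_regs)) for _ in range(count)]


program = random_program(5000)
seen: Set[str] = set()
for earlier, later in zip(program, program[1:]):
    seen |= hazards(earlier, later)
```

with only 8 registers and 5,000 instructions, all three hazard types and plenty of self-dependent instructions turn up in the stream, which is why an unrestricted random test is a reasonable first check.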

the next phases will include adding LD/ST Function Units, branch speculation / cancellation, precise exceptions and more.

l.

lkcl

May 24, 2019, 3:13:35 AM
to RISC-V HW Dev, zar...@iis.ee.ethz.ch, jr...@cam.ac.uk
On Wednesday, May 22, 2019 at 1:59:54 PM UTC+1, lkcl wrote:

the next phases will include adding LD/ST Function Units, branch speculation / cancellation, precise exceptions and more.

"shadowing" has now been added (the basis of precise exceptions, parallel write-after-write and branch speculation), and used to create write-after-write dependencies.  strictly speaking, what it tracks is instruction-order dependence, which is normally implemented as a linked list or cyclic buffer (the Tomasulo ROB, for example).

in this case, however, it is a 2D bit-matrix of latches, using a unary representation of the current instruction to be executed (its Function Unit #) and the (unary) previous instruction's FU#.  the previous instruction casts a "shadow" over the current one, preventing it from writing.
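
a minimal python model of the shadow matrix described above may help.  this is a sketch under assumptions, not the real latch-level design: the class and method names are hypothetical, and it assumes a shadow is dropped when the shadowing FU has written its result.

```python
from typing import List, Optional


class ShadowMatrix:
    """bits[i][j] set means FU i is under the shadow of FU j."""

    def __init__(self, num_fus: int) -> None:
        self.bits: List[List[bool]] = [
            [False] * num_fus for _ in range(num_fus)
        ]

    def issue(self, current_fu: int, previous_fu: Optional[int]) -> None:
        # the previous instruction's FU casts a shadow over the current one
        if previous_fu is not None and previous_fu != current_fu:
            self.bits[current_fu][previous_fu] = True

    def may_write(self, fu: int) -> bool:
        # an FU may write only when no bit in its row is set
        return not any(self.bits[fu])

    def release(self, fu: int) -> None:
        # assumption: once FU `fu` has written, its shadow over others lifts
        for row in self.bits:
            row[fu] = False


m = ShadowMatrix(4)
m.issue(0, None)  # first instruction goes to FU 0, nothing before it
m.issue(2, 0)     # second goes to FU 2, shadowed by FU 0
m.issue(1, 2)     # third goes to FU 1, shadowed by FU 2

before = [m.may_write(fu) for fu in (0, 2, 1)]
m.release(0)      # FU 0 writes its result
after = [m.may_write(fu) for fu in (0, 2, 1)]
```

in this model the chain of shadows forces writes into issue order: before FU 0 writes, only FU 0 may write; afterwards FU 2 is released, while FU 1 stays shadowed until FU 2 has written in turn.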

whilst this could be considered a performance limitation, given that instructions are prevented from running ahead (writing) to create results that future (potentially slower) instructions would need, operand forwarding is to be added that will mitigate this and get the performance back.

this project is not just about "making a processor", it is about making a *documented* and *easy to understand* processor that can be learned from (even though it is a comprehensive design), as well as audited by independent 3rd parties [for security issues and the non-existence of spying backdoors].

therefore if anyone has any questions, please do raise them either here or on libre-riscv-dev.

l.