hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.
register-renaming (and built-in operand forwarding) is automatically achieved with the 6600-style scoreboard design by way of the Q-Table working alongside the NxN Function Unit Dependency Matrix. operand forwarding in the 6600 was achieved by using same-cycle (actually, falling-edge) write-through capability on the Register File. in modern processors, "write-through" of SRAMs can be used for this exact same purpose. this *can* be augmented by a separate additional operand-forwarding bus, very similar to an SRAM except... without the SRAM.

also, pipelining in the 6600 was sort-of achieved by a "revolving door" created by the three-way interaction between Go_Read (register file read), Go_Write (Function Unit latch write) and Go_Commit [write to regfile, i think: mitch, can you confirm?]. only two of these signals were permitted to be raised at any one time, and only when one of them was HI could the next-in-the-chain go HI, at which point the previous one would drop - hilariously, a bit like a very short caterpillar going round in circles, forever. operand forwarding was achieved when two of these lines (one of them Go_Read) were HI, for example.

the important thing is that actual pipelining was not introduced until the 7600. not many details are known about the 7600; however, the Function Unit input-latches and corresponding associated output-latch had this "revolving door" 3-way MUTEX on each of them *even in the 7600*.

in the creation of the ariane scoreboard.sv, however, there is a problem, one that is masked / hidden by the fact that the integer ALU operations are all single-cycle.

* on the DIV unit, you *do* have a form of MUTEX that blocks the Function Unit input operands from being used whilst the DIV operation is in progress. you will therefore not experience any problems with DIV results.

* on the integer FUs, these are single-cycle, so the problem is *HIDDEN*. note however that there will also be a performance ceiling (64-bit MUL in a single cycle) due to gate latency which, if fixed by creating a multi-stage integer MUL, *WILL* result in problems.
* on the LD/ST FU, you *may* see problems. i haven't investigated in depth, because the design deviates from the 6600 by not splitting LD/ST out into its own separate sparse-array matrix (see section 10.7).

* on the FPU FUs, for which i see no evidence of a MUTEX, and which i assume are multi-stage, you *WILL* experience problems.
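to illustrate what that DIV-style blocking looks like in general form, here's a minimal nmigen sketch of a per-Function-Unit "busy" MUTEX (purely illustrative: the class, signal names and interface are mine, not taken from ariane or from mitch's chapters):

from nmigen import Elaboratable, Module, Signal

class FUBusyMutex(Elaboratable):
    """Per-Function-Unit MUTEX: once an operation is accepted into this
    FU's operand latches, the FU is "busy" and may not accept another
    operation until its result has actually been written out (go_write),
    i.e. until the associated write hazards have been cleared.
    """
    def __init__(self):
        self.issue     = Signal()  # issue logic wants to use this FU
        self.go_write  = Signal()  # result written out, hazards cleared
        self.busy      = Signal()  # operand/result latches are in use
        self.can_issue = Signal()  # safe to hand a new op to this FU

    def elaborate(self, platform):
        m = Module()
        m.d.comb += self.can_issue.eq(~self.busy)
        with m.If(self.issue & ~self.busy):
            m.d.sync += self.busy.eq(1)   # claim the FU latches
        with m.Elif(self.go_write):
            m.d.sync += self.busy.eq(0)   # release only on write-back
        return m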
the problem is that without MUTEXes blocking the Function Unit from re-use until its output is generated, the Q-Table will become corrupted. this will show up *ONLY* during exceptions (and possibly branch-speculation cancellation), because it is on exceptions that rollback is initiated.
when a destination (result) register number is transferred through the Q-Table to a Function Unit, it does so by assuming that there is a commit-block on that register which "preserves" the name of that destination register. this is sort-of done by preventing its destruction, using the write-hazard infrastructure, until such time as its committing can "make it disappear" safely.

thus: only when that result is *actually available* is it SAFE to DESTROY (retire) that result, because at the point at which the result is stored, all "commits" - all write hazards - have been cleared.

by allowing the Q-Table to proceed to a new entry without allowing the write-hazards to be cleared, you are DESTROYING absolutely CRITICAL information. i repeat: by allowing the Q-Table to tell the Function Unit that the write-hazards do not matter, rollback is NO LONGER SAFELY POSSIBLE.

the solution is outlined in Section 11.4.9.2 of mitch's book chapters. reproduced with kind permission from mitch alsup, an image for people who may not have these chapters: you can see there that there are *four* "apparent" Function Units, all with src1, src2 operand latches and associated corresponding result latches, so as far as the NxN Function Unit Dependency Matrix is concerned it APPEARS that there are FOUR adders (or four FPUs). there are NOT four FPUs. there is only ONE (pipelined) FPU. that FPU however has *FOUR* sets of src1-src2-result latches.

the absolutely critical insight here is that the number of FU latch-sets *must* equal or exceed the pipeline depth.

* if it is less, the consequence will be that the pipeline will be under-utilised.

* if it is greater, the design gains an increased ability to perform register renaming; bear in mind, however, that the FU NxN Dependency Matrix is, clearly, O(N^2).

one solution in common usage is to merge multiple functions into the Computation Unit - funnily enough, exactly as has already been done in both the ariane FPU Function Unit and the ariane integer ALU.

you will be able to check that this is the case by temporarily creating a global "de-pipelining" mutex that only permits a single operation to be carried out at any one time: only one of an FPU operation, LD/ST operation, Branch operation or Integer operation may be permitted at any one time, *NO PIPELINING PERMITTED AT ALL*...

... and at that point the problems that i anticipate you are experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".
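that "de-pipelining" check could be as crude as the following minimal nmigen sketch (one request line per class of unit; entirely illustrative, not taken from any existing codebase):

from nmigen import Elaboratable, Module, Signal

class GlobalDePipelineMutex(Elaboratable):
    """Debug-only global MUTEX: at most ONE operation (FPU, LD/ST,
    branch or integer) may be in flight at any one time.  A request is
    only granted when nothing is outstanding, and no further grant is
    given until the in-flight operation signals "done".
    """
    def __init__(self, n_units=4):
        self.req  = Signal(n_units)  # one request line per unit class
        self.gnt  = Signal(n_units)  # one-hot grant (at most one bit set)
        self.done = Signal()         # the in-flight op has completed
        self.busy = Signal()         # something is currently in flight

    def elaborate(self, platform):
        m = Module()
        # simple priority arbiter, gated by the global busy flag
        with m.If(~self.busy):
            for i in reversed(range(len(self.req))):
                with m.If(self.req[i]):
                    m.d.comb += self.gnt.eq(1 << i)
        # set busy when anything is granted, clear it on completion
        with m.If(self.gnt != 0):
            m.d.sync += self.busy.eq(1)
        with m.Elif(self.done):
            m.d.sync += self.busy.eq(0)
        return m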
we have an implementation of a multi-in, multi-out "fan" system in nmigen (if you're able and happy to read python HDL); at least the comments are useful.

the "mid" - multiplexer id - is what is passed in down the pipeline, just as ordinary boring unmodified data, only to be used on *exit* from the pipe to identify which associated fan-out latch the result is to go to. that mux-id is used here, for example: you can see at lines 202, 203, 204 and 208 that the mid indexes which of the "next stages" to route the incoming data to. i really like nmigen :)

in the diagram in Mitch Alsup's book you can see that this is replaced with a FIFO (just to the side of the Concurrent Unit, aka pipeline). however, that particular design strategy works only for a fixed-length pipeline, i.e. it *prevents* early-out and it *prevents* the amalgamation of multiple pipelines (with different lengths) behind a common "ALU" API.
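for anyone who can't (or doesn't want to) dig through the actual file, the fan-out end boils down to something like this cut-down nmigen sketch (NOT the real implementation: class and signal names here are made up purely for illustration):

from nmigen import Elaboratable, Module, Signal, Array

class MidFanOut(Elaboratable):
    """The mux-id ("mid") travels down the pipeline alongside the data,
    untouched, and on exit from the pipe it selects which of the N
    result latches the result is routed to.
    """
    def __init__(self, n=4, width=64):
        self.n       = n
        self.o_valid = Signal()        # pipeline exit: result is valid
        self.o_mid   = Signal(max(1, (n - 1).bit_length()))
        self.o_data  = Signal(width)   # the result itself
        self.result  = Array([Signal(width, name="result%d" % i)
                              for i in range(n)])
        self.res_wen = Signal(n)       # per-latch "result captured" strobe

    def elaborate(self, platform):
        m = Module()
        for i in range(self.n):
            hit = self.o_valid & (self.o_mid == i)
            m.d.comb += self.res_wen[i].eq(hit)
            with m.If(hit):
                # route the incoming data to result latch i
                m.d.sync += self.result[i].eq(self.o_data)
        return m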
by passing the multiplexer id down alongside the data, early-out pipelines, reordering pipeline layouts, *and* FSMs can all be combined, and the multi-in multi-out Concurrent Unit doesn't give a damn :)

early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised); however it would be anticipated that this requires dual-porting on the result stage (into the multiplexer). luckily, we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed rather than higher-gate-count MUXes.
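in nmigen terms the OR-combining is about as dumb as it sounds - something like this sketch (illustrative only; its correctness relies entirely on the one-hot guarantee described above):

from nmigen import Elaboratable, Module, Signal, Repl

class OrCombine(Elaboratable):
    """Merge the early-out path and the "normal" path of a pipeline by
    ORing them, on the guarantee that at most one of the two drives a
    result in any given cycle (so no MUX is needed).
    """
    def __init__(self, width=64):
        self.width       = width
        self.early_valid = Signal()
        self.early_data  = Signal(width)
        self.norm_valid  = Signal()
        self.norm_data   = Signal(width)
        self.out_valid   = Signal()
        self.out_data    = Signal(width)

    def elaborate(self, platform):
        m = Module()
        # gate each data bus with its own valid, then OR the two buses
        m.d.comb += [
            self.out_valid.eq(self.early_valid | self.norm_valid),
            self.out_data.eq(
                (self.early_data & Repl(self.early_valid, self.width)) |
                (self.norm_data & Repl(self.norm_valid, self.width))),
        ]
        return m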
two fan-outs will still be required (one for the early-out path on the FPU pipeline, one for the "normal" path on the FPU pipeline); it's just that the fanned-out outputs from each may be safely ORed together, given the *guarantee* that there will be only one mid in use at any one time.

so... i think that covers it. summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback. fix that, and you'll have an absolutely fantastic design.
l.
Dear Luke,

thanks for all the suggestions and the book chapter.

On Fri, Apr 19, 2019 at 3:22 PM <lk...@lkcl.net> wrote:

> hiya florian, appreciate that you're busy with current tasks - i spotted a potential major design flaw that... how can i put this... "could be the cause of the very debugging and investigation tasks that i surmise may be preventing you from having the time to evaluate mitch alsup's book chapters and the information that i am providing on 6600-style scoreboard design" shall we say.

Unfortunately, this is not the reason, as I still have to pursue a PhD degree (and you won't get that solely by engineering an in-order core) ;-).
I appreciate your input, and I have the two book chapters as my Easter reading. I hope they will be insightful.
> ... and at that point the problems that i anticipate you are experiencing (based on an examination of this design) on exceptions and branch prediction should "puzzlingly and mysteriously disappear for no apparent reason".

I am actually not experiencing any more problems: the "design flaw" which produced the buggy behavior of non-idempotent reads was associating interrupts during commit. The observation that this should be done during decode instead eliminated that problem. We are happily booting Debian Linux on the FPGA and multi-core SMP Linux on the OpenPiton platform (with PLIC).
> early-out pipelines (such as FPU "special cases" for NaN, zero and INF being handled very early in the pipeline) allow less work to be done (less power utilised); however it would be anticipated that this requires dual-porting on the result stage (into the multiplexer). luckily, we *guarantee* that only *one* of the array of result stages is ever going to be active at any one time, so, bizarrely, ORing of the two possible paths may be deployed rather than higher-gate-count MUXes.

Our FPU manages all that.
> so... i think that covers it. summary: you're missing critical MUTEXes on the Function Unit src1-src2-result latches, without which data corruption will occur (guaranteed) on any form of rollback. fix that, and you'll have an absolutely fantastic design.

One "problem" which I have is this merged scoreboard/ROB structure (it is quite area- and timing-critical).
I have the possibility to roll back the entire speculative state, so no data corruption can occur ;-)
But I am certain that the current design point can be further improved. I am hoping that the material you sent me will give further insights.
Hi Luke,

so I've taken the time and read the pages you have sent me, thanks again. They contain some neat tricks.
I am still a bit confused about a couple of things:

1. The paper mentions an 8R/4W register file. That seems unnecessarily big for a single-issue core.
You won't retire much more than one instruction and definitely not read more than two operands (let us concentrate on the integer part). Even for a dual issue approach, you read approximately 1.5 registers per instruction. So three ports should be sufficient.
Not a big deal, I think; that can be circumvented by some read/write port handshaking on the regfile.

2. Unfortunately, the drawings are quite hard to read as they are blurry pixel graphics, and the style is not very consistent or self-explanatory, so I am sure I missed a couple of points there.

Also, there seem to be transmission gates in the drawings which I think are meant to be some kind of storage (flip-flops).
3. The text talks about a continuous scoreboard and a dependency-matrix scoreboard.

The former seems to be a distributed version which takes all the information from the reservation stations/FU and generates global signaling. Does it require extra storage somewhere centralized, or can everything be computed combinatorially?

Although it is not entirely clear from the text, it seems that the continuous scoreboard is preferable.
4. What information exactly does the reservation station/FU need to contain?
I assume from the drawings that it also needs to capture the write data from the computation unit.
5. I am desperately missing a high-level diagram of how the different things fall into place.

6. The text talks a lot about latches. I am not sure whether latches in the circuit sense are meant.

If so, I would recommend taking a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.
7. One part of the text (11.1.1) says that instructions can enter the FUs even if they have WAW hazards; they just need to make sure they do not write their result back until the WAW hazard has been cleared.
The rest of the text talks about issue stalling if a WAW hazard has been detected.
8. What does "issue" actually mean?
Does it mean putting it in the reservation station/FU?
Or does it mean bringing it to the next stage aka read operands?
The image in 11.3.1 seems to indicate the former. Do you really have to keep a separate queue of unissued instructions?
What prevents you from putting them in the reservation station and marking them as "not issued"?
9. I am not really getting the placement of read reservations. If an instruction is being issued, it places read and write reservations.

If the instruction unconditionally places a read reservation, aren't you throwing away the temporal relationship?

How do you maintain order on the read operands if you issue out of order (e.g. if you defer issuance of one instruction which would be vital for the correct dependency)? In general, how do you keep a "temporal order"?

Maybe if I knew the exact fields of the scoreboard, that would help me understand it.
See comments inline Luke and best of luck with your
implementation. I would recommend that you turn those schematics
into a Verilog simulation model (it supports T-latches which are
called tranif0/1 devices. Depending on the usage it may result in
so-called registered nets that have an impact on simulation
speed).
Quite possibly.... although in talking with Mitch he explained that only D latches are needed (3 gates). Flip flops (10 gates) are usually used by non-gate-level designers because they're "safe" (resettable).
And have a high cost.
> 3. The text talks about a continuous scoreboard and a dependency-matrix scoreboard.

Yes. They are definitely separate.

> The former seems to be a distributed version which takes all the information from the reservation stations/FU and generates global signaling. Does it require extra storage somewhere centralized, or can everything be computed combinatorially?

It looks that way! It really is much simpler than we have been led to believe. You can confirm this by examining the original circuit diagrams from Thornton's book.
Yes, the full gate-level design is in that book. It's drawn using ECL, as they literally hand-built the entire machine using PCBs stuffed with 3-leg transistors that you can buy from RSOnline today!
> Although it is not entirely clear from the text, it seems that the continuous scoreboard is preferable.
Yes.
> 4. What information exactly does the reservation station/FU need to contain?
Nothing! It's a D-latch bank! That's it! Latches for the src ops, a latch for the result, and... errr... that's it.

Mitch and I had a debate about this (details), happy to relate when you have time.
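to make that concrete, one latch set per "apparent" FU is literally just the following (a minimal nmigen sketch with made-up names; it uses ordinary clocked registers because that is what nmigen makes easy, whereas the original 6600 and mitch's design use transparent D latches):

from nmigen import Elaboratable, Module, Signal

class FULatchSet(Elaboratable):
    """One Function-Unit latch set: two source-operand latches plus a
    result latch.  That really is all the "reservation station" holds.
    """
    def __init__(self, width=64):
        self.go_read  = Signal()       # capture operands from the regfile
        self.src1_in  = Signal(width)
        self.src2_in  = Signal(width)
        self.go_write = Signal()       # capture the result from the pipeline
        self.res_in   = Signal(width)
        self.src1     = Signal(width)  # latched operand 1
        self.src2     = Signal(width)  # latched operand 2
        self.result   = Signal(width)  # latched result

    def elaborate(self, platform):
        m = Module()
        with m.If(self.go_read):
            m.d.sync += [self.src1.eq(self.src1_in),
                         self.src2.eq(self.src2_in)]
        with m.If(self.go_write):
            m.d.sync += self.result.eq(self.res_in)
        return m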
> I assume from the drawings that it also needs to capture the write data from the computation unit.
Yes.

> 5. I am desperately missing a high-level diagram of how the different things fall into place.
Believe it or not, it is so much simpler than you may have been led to believe: the diagram on page 32 of chapter 10 *really is everything that is needed*.
Or p28 or p30. Different highlighting on different sections.
> 6. The text talks a lot about latches. I am not sure whether latches in the circuit sense are meant.
Yes D latches. Really D latches. Not flip flops.
> If so, I would recommend taking a different approach. Some text also indicates that you are making use of the transparent phases of latches. This all seems rather dangerous.
(with thanks to Florian for kindly agreeing to publish what was formerly a private discussion, this may be of benefit to others in the future. i'm also attaching two key screenshots, reproduced with kind permission from Mitch Alsup. discussion edited slightly).
... sounds all very obvious so far, right? :) now the tricky bit:
A write-through RAM is an optimised bit-cell (equivalent to a
D-latch) surrounded by sense amplifiers. Logically it is
equivalent to a latch, but it would be in no way competitive with
a latch in area or power consumption. Also in UDSM technologies
RAMs may require a threshold margin optimisation control as well
as optional BIST circuitry, which I guess is the part that is
interesting for you.
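(purely as an illustration of the same-cycle write-through behaviour being discussed, here is a toy nmigen sketch using ordinary flip-flops plus a bypass mux rather than an optimised bit-cell; the names are made up and this is not taken from any of the designs under discussion)

from nmigen import Elaboratable, Module, Signal, Array

class BypassRegFile(Elaboratable):
    """Toy 4-entry register file with same-cycle write-through: a read
    of the register being written in the same cycle returns the new
    value, logically mimicking the 6600's write-through regfile.
    """
    def __init__(self, width=64):
        self.wr_en   = Signal()
        self.wr_addr = Signal(2)
        self.wr_data = Signal(width)
        self.rd_addr = Signal(2)
        self.rd_data = Signal(width)
        self._regs   = Array([Signal(width, name="reg%d" % i)
                              for i in range(4)])

    def elaborate(self, platform):
        m = Module()
        # registered write, takes effect at the next clock edge
        with m.If(self.wr_en):
            m.d.sync += self._regs[self.wr_addr].eq(self.wr_data)
        # combinatorial read, bypassing the in-flight write when the
        # addresses match (the "write-through" part)
        with m.If(self.wr_en & (self.wr_addr == self.rd_addr)):
            m.d.comb += self.rd_data.eq(self.wr_data)
        with m.Else():
            m.d.comb += self.rd_data.eq(self._regs[self.rd_addr])
        return m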
--
For very small, simple structures, logic elements are vastly more efficient than SRAMs.
SRAMs will have a word line which goes to all relevant content
(i.e. the corresponding bits in the word), but if N is less than
some magic, but quite large threshold then using an optimised bit
cell is so not worth it. Now the more interesting question is what
size of register file would be needed to justify a dedicated
register file IP. This will of course depend on the number of
simultaneous ports in use. A large number of ports will degrade
timing because of the number of word lines hanging off each cell.
For more details refer to (for example) Chapter 9 of "Fundamentals
of Modern VLSI Devices" by Yuan Taur and Tak H Ning. My
understanding is that you are planning to have a register file
with a large number of ports.
Hi Luke-san,

Off topic:

Unlike a company working on a closed code base, an open software tool retains the opportunity to succeed after its owner leaves; that is healthy.

I use Chisel 3.0 for my project; it generates so many intermediate wires that trace-checking the generated code takes unnecessary time.

I will try using nmigen, thank you for the info.
Even now, using commercial ("product") tools is the common-sense choice. That is fine for company work, since the vendor has a responsibility to its users under the license. Ordinary users do not worry about an earthquake; they just go about their daily lives. That makes it difficult to explain the shift to open source, especially to companies that are potential users.