On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
> On 5/24/2019 1:29 PM, lkcl wrote:
> > On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
> >
> >> > * Branches have to be observed:
> >> no stall with skid buffers
> >> > * DIV and other FSM style blocking units are hard to integrate, as
> >> they block dependent instructions.
> >> no stall with exposed FSM
> >
> > What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.
>
> Skid buffers: complexity trivial (capture FU outputs into ping-pong
> buffers; restore on branch mispredict.)
I had found a different use of the term "skid buffers", to do with cache lines: glad you clarified.
Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.
The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.
This may even be nested indefinitely (multiple branches, multiple paths).
The Function Units basically *are* the skid buffers (for free).
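To make the dual-path idea concrete, here is a minimal sketch (all names are my own invention, not the actual implementation): both sides of the branch are issued, each instruction carrying a "shadow" marker for the path it belongs to, and when the branch resolves, the losing path's Function Units are cancelled while the winning path's shadows are lifted.

```python
# Illustrative sketch: issue both sides of a branch speculatively,
# each Function Unit tagged with the path it belongs to; on branch
# resolution, cancel the losing path and un-shadow the winner.

class FunctionUnit:
    def __init__(self, op):
        self.op = op
        self.shadow_path = None   # None = not speculative
        self.cancelled = False

def issue_both_paths(taken_ops, not_taken_ops):
    """Issue instructions from BOTH branch paths, interleaved."""
    fus = []
    for op in taken_ops:
        fu = FunctionUnit(op)
        fu.shadow_path = "taken"
        fus.append(fu)
    for op in not_taken_ops:
        fu = FunctionUnit(op)
        fu.shadow_path = "not_taken"
        fus.append(fu)
    return fus

def branch_resolves(fus, outcome):
    """outcome is "taken" or "not_taken": cancel the entire losing path."""
    for fu in fus:
        if fu.shadow_path is None:
            continue
        if fu.shadow_path == outcome:
            fu.shadow_path = None    # shadow lifted: free to commit
        else:
            fu.cancelled = True      # wrong path: cancelled outright
```

Nesting (multiple outstanding branches) would just mean a shadow marker per branch rather than a single one; this sketch shows only the single-branch case.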
> Exposed FSM: it's just code in the regular ISA; not even microcode
Yuk :) Again, that means application recompilation.
> > Optimistic atomics sound scary as hell.
>
> Transactional semantics have been around for decades; remember COBOL?
> IBM has it in hardware; Intel tried it for a while but there was too
> much mess trying to support both kinds of atomicity simultaneously (I
> understand)
Need to do a bit more research before being able to assess.
> > Skid buffers I have not heard of so need to look up (itself not a good sign).
>
> These address the delay between execution of a branch and execution of
> the branched-to, and incorporates any address arithmetic and the
> confirmation of the executed branch against the prediction. This delay
> seems to be two cycles typical, leading to delay-slot designs.
See above. Sounds complex.
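As I understand Ivan's description, the idea could be sketched like this (purely illustrative, my own naming): a shift register of saved pipeline state, of depth equal to the branch-realisation delay, from which the machine restores on a mispredict.

```python
# Sketch of the skid-buffer idea: save a snapshot of mutable pipeline
# state every cycle into a fixed-depth shift register; on mispredict,
# back up to the state from `depth` cycles ago.

from collections import deque

class SkidBuffer:
    def __init__(self, depth):
        self.depth = depth
        # newest state at the left; oldest falls off the right when full
        self.saved = deque(maxlen=depth)

    def cycle(self, state):
        """Capture a snapshot of the (speculatively updated) state."""
        self.saved.appendleft(dict(state))

    def mispredict(self):
        """Return the oldest retained snapshot, i.e. the state to back
        up to when the branch turns out to be mispredicted."""
        return dict(self.saved[-1])
```

With a typical two-cycle branch-realisation delay, `depth` would be 2, matching the "delay-slot" observation in the quoted text.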
> An alternative is to speculate down the predicted path and back up if
> wrong, absorbing the back-up in the mis-predict overhead, but this
> requires saving the state to back up to
And in an inorder design that gets more and more complex, doesn't it?
Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.
These gates hook into the "commit" phase, preventing register write on all shadowed instructions, so that no damage can occur whilst waiting for the branch computation to take place. This hooks *directly* into the *already existing* write-hazard infrastructure, basically.
Fail causes the instruction to self destruct.
Success frees up the write hazard.
It's real simple, given that the infrastructure is already there.
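A minimal sketch of that commit gate (again, my own illustrative naming, not the actual HDL): a shadowed instruction's register write is held off exactly as a write hazard would hold it off; branch success lifts the shadow, branch failure cancels the instruction.

```python
# Sketch of the shadow/commit gate per Computation Unit: write-enable
# is asserted only when the result is ready AND no shadow is pending.

class CompUnit:
    def __init__(self):
        self.result_ready = False
        self.shadowed = False      # one latch per FU per speculated branch
        self.write_enable = False
        self.cancelled = False

    def update_commit(self):
        # register write may only proceed with no shadow outstanding:
        # this is just the pre-existing write-hazard gating reused
        self.write_enable = self.result_ready and not self.shadowed

    def branch_success(self):
        self.shadowed = False      # success frees up the write hazard
        self.update_commit()

    def branch_fail(self):
        self.cancelled = True      # fail: the instruction self-destructs
        self.write_enable = False
```

The point being that `update_commit` is not new logic: it is the same gate that already blocks a write-after-write hazard, with one extra input.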
> As the amount of (potential)
> back up is statically fixed, the pipeline can be arranged to save its
> state at every cycle into a list of buffers (or shift regs) of length
> equal to the branch realization delay. These are known as skid buffers,
> analogous to a car skidding as it makes a hard stop.
>
> Skid buffers do not work well if the rest of the design is not
> idempotent, because any unreversible state changes would require special
> handling to back up. Thus if the design uses genRegs it cannot be
> permitted to overwrite a register during the brief speculation period;
> that would require skid buffers for every register.
And that is precisely what the 6600-style scoreboard does: it provides the direct equivalent of a skid buffer for every single register or other object it is designed to protect and manage.
> However when all or
> most actions during the speculation period are idempotent it is only
> necessary to retain the transient mutable state, typically the FU outputs.
Yes. The Computation Units (one per FU) have register latches and also an FU output latch.
If you want more speculation or more "free hidden reg renaming", you just add more FUs and load them up more (multi-issue).
> The Mill hardware uses skid buffering in two places: by having the
> physical belt be twice the size of the logical belt, and in the store
> request stream in the hierarchy. There is actually one operation
> (rescue) that can address into the belt skid buffers to recover
> operands that would otherwise have fallen off the belt.
>
> > Not heard the term "exposed FSM" used until today.
>
> Just take the FSM out of the microcode and emulate it in the normal ISA
> (probably augmented with a few helper operations). The DIV (etc) op's
> code then interleaves/overlaps with everything else the machine width
> could be doing from the rest of the code.
>
> > All of which kinda helps make my point, that an inorder design has to turn into a multi headed hydra monster of design disproportionate complexity, in order (ha ha) to work around in-order inherent fundamental limitations.
>
> I guess that's true if you try to tack in-order as an on-top-of addition
> to the design. Try just using it as a way to discard complexity instead.
> The Mill is stunningly simple once you wrap your head around everything
> that we *don't* do :-)
:)
The one thing that I really miss from genRegs standard instruction sets is "tags" that say that a register is to be discarded after use.
Context, basically.
RISC-V tries adding something like this, using FENCE instructions.
> I do appreciate your (or rather Mitch's) scoreboard design. It's a
> clever way to get some parallelism from a one- or two-way-issue machine,
I'm working on a way to get up to 4 issue, and 8 issue Vector Processing should be easy (due to the regular nature of the vectors).
Multi-issue "the simple way" is, I believe, doable by detecting whether the src or dest regs overlap. In any given group of to-be-multi-issued instructions, at the first sign of an overlap, STOP at the instruction that would conflict. Let that be the point at which the *NEXT* cycle begins, and let the Dep Matrices sort it out (in that next cycle).
So with only non-overlapping instructions entering the Dependency Matrices in parallel, no conflicts can occur in either a row *or* a column, because the overlap check has already guaranteed that only non-conflicting regs enter the Matrices.
Quite simple and effective (and not scalable: see below).
It should be logically easy to notice that vector instructions should not conflict as much, and the logic required to detect longer runs should also be easier than for genRegs.
To resolve (preserve) the instruction commit order, I have hooked into the *preexisting* branch/precise/shadow system, using it in effect to create a 2D bitmatrix version of a ROB / cyclic instruction buffer.
Working out if an instruction may commit in this new multi-commit context is a matter of keeping a counter per instruction.
Increment every in-flight instruction's counter by 1 when a new instruction is added; decrement when an instruction is retired.
It should be evident that in a multi issue context, "retire possible" is a simple comparison "if counter less than or equal to number of instructions allowed to commit".
In this way several instructions can be retired even though they have this bitmatrix pseudo-linkedlist which would otherwise only allow 1 of them to retire in any given clock.
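One plausible reading of that counter scheme (my interpretation, sketched in illustrative code): each instruction's counter holds the number of older, not-yet-retired instructions ahead of it, and any completed instruction whose counter is below the commit width may retire this clock.

```python
# Sketch of per-instruction retire counters replacing a ROB's linked
# ordering: counter = number of older in-flight instructions; several
# instructions may retire per clock, not just the single oldest.

class RetireTracker:
    def __init__(self, commit_width):
        self.commit_width = commit_width
        self.counters = {}          # insn id -> count of older in-flight

    def add(self, insn_id):
        # a new instruction sits behind everything currently in flight
        self.counters[insn_id] = len(self.counters)

    def retirable(self, done_ids):
        # "retire possible" = counter less than the number of
        # instructions allowed to commit this clock
        return [i for i in done_ids
                if self.counters[i] < self.commit_width]

    def retire(self, insn_id):
        c = self.counters.pop(insn_id)
        for i in self.counters:
            if self.counters[i] > c:   # instructions behind it move up
                self.counters[i] -= 1
```

With `commit_width=2`, two completed instructions at the head can retire in the same clock, which is the multi-commit behaviour described above.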
> and far better than classic OOO in my (admittedly ignorant)
> understanding.
State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.
> I suspect that it would be competitive with our Tin
> configuration, which is wider but has to pay for static scheduling, and
> the hardware cost in area and power would be similar.
>
> It's not clear that it scales though.
Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
Up to 3 or 4, however, really should not be hard, and at not a lot of cost: the simple reg-overlap detection should give a reasonably high bang for the buck, only starting to become ineffective above 4-instruction issue.
For schemes going beyond that, which I have only vaguely dreamed up, I expect a lot of combinatorial ripple through the Matrices, which gives me concern about the impact on max clock rate.
Overall, then, the summary is that all of the tricks an in-order pipelined general-purpose-register design has to deploy do all have to be done, at least once each. Going beyond once for each "trick" (skid buffering, for example) is so hairy that it is rarely done, and early-out pipelining messes with the scheduling so badly that nobody considers it.
By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.
L.