On 5/25/2019 11:44 AM, lkcl wrote:
> On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
>> On 5/24/2019 1:29 PM, lkcl wrote:
>>> On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
>>>
>>>> > * Branches have to be observed:
>>>> no stall with skid buffers
>>>> > * DIV and other FSM style blocking units are hard to integrate, as
>>>> they block dependent instructions.
>>>> no stall with exposed FSM
>>>
>>> What's the complexity cost on each of these? Both in terms of understandability as well as implementation details.
>>
>> Skid buffers: complexity trivial (capture FU outputs into ping-pong
>> buffers; restore on branch mispredict).
>
> I found a different use of skid buffers, to do with cache lines, glad you clarified.
>
> Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.
Hardly a trick - remember, high end Mills are *much* bigger than what
you are considering. It takes more than one cycle for a stall signal to
propagate across the core, so we have to provide a way to deal with
parts that have already done what they shouldn't have.
You use issue replay, so you have to synchronize retire; Mitch's system
lets you do that, but the retire point is central and you have to deal
with the delay between knowledge of completion and retire. Because you
*assume* genRegs you think that the delay is zero because it's all at
the RF, but there really is time to get the FU result back.
Mill is *much* more asynchronous. It uses static scheduling so that it
is always known that an input will always be available when and where it
is needed, without synchronization. But then it has to deal with inputs
being created that are not in fact needed, or which are not needed yet
but will be needed after the interrupt returns. For that we use result
replay,
which completely changes how the pipe timings work and removes all the
synch points. Those skid buffers hold operands that have already been
computed but which will not be used until later.
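To make "already computed but not yet used" concrete, here's a toy sketch (my own naming, nothing like the actual gate-level design): a result produced at some physical cycle is parked against the logical cycle at which the static schedule says it is due, and replayed only then, with no stall signal anywhere.

```python
# Illustrative-only model of result replay via a skid buffer:
# results computed early are held, keyed by the logical cycle at
# which the static schedule owes them to consumers.
from collections import defaultdict

class SkidBuffer:
    def __init__(self):
        self.pending = defaultdict(list)  # due logical cycle -> results

    def capture(self, due_cycle, value):
        # Result already computed, but not needed yet: hold it.
        self.pending[due_cycle].append(value)

    def replay(self, logical_cycle):
        # Deliver results exactly when the schedule expects them.
        return self.pending.pop(logical_cycle, [])

buf = SkidBuffer()
buf.capture(due_cycle=5, value="div_result")
assert buf.replay(3) == []              # not due yet: stays parked
assert buf.replay(5) == ["div_result"]  # delivered on schedule
```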
> The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.
We're a statically scheduled design, so all the interleave is present
but the work is done at compile time. I'm a compiler guy so I think
that's simpler than any runtime mechanism; you're a hardware guy so no
doubt you feel the reverse :-)
> This may even be nested indefinitely (multiple branches, multiple paths).
>
> The Function Units basically *are* the skid buffers (for free).
They are for us too, except that we can buffer more than one result from
an FU in the latency latches before we ever need to move to spiller buffers.
>> Exposed FSM: it's just code in the regular ISA; not even microcode
>
> Yuk :) again, application recompilation.
Depends on what the meaning of "compilation" is :-) Our distribution
format coming from the compiler is target-independent; no recomp is
needed. It is specialized to the actual target at install time. This is
no different than microcode, except that our translation occurs once at
install whereas with microcode the translation occurs at runtime on
every instruction.
Ours is cheaper. :-)
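The cost difference is just once-vs-every-time; a trivial sketch (names and encodings invented for illustration):

```python
# Illustrative-only: specialize a target-independent distribution
# format once at install time, vs. translating every instruction on
# every execution as microcode does. Encodings are made up.
GENERIC = ["addu", "mul", "load"] * 1000   # distribution-format ops
TARGET_MAP = {"addu": 0x01, "mul": 0x02, "load": 0x03}

def specialize(code):
    # Done once, at install; runtime then executes native encodings.
    return [TARGET_MAP[op] for op in code]

def microcoded_run(code):
    # Translation repeated per instruction, on every run.
    translations = 0
    for op in code:
        _ = TARGET_MAP[op]
        translations += 1
    return translations

installed = specialize(GENERIC)          # pay the translation cost once
assert len(installed) == len(GENERIC)
assert microcoded_run(GENERIC) == 3000   # paid again on every execution
```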
>>> Optimistic atomics sound scary as hell.
>>
>> Transactional semantics have been around for decades; remember COBOL?
>> IBM has it in hardware; Intel tried it for a while but there was too
>> much mess trying to support both kinds of atomicity simultaneously (I
>> understand).
>
> Need to do a bit more research before being able to assess.
>
>>> Skid buffers I have not heard of so need to look up (itself not a good sign).
>>
>> These address the delay between execution of a branch and execution of
>> the branched-to, and incorporate any address arithmetic and the
>> confirmation of the executed branch against the prediction. This delay
>> seems to be two cycles typical, leading to delay-slot designs.
>
> See above. Sounds complex..
Not really; it's just a longer physical belt extending into the spiller,
which is able to replay results as if they were coming from the original
FU.
Say you have a hardware divide with a 14-cycle latency. Issue the divide
and say the next bundle contains a taken call, which does 50 cycles of
stuff. 14 physical cycles after issue the div FU spits out a result, but
that result is only due 14 logical (in the issue frame) cycles after
issue, and because of the call, we are only one logical cycle on in the
issue frame. So the result operand stays in the div output latch, or, if
that latch winds up needed, will migrate to a holding buffer in the
spiller.
Eventually the call returns, and the spiller content gets replayed in
logical (same frame) retire cycles. Absent more calls or interrupts, 13
physical (and logical) cycles later the div result in the spiller will get
dropped on the belt. The consumers of that operand cannot tell that the
div result took a side path through the spiller while the called
function was running; everything in all the datapaths is completely as
if the call had never happened.
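The divide example is just bookkeeping on two clocks; here's a toy model (my own framing, not Mill internals) showing why the result waits 13 more cycles after the return:

```python
# Toy model of the example above: a 14-cycle divide issues, then a
# taken call burns 50 physical cycles during which the caller's
# logical (issue-frame) clock is frozen at one cycle past issue.
DIV_LATENCY = 14
CALL_PHYSICAL = 50

physical = 0
logical = 0                          # caller's issue-frame clock

physical += 1; logical += 1          # issue div; next bundle is the call
div_ready_physical = DIV_LATENCY     # FU output appears at this time
div_due_logical = DIV_LATENCY        # but it is owed at this logical cycle

physical += CALL_PHYSICAL            # callee runs; caller clock frozen
assert physical > div_ready_physical # result computed during the call...
assert logical < div_due_logical     # ...but not due yet: park in spiller

# After return, caller logical cycles resume 1:1 with physical cycles.
wait = div_due_logical - logical     # 13 more cycles until the drop
physical += wait; logical += wait
assert logical == div_due_logical    # result drops on the belt on time
```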
Hence we get to overlap execution with control flow just like an OOO
design does, except that we have no rename regs or the rest of the OOO
machinery. I do agree with you that when an IO design forces state synch
at every issue (or retire, for Mitch) then there is a performance
penalty. However, that penalty does not arise from IO, it arises from
the state synch. Once you realize that the synchronization is not
inherent in IO then all the retire penalty goes away.
In fairness, I admit that the IO issue may incur a different penalty
that an OOO may avoid. We use partial call inline and other methods for
that, but it's easy to find examples where an OOO will gain a few
percent on an equally-provisioned Mill. That's one of the reasons why we
are wider than any practical OOO: if the few percent matter in your
actual app then just go for the next higher Mill family member.
>> An alternative is to speculate down the predicted path and back up if
>> wrong, absorbing the back-up in the mis-predict overhead, but this
>> requires saving the state to back up to
>
> And in an inorder design that gets more and more complex, doesn't it?
Everything is complex until you understand it. We here have watched
Mitch work you through to understanding scoreboards; I suspect if you
worked your way through the Mill you'd find it easy too. The hard part
will be abandoning your preconceptions. But that's true for all of us,
isn't it?
> Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.
>
> These gates hook into the "commit" phase, preventing register write (on all shadowed instructions), so no damage may occur whilst waiting for the branch computation to take place. It hooks *directly* into *already existing* write hazard infrastructure basically.
>
> Fail causes the instruction to self destruct.
>
> Success frees up the write hazard.
>
> It's real simple, given that the infrastructure is already there.
Yeah, I know. It's a clever way to do retire-time synch in a design that
has to worry about hazards. Of course, why do retire-time synch? And why
have hazards?
It's very good within its assumptions.
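For readers following along, the shadow mechanism you describe can be sketched roughly like this (invented names; the real thing is a few gates per FU hooked into the existing write-hazard logic, not software):

```python
# Illustrative-only model of shadowed commit: a speculated
# instruction's register write is held by a shadow latch until the
# branch resolves. Success frees the write hazard; fail cancels the
# instruction before any damage is done.
class ShadowedFU:
    def __init__(self):
        self.result = None
        self.shadowed = False

    def execute(self, value, under_shadow):
        self.result = value
        self.shadowed = under_shadow   # write held while shadowed

    def resolve_branch(self, taken_correctly):
        if not self.shadowed:
            return self.result         # non-speculated: unaffected
        self.shadowed = False
        if taken_correctly:
            return self.result         # free the write hazard: commit
        self.result = None             # self-destruct: no state written
        return None

fu = ShadowedFU()
fu.execute(42, under_shadow=True)
assert fu.resolve_branch(taken_correctly=False) is None   # cancelled
fu.execute(42, under_shadow=True)
assert fu.resolve_branch(taken_correctly=True) == 42      # committed
```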
>> and far better than classic OOO in my (admittedly ignorant)
>> understanding.
>
> State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.
I agree.
>> I suspect that it would be competitive with our Tin
>> configuration, which is wider but has to pay for static scheduling, and
>> the hardware cost in area and power would be similar.
>>
>> It's not clear that it scales though.
>
> Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
>
> Up to 3 or 4 however really should not be hard, and not a lot of cost, i.e. the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4-instruction issue.
Impressive if you can; I'm too ignorant to say.
> Schemes for going beyond that, which I have only vaguely dreamed up, I expect there to be a lot of combinatorial ripple through the Matrices that give me concern on the impact on max clock rate.
>
> Overall, then, the summary is that all the tricks that an inorder pipelined general purpose register based design has to deploy, they all have to be done, and they all have to be done at least once. And, going beyond once for each "trick" (skid buffering) is so hairy that it is rarely done. Early-out pipelining messes with the scheduling so badly that nobody considers it.
>
> By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.
Choices are easy when everything you don't understand yet can be
rejected as being too complicated or tricky by definition :-)