On 5/25/2019 11:44 AM, lkcl wrote:
> On Saturday, May 25, 2019 at 5:25:44 AM UTC+8, Ivan Godard wrote:
>> On 5/24/2019 1:29 PM, lkcl wrote:
>>> On Saturday, May 25, 2019 at 4:10:11 AM UTC+8, Ivan Godard wrote:
>>>
>>>>       > * Branches have to be observed:
>>>>                no stall with skid buffers
>>>>       > * DIV and other FSM style blocking units are hard to integrate, as
>>>> they block dependent instructions.
>>>>                no stall with exposed FSM
>>>
>>> What's the complexity cost on each of these? Both in terms of understandability and implementation details.
>>
>> Skid buffers: complexity trivial (capture FU outputs into ping-pong
>> buffers; restore on branch mispredict).
> 
> I found a different use of skid buffers, to do with cache lines, glad you clarified.
> 
> Tricks like this - considered workarounds in the software world - are precisely why I do not like inorder designs.
Hardly a trick - remember, high end Mills are *much* bigger than what 
you are considering. It takes more than one cycle for a stall signal to 
propagate across the core, so we have to provide a way to deal with 
parts that have already done what they shouldn't have.
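In toy form (a Python sketch of the idea; the names are invented and 
this is not Mill RTL), a skid buffer is just a ping-pong pair on each 
FU output:

    class SkidBuffer:
        """Ping-pong capture on one FU output; a toy model only."""
        def __init__(self):
            self.buf = [None, None]   # the ping-pong pair
            self.sel = 0              # which half captures next

        def capture(self, fu_result):
            # every cycle: latch the FU output into the current half
            self.buf[self.sel] = fu_result
            self.sel ^= 1

        def restore(self):
            # on mispredict: the half *not* just written still holds
            # the pre-branch value; hand it back so the wrong-path
            # capture can be discarded
            return self.buf[self.sel]

One pre-branch capture per output is enough for this toy; a real core 
sizes the buffering to its stall-propagation depth.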
You use issue replay, so you have to synchronize retire; Mitch's system 
lets you do that, but the retire point is central and you have to deal 
with the delay between knowledge of completion and retire. Because you 
*assume* genRegs you think that delay is zero because it's all at 
the RF, but there really is time to get the FU result back to the RF.
Mill is *much* more asynchronous. It uses static scheduling so that it 
is always known that an input will always be available when and where it 
is needed, without synchronization. But then it has to deal with inputs 
being created that are not in fact needed, or which are not needed yet 
but will be needed after the interrupt. For that we use result replay, 
which completely changes how the pipe timings work and removes all the 
synch points. Those skid buffers hold operands that have already been 
computed but which will not be used until later.
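In toy form (my Python sketch, not Mill internals): each early result 
is parked, keyed by the logical cycle at which the static schedule says 
it is due, and replayed when the logical clock catches up:

    class ReplayBuffer:
        """Toy holding buffer for operands computed ahead of need."""
        def __init__(self):
            self.pending = {}              # logical due-cycle -> operand

        def park(self, due_cycle, operand):
            # already computed, but not needed until due_cycle
            self.pending[due_cycle] = operand

        def tick(self, logical_cycle):
            # replay an operand exactly as if the FU produced it now
            return self.pending.pop(logical_cycle, None)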
> The augmented 6600 scoreboard has enough information to actually interleave both paths (currently implementing), cancelling one entire path once the branch is known.
We're a statically scheduled design, so all the interleave is present 
but the work is done at compile time. I'm a compiler guy so I think 
that's simpler than any runtime mechanism; you're a hardware guy so no 
doubt you feel the reverse :-)
> This may even be nested indefinitely (multiple branches, multiple paths).
> 
> The Function Units basically *are* the skid buffers (for free).
They are for us too, except that we can buffer more than one result from 
an FU in the latency latches before we ever need to move to spiller buffers.
>> Exposed FSM: it's just code in the regular ISA; not even microcode
> 
> Yuk :)  again, application recompilation.
Depends on what the meaning of "compilation" is :-) Our distribution 
format coming from the compiler is target-independent; no recomp is 
needed. It is specialized to the actual target at install time. This is 
no different than microcode, except that our translation occurs once at 
install, whereas with microcode the translation occurs at runtime on 
every instruction.
Ours is cheaper. :-)
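To make the cost claim concrete, a throwaway Python sketch (the op 
table and all names are invented for illustration): specialization pays 
the generic-to-target translation once at install, where microcode pays 
it on every dynamic execution:

    # Invented op mapping, purely for illustration.
    GEN_TO_TARGET = {"add": "tadd", "div": "tdiv"}

    def specialize(gen_ops):
        # install time: translate once, run the translated form forever
        return [GEN_TO_TARGET[op] for op in gen_ops]

    def microcoded_run(gen_ops, executions):
        # microcode: the same lookup is paid on every dynamic execution
        for _ in range(executions):
            for op in gen_ops:
                _ = GEN_TO_TARGET[op]   # translation paid every time through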
>>> Optimistic atomics sound scary as hell.
>>
>> Transactional semantics have been around for decades; remember COBOL?
>> IBM has it in hardware; Intel tried it for a while but there was too
>> much mess trying to support both kinds of atomicity simultaneously (I
>> understand).
> 
> Need to do a bit more research before being able to assess.
> 
>>> Skid buffers I have not heard of so need to look up (itself not a good sign).
>>
>> These address the delay between execution of a branch and execution of
>> the branched-to, and incorporate any address arithmetic and the
>> confirmation of the executed branch against the prediction. This delay
>> seems to be two cycles typical, leading to delay-slot designs.
> 
> See above. Sounds complex...
Not really; it's just a longer physical belt extending into the spiller, 
which is able to replay results as if they were coming from the original 
FU.
Say you have a hardware divide with a 14-cycle latency. Issue the divide 
and say the next bundle contains a taken call, which does 50 cycles of 
stuff. 14 physical cycles after issue the div FU spits out a result, but 
that result is only due 14 logical (in the issue frame) cycles after 
issue, and because of the call, we are only one logical cycle on in the 
issue frame. So the result operand stays in the div output latch, or, if 
that latch winds up needed, will migrate to a holding buffer in the 
spiller.
Eventually the call returns, and the spiller content gets replayed in 
logical (same frame) retire cycles. Absent more calls or interrupts, 13 
physical (and logical) cycles later the div result in the spiller will 
get dropped on the belt. The consumers of that operand cannot tell that the
div result took a side path through the spiller while the called 
function was running; everything in all the datapaths is completely as 
if the call had never happened.
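Running the numbers from that example as a back-of-envelope Python 
script (the exact cycle accounting is my reading of the above):

    # Back-of-envelope check of the divide/call example.
    DIV_LATENCY = 14    # physical cycles, issue to FU output
    CALL_LENGTH = 50    # physical cycles spent in the callee

    physical_ready  = DIV_LATENCY                    # FU output at cycle 14
    logical_elapsed = 1                              # one logical cycle, then the call
    remaining       = DIV_LATENCY - logical_elapsed  # 13 logical cycles still owed
    belt_drop       = logical_elapsed + CALL_LENGTH + remaining
    print(physical_ready, remaining, belt_drop)      # -> 14 13 64

So the operand sits in the div output latch or the spiller for the 
fifty physical cycles in between, invisibly to its consumers.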
Hence we get to overlap execution with control flow just like an OOO 
design does, except that we have no rename regs or the rest of the OOO 
machinery. I do agree with you that when an IO design forces state synch 
at every issue (or retire, for Mitch) then there is a performance 
penalty. However, that penalty does not arise from IO; it arises from 
the state synch. Once you realize that the synchronization is not 
inherent in IO then all the retire penalty goes away.
In fairness, I admit that the IO issue may incur a different penalty 
that an OOO may avoid. We use partial call inlining and other methods 
for that, but it's easy to find examples where an OOO will gain a few 
percent on an equally-provisioned Mill. That's one of the reasons why we 
are wider than any practical OOO: if the few percent matter in your 
actual app then just go for the next higher Mill family member.
>> An alternative is to speculate down the predicted path and back up if
>> wrong, absorbing the back-up in the mis-predict overhead, but this
>> requires saving the state to back up to.
> 
> And in an inorder design that gets more and more complex, doesn't it?
Everything is complex until you understand it. We here have watched 
Mitch work you through to understanding scoreboards; I suspect if you 
worked your way through the Mill you'd find it easy too. The hard part 
will be abandoning your preconceptions. But that's true for all of us, 
isn't it?
> Whereas in the augmented 6600 design all you need is a latch per Function Unit per branch to be speculated, plus a few gates per each FU.
> 
> These gates hook into the "commit" phase, preventing register write (on all shadowed instructions), so no damage may occur whilst waiting for the branch computation to take place. It hooks *directly* into *already existing* write hazard infrastructure basically.
> 
> Fail causes the instruction to self destruct.
> 
> Success frees up the write hazard.
> 
> It's real simple, given that the infrastructure is already there.
Yeah, I know. It's a clever way to do retire-time synch in a design that 
has to worry about hazards. Of course, why do retire-time synch? And why 
have hazards?
It's very good within its assumptions.
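For readers following along, the shadow scheme described above in toy 
Python form (my rendering of the description, not lkcl's actual RTL):

    class ShadowedWrite:
        """Toy model: a result whose register write is held in the
        shadow of an unresolved branch."""
        def __init__(self, dest_reg, value):
            self.dest_reg = dest_reg
            self.value = value
            self.shadowed = True        # write hazard held while shadowed

        def branch_resolved(self, success, regfile):
            self.shadowed = False
            if success:
                regfile[self.dest_reg] = self.value  # free the write hazard
            # on fail: the instruction self-destructs, result dropped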
>> and far better than classic OOO in my (admittedly ignorant)
>> understanding.
> 
> State of the art revolves around Tomasulo. Multi-issue Tomasulo is a pig to implement.
I agree.
>> I suspect that it would be competitive with our Tin
>> configuration, which is wider but has to pay for static scheduling, and
>> the hardware cost in area and power would be similar.
>>
>>    It's not clear that it scales though.
> 
> Going into genRegs 6 or 8 issue territory (vectors are ok) would need a much more complex register overlap detection and mitigation algorithm than I am prepared to investigate at this very early stage of development.
> 
> Up to 3 or 4 however really should not be hard, and not a lot of cost, i.e. the simple reg overlap detection should give a reasonably high bang for the buck, only starting to be ineffective above 4-instruction issue.
Impressive if you can; I'm too ignorant to say.
> Schemes for going beyond that I have only vaguely dreamed up; I expect there to be a lot of combinatorial ripple through the Matrices, which gives me concern about the impact on max clock rate.
> 
> Overall, then, the summary is that all the tricks that an inorder pipelined general purpose register based design has to deploy, they all have to be done, and they all have to be done at least once. And, going beyond once for each "trick" (skid buffering) is so hairy that it is rarely done.  Early-out pipelining messes with the scheduling so badly that nobody considers it.
> 
> By contrast, with the augmented 6600 design, all the tricks are still there: it's just that with the Dependency Matrices taking care of identifying hazards (all hazards), ALL units are free and clear to operate not only in parallel but also on a time completion schedule of their own choosing. WITHOUT stalling.
Choices are easy when everything you don't understand yet can be 
rejected as being too complicated or tricky by definition :-)