MitchAlsup wrote:
> On Saturday, March 27, 2021 at 12:28:44 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> In machines that issue multiple instructions per cycle, a sequence like::
>>>
>>> LDW R7,[blah]
>>> ADD R7,R7,#52
>>>
>>> The rename of the R7 of the load does not survive execution and can
>>> be marked as forward-only and not have a backing register allocated.
>> Yes I noticed the possible single-use dest register optimization too.
>> I gave some thought to it but it had some complications
>> so I put it aside. Some of those complications were:
>>
>> - You can't count on the register generator (LDW) and consumer (ADD)
>> being side by side as normally there would be a number of instructions
>> between them. How does Rename know not to assign a physical register
>> at the generator? Looks like the instruction needs a single use flag.
>> Maybe a use for a prefix?
>
> In my GBOoO machine, if the instructions got packed into the same packet
> both registers are visible and the now dead one is "deassigned" by altering
> its register specification field. If the instructions span packets, no alteration
> is performed.
Yeah, packet-o-uOps is the way to do this. This optimization should
only be considered when one has a multi-uOp packet.
That eliminates all the issues I raised.
The packet's uOps all dispatch together so the handshake
is unnecessary to prevent race conditions.
It just has to detect that a dest arch register is
a dest again in a subsequent uOp in the same packet,
and there could be multiple such forwardings in a single packet.
I notice that this packet forwarding approach could potentially
suffer the same lost updates as I described below for the model-91,
if an exception occurs within such a packet, and then it rolls
back to that packet start and single steps to the exception point.
A forwarded-only intermediate result could be lost.
For example,
MUL r0=r1*r2
LD r3,[x]
ADD r0=r0+r3
if MUL's r0 is forward-only and LD throws an exception then r0 is lost.
The forwarding packet has to be marked as such, and if an exception
occurs in it then it has to back up, do the rename it skipped,
and re-execute so intermediate results are always saved.
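
To make the detection and the exception fallback concrete, here is a rough
Python sketch of the rename step for one packet. It is only a model under my
own assumptions: the names UOp, forward_only_dests and rename are made up for
illustration, the free list is a plain list of physical register numbers, and
the "back up and redo the rename" step is modeled as calling rename again with
forward-only disabled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UOp:
    op: str
    dest: Optional[str]           # architectural destination register, or None
    srcs: Tuple[str, ...] = ()    # architectural source registers


def forward_only_dests(packet: List[UOp]) -> List[int]:
    """Indices of uOps whose dest arch register is a dest again later in the packet."""
    found = []
    for i, u in enumerate(packet):
        if u.dest is None:
            continue
        if any(v.dest == u.dest for v in packet[i + 1:]):
            found.append(i)
    return found


def rename(packet: List[UOp], free_list: List[int],
           allow_forward_only: bool = True) -> List[Optional[int]]:
    """Allocate a physical register per dest; None means forward-only (no backing register)."""
    skip = set(forward_only_dests(packet)) if allow_forward_only else set()
    alloc: List[Optional[int]] = []
    for i, u in enumerate(packet):
        if u.dest is None or i in skip:
            alloc.append(None)
        else:
            alloc.append(free_list.pop(0))
    return alloc


# The MUL / LD / ADD example from above (the load address is left out of srcs).
packet = [
    UOp("MUL", "r0", ("r1", "r2")),
    UOp("LD", "r3"),
    UOp("ADD", "r0", ("r0", "r3")),
]

fast = rename(packet, [10, 11, 12])                            # MUL's r0 gets no register
safe = rename(packet, [10, 11, 12], allow_forward_only=False)  # replay after an exception
```

In the fast path MUL's r0 has no backing register, so the replay after LD's
exception uses the second form, where every dest gets a physical register and
the intermediate r0 survives.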
>> - At execute time, how does the generator delay broadcasting its result
>> until it knows the consumer is listening on the forwarding network?
>
> The design allows for the producer to ship its results as soon as it is calculated.
> No delay of send.
>
> There is a register map available at issue time so that if a value is being broadcast
> while a consumer is being issued, the consumer gets the result via forwarding.
> We tracked this with 3 state values {ready, present, waiting}. The waiting state
> was when we know the value is nowhere to be found. When its tag is broadcast
> the state transitions to present. If an instruction sees a source that is present
> he knows that this value is on the result bus at the present clock and takes this
> instead of the register file value. Ready indicates the register in the file has the
> desired value. Tag is broadcast 1 cycle in advance of data.
I do the same, it's just that the tags and operands are stored in the RS's.
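
As I read the three-state scheme, it can be modeled roughly as below. This is
a sketch, not either design: the class and method names are hypothetical, the
result bus is modeled as a dict from tag to value, and the one-cycle gap
between tag and data is collapsed into separate method calls.

```python
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    WAITING = "waiting"   # value is nowhere to be found yet
    PRESENT = "present"   # value is on the result bus this cycle
    READY = "ready"       # value is in the register file


class RegFile:
    def __init__(self, n: int) -> None:
        self.state: List[State] = [State.READY] * n
        self.value: List[int] = [0] * n

    def allocate(self, tag: int) -> None:
        """A producer was issued for this register; its value is pending."""
        self.state[tag] = State.WAITING

    def broadcast_tag(self, tag: int) -> None:
        """Tag goes out ahead of the data: waiting -> present."""
        if self.state[tag] is State.WAITING:
            self.state[tag] = State.PRESENT

    def write_data(self, tag: int, value: int) -> None:
        """Data lands in the file: present -> ready."""
        self.value[tag] = value
        self.state[tag] = State.READY

    def read_source(self, tag: int, result_bus: Dict[int, int]) -> Optional[int]:
        """Value for a consumer being issued, or None if it must wait."""
        if self.state[tag] is State.READY:
            return self.value[tag]
        if self.state[tag] is State.PRESENT:
            return result_bus[tag]    # take the bus value, not the file value
        return None


rf = RegFile(8)
rf.allocate(3)
before = rf.read_source(3, {})            # still waiting
rf.broadcast_tag(3)
forwarded = rf.read_source(3, {3: 99})    # picked off the result bus
rf.write_data(3, 99)
from_file = rf.read_source(3, {})         # now read from the file
```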
>> Generator might execute and complete while consumer is still in front end.
>> Some kind of handshake is required - maybe a forwarding dependency matrix?
>
> {ready, present, waiting} state in register file
>> - It implies that there is a maximum number of instructions between
>> register generator and its consumer that an implementation guarantees
>> to support and detects an error if violated.
>> Otherwise deadlock is possible.
>
> It has to be "IN" the same packet.
Agreed.
>>>> In the model 91, the tags on RS operands minimizes RAW hazards.
>>> I would use the words "serialize to" instead of "minimizes".
>>>
>>>> The forwarding of tagged results to RS operands eliminates WAR and WAW.
>>> The reading of the register file at issue eliminates RAW.
>> In some designs, yes - ones that read the reg file only when
>> the uOp is ready to issue to execute.
>
> Which is after the RAW has been removed, so that is acceptable.
>
>> The model 91 and my hobby design both use valued RS's so the
>> register file is only read at dispatch if register is valid.
>> RAW values are received through forwarding.
>
> My GBOoO had reservation station capture where the result bus data
> was captured into RS entries. It ALSO had forwarding from result
> buses to operand buses, and could forward the instruction being issued
> into execution if the RS did not produce anything to execute. In effect
> there was no cycle devoted to RS in the pipeline, RS was in the feedback
> path between cycles. This improved performance when the machine
> "went empty" after mispredict recovery.
If I understand you correctly,
when a new uOp is dispatched and arrives at a FU
and all its operands are ready,
and a Calculation Unit (CU) is available,
and no other RS's are ready,
then you want to bypass the RS and launch and execute immediately.
It is certainly desirable, and was one of my reasons for using
latches rather than FF in the RS as that might allow the uOp to
flow-through an RS onto the issue bus and directly into the CU,
but I wasn't assuming propagation delays allowing for this.
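
Written as a selection rule, the bypass condition I listed looks something
like this. It is a sketch only: select_issue and its parameters are
hypothetical names, and picking the first ready RS entry is an assumed
policy, not something either design specifies.

```python
from typing import List, Optional


def select_issue(new_uop: Optional[str], new_uop_ready: bool,
                 ready_rs: List[str], cu_available: bool) -> Optional[str]:
    """Pick what the Calculation Unit executes this cycle, or None if it idles."""
    if not cu_available:
        return None
    if ready_rs:
        return ready_rs[0]        # a ready RS entry wins (assumed policy)
    if new_uop is not None and new_uop_ready:
        return new_uop            # bypass: the arriving uOp launches directly
    return None


bypassed = select_issue("ADD", True, [], True)        # RS empty, operands ready
held = select_issue("ADD", True, ["MUL"], True)       # an RS entry is ready instead
```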
>> For me the advantage of that approach was simplicity in that it did not
>> require considering scheduling competing requests for reg file ports at
>> the point of issue, which is a critical path. Each F.U. can make its
>> own scheduling decisions locally based just on its own RS.
>
> My GBOoO machines had 6 slots, 6 instructions could be issued every cycle,
> 6 enter execution, 6 deliver values each cycle, and 6 could write the register
> file each cycle. No port scheduling, no bus scheduling. The RF has 12 read
> ports and 6 write ports and was comprised of 2 copies of a 6R6W file; one
> serving slots[0..2] the other servicing slots[3..5]. Store data was read from
> the file after tag-hit on a lazy execution pipeline (with interlocks).
For a dual dispatch, I was thinking of 4 read ports and 2 write ports.
The uOps dispatched would depend on the total number of read ports required.
The problem is it gets really messy with routing operands all over.
It's not very elegant, and probably should be re-examined.
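
The port-limited dispatch rule amounts to something like the following
sketch. The function name and the idea of a 3-source uOp are my assumptions
for illustration; dispatch is taken to be in order, so it stops at the first
uOp that does not fit the remaining read ports.

```python
from typing import List


def dispatch_count(src_counts: List[int], width: int = 2, read_ports: int = 4) -> int:
    """How many uOps at the head of the queue can dispatch this cycle.

    src_counts[i] is the number of register reads uOp i needs.
    """
    used = 0
    count = 0
    for need in src_counts[:width]:
        if used + need > read_ports:
            break                 # in-order: a uOp that does not fit blocks the rest
        used += need
        count += 1
    return count


both = dispatch_count([2, 2])     # two 2-source uOps use exactly 4 ports
one = dispatch_count([3, 2])      # a 3-source uOp leaves 1 port, second uOp waits
```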