luke.l...@gmail.com wrote:
> On Thursday, October 29, 2020 at 4:50:21 PM UTC, EricP wrote:
>> It's just doing WAW detection but for sets of 4.
>
> ok, got it. interesting. i wonder how that would work out, under real workloads.
Dunno yet, I'm just working through this at the same time as you.
>> The 2 bit version counter has 4 busy flags, one per version.
>> On allocate, increment counter and check if busy flag set.
>> If already set then stall until busy clear.
>> On allocate, set the busy flag
>> On write back, clear busy flag.
>>
>> A scoreboard would already have such a busy flag for each
>> physical register so this just uses an expanded set of those.
>
> interesting. so... it would be... you would be ganging 4 busy signals from the DMs together to produce the 1 busy back at the WaW level?
>
> i am getting confused, trying to think it through.
>
>
>>>> Also when you introduce renaming, even a light version,
>>>> you also must consider conditional branch mispredict rollback.
>>> ok. so we're using Shadowing. so the hold-and-cancellation from the Shadow Matrices when "godie" is called would need to propagate back through to the register cache to say that the virtual reg allocation is now free for reuse, on all virtual registers allocated to all shadow-cancelled instructions, and no "damage" occurs (no rollback needed either. it means a lot of inflight RSes but that is tolerable)
>>>
>>> does that sound workable?
>> Sounds plausible. :-)
>
> :)
>
>> (I don't know enough about what you are doing to comment)
>
> standard precise-capable shadowed 6600. it means no rollback, no history, no snapshots and no transactions needed because nothing ever commits that is unsafe (it just sits in "inflight" result latches), and "cleanup" is provided automatically and implicitly through "cancellation".
>
> ok ok yes there is the concept of rollback, but only inasmuch as inflight operations are cancelled rather than allowed to commit (irreversibly)
Ok, and the commits (result writeback) happen in order when each
FU has the oldest instruction and is ready to retire.
Which ensures precise interrupts.
>> So on write back an FU sends the {Reg,Version} to RegFile
>> which checks if the version is the same as the one writing.
>
> the checking, itself, is an area of concern. does the checking require a multi-way CAM? this would be incredibly expensive. i particularly want to avoid binary comparisons especially on a table with say 32 (or 128) rows, the gate count and power consumption will be off the scale.
>
> l.
No, no CAM required. There is only one current version number
for each arch/physical register so just a single "==" compare.
  if (RegFileStatus[resultBus.DstReg].Version == resultBus.Version)
    RegFileData[resultBus.DstReg] = resultBus.Data;
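To make the bookkeeping concrete, here is a minimal sketch of the version counter plus per-version busy flags described above. It is only a behavioural model under my own assumptions: the names RegFile, RegStatus, allocate and write_back are invented for illustration, and the busy check is done before the counter is committed so that a stalled allocate can simply be retried.

```python
from dataclasses import dataclass, field

NUM_VERSIONS = 4  # a 2-bit version counter gives 4 versions per register


@dataclass
class RegStatus:
    version: int = 0  # newest allocated version of this register
    busy: list = field(default_factory=lambda: [False] * NUM_VERSIONS)


class RegFile:
    """Hypothetical behavioural model, not a description of real hardware."""

    def __init__(self, num_regs=32):
        self.status = [RegStatus() for _ in range(num_regs)]
        self.data = [0] * num_regs

    def allocate(self, reg):
        """Claim the next version of `reg` as a destination.

        Returns the new version number, or None if that version's busy
        flag is still set, in which case the caller must stall and retry.
        """
        st = self.status[reg]
        nxt = (st.version + 1) % NUM_VERSIONS
        if st.busy[nxt]:
            return None
        st.version = nxt
        st.busy[nxt] = True
        return nxt

    def write_back(self, reg, version, value):
        """Release the busy flag; write data only if `version` is current.

        Returns True if the register file data was actually written.
        """
        st = self.status[reg]
        st.busy[version] = False
        if st.version == version:  # one equality compare, no CAM
            self.data[reg] = value
            return True
        return False
```

With this model, three back-to-back allocates of r9 hand out versions 1, 2 and 3, and only the version-3 write-back updates the register data; the two older results just clear their busy flags.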
I'm not sure this versioning approach would actually improve anything.
IIRC you are using valueless Reservation Stations,
meaning the RS pull their values from the reg file when
all source operands are ready and the FU issues to execute.
That means there is no place to stash the forwarded values
of alternate register versions, so each result has to be held in
the FU until it is the oldest and allowed to write back.
That is going to keep FU's allocated until their results
are allowed to retire at result writeback.
Expanding on your original example:
I1: LD r9, ....
I2: ADD r9, r9, 100
I3: ADD r9, r8, r7
I4: ST r9,...
which could dispatch the instructions to multiple FUs,
but they would RAW stall in the register read stage.
Assume r9's 2-bit version number is initially 0.
I1: LD r9, ....
decode assigns r9-1 {reg=9,ver=1} as dest virtual register
RegFileStatus[r9].Version++
I2: ADD r9, r9, 100
decode assigns r9-1 as source virtual register.
decode assigns r9-2 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I2 to FU-Add-1
FU-Add-1 stalls because source r9-1 is Busy (pending write)
I3: ADD r9, r8, r7
decode assigns r9-2 as source virtual register.
decode assigns r9-3 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I3 to FU-Add-2
FU-Add-2 stalls because source r9-2 is Busy (pending write)
I4: ST r9,...
decode assigns r9-3 as source virtual register.
ST FU stalls because r9-3 is Busy
...
I1 LD FU completes and writes result to r9-1.
But since current version of r9 is 3
RegFileStatus[resultBus.DstReg].Version != resultBus.Version
and reg file write of result data does NOT occur.
The result releases the allocated version 1 Busy flag
RegFileStatus[resultBus.DstReg].Busy[resultBus.Version] = 0;
Also FU-Add-1.RS sees that {resultBus.DstReg,resultBus.Version}
matches one of its operands, so it pulls the data off the result bus,
pulls its other operand from RegFile, and starts execution.
...
I2 FU-Add-1 completes and writes result to r9-2.
But since current version of r9 is 3
and reg file write of result data does NOT occur.
The result releases the allocated version 2 Busy flag
Also FU-Add-2.RS sees that the r9-2 result
matches one of its operands, so it pulls the data off the result bus,
pulls its other operand from RegFile, and starts execution.
...
I3 FU-Add-2 completes and writes result to r9-3.
Now the current version of r9 is 3
and reg file write of result data DOES occur.
The result releases the allocated version 3 Busy flag
Also the ST FU's RS sees that the r9-3 result
matches one of its operands, so it pulls the data off the result bus,
pulls its other operand from RegFile, and starts execution.
It doesn't complete any faster than it would without versions.
Adding more result buses could allow multiple result writebacks,
multiple forwardings, and multiple issues per clock but wouldn't
help this example.
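The wake-up steps in the walk-through above amount to each RS comparing the {Reg,Version} tag on the result bus against the tags it is waiting for. A rough sketch, assuming the RS is allowed to capture the bus value as the walk-through does (a strictly valueless RS would only record readiness); ReservationStation, Operand and snoop are names I made up for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Tag = Tuple[int, int]  # (register number, version)


@dataclass
class Operand:
    tag: Tag
    ready: bool = False
    value: Optional[int] = None


class ReservationStation:
    """Hypothetical RS that waits on versioned source tags."""

    def __init__(self, name, *sources):
        self.name = name
        self.operands = [Operand(tag) for tag in sources]

    def snoop(self, tag, value):
        """Capture a broadcast result if its tag matches a waiting operand."""
        for op in self.operands:
            if not op.ready and op.tag == tag:
                op.ready = True
                op.value = value

    def can_issue(self):
        """True once every source operand has been seen on the bus."""
        return all(op.ready for op in self.operands)
```

An RS waiting on r9-3 ignores the r9-1 and r9-2 broadcasts and only becomes issuable when r9-3 appears, which is why the chain still runs strictly one after another.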
Trying a WAW dependency:
I1: LD r9, ....
I2: ST r9,...
I3: ADD r9, r8, r7
I1: LD r9, ....
decode assigns r9-1 {reg=9,ver=1} as dest virtual register
RegFileStatus[r9].Version++
I2: ST r9,...
decode assigns r9-1 as source virtual register.
dispatch passes I2 to FU-ST-1
FU-ST-1 stalls because source r9-1 is Busy (pending write)
I3: ADD r9, r8, r7
decode assigns r8-0, r7-0 as source virtual register.
decode assigns r9-2 as dest virtual register.
RegFileStatus[r9].Version++
dispatch passes I3 to FU-Add-1
FU-Add-1 executes immediately because r8-0 and r7-0 are ready.
...
I3 FU-Add-1 completes but can't write back its result
because I2 has not completed and doing so would make
interrupts imprecise. I3 stalls at writeback.
...
I1 FU-LD completes and writes result to r9-1.
But current version of r9 is 2
and reg file write of result data does NOT occur.
The result releases the allocated version 1 Busy flag
RegFileStatus[resultBus.DstReg].Busy[resultBus.Version] = 0;
Also FU-ST-1.RS sees that r9-1
matches one of its operands, so it pulls the data off the result bus
and pulls its other operand from RegFile.
FU-ST-1 starts execution immediately and unblocks r9-2 writeback.
FU-Add-1 writes back its result.
Which is the same as it would have been without versioning.
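For reference, the decode-time tagging used in this WAW walk-through can be sketched as below. This only models the version assignment (sources read the current version, a destination bumps it); busy-flag stalls are left out, the LD/ST address operands are elided as in the post, and the function name decode and the tuple layout are my own assumptions.

```python
def decode(program, versions=None, num_versions=4):
    """Tag each instruction's registers with version numbers at decode.

    `program` is a list of (name, dest, sources); dest may be None.
    Sources are tagged with the current version of each register,
    then a destination increments that register's version (mod 4).
    """
    versions = {} if versions is None else versions
    tagged = []
    for name, dest, sources in program:
        src_tags = [(r, versions.get(r, 0)) for r in sources]
        dest_tag = None
        if dest is not None:
            versions[dest] = (versions.get(dest, 0) + 1) % num_versions
            dest_tag = (dest, versions[dest])
        tagged.append((name, dest_tag, src_tags))
    return tagged


# The WAW example: (name, destination register, source registers)
waw_example = [
    ("I1: LD", 9, []),       # address operands elided in the post
    ("I2: ST", None, [9]),
    ("I3: ADD", 9, [8, 7]),
]

for name, dest_tag, src_tags in decode(waw_example):
    print(name, "dest", dest_tag, "sources", src_tags)
```

Running it shows the ST reading r9-1 while the ADD gets r9-2 as its destination and r8-0, r7-0 as sources, matching the assignments listed above.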
So these register versions really don't get you anywhere
because they have to write back in-order anyway.
And you can't do forwarding early because the RS's are
valueless and can only pull operands when all are ready.
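A back-of-envelope way to see the in-order write-back point: if results may only write back in program order over a single result bus, an instruction that finishes early just waits. The sketch below assumes one write-back per cycle, and the cycle numbers in the example are made up purely for illustration.

```python
def writeback_cycles(complete_cycles):
    """Map completion cycles (in program order) to write-back cycles.

    Assumes in-order write-back over a single result bus, so each
    instruction writes back no earlier than the cycle after its
    predecessor, and no earlier than its own completion.
    """
    out = []
    prev = -1
    for done in complete_cycles:
        wb = max(done, prev + 1)
        out.append(wb)
        prev = wb
    return out


# Hypothetical timings: I1 (LD) done at 10, I2 (ST) at 11, I3 (ADD) at 3.
early = writeback_cycles([10, 11, 3])
# Same program with the ADD not finishing until cycle 12.
late = writeback_cycles([10, 11, 12])
print(early, late)
```

Both cases give the same write-back schedule, so under these assumptions the ADD finishing early buys nothing.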
Does this make sense?