When I looked at OoO predication I came to the conclusion that, while at the
ISA level predicate-FALSE instructions are considered NOPs, at the uArch
level "canceled" instructions can still have housekeeping to perform.
All uOps are marked "Enabled" or "Disabled" by the predicate instruction,
and each uOp has two action sets, OnEnable and OnDisable.
The OnEnable actions are the normal instruction operations.
OnDisable actions might be as simple as marking the Instruction Queue
entry as "Done" or more complicated such as propagating an old value
or recovering allocated uOp resources such as LSQ entries.
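To make the idea concrete, here is a toy Python sketch of a uOp carrying both
action sets (the class and names are mine, purely illustrative, not from any
real design):

```python
class UOp:
    """Toy model: a predicated uOp with two action sets.
    The predicate instruction later marks it Enabled or Disabled;
    either way, some action runs and the IQ entry gets marked Done."""

    def __init__(self, on_enable, on_disable):
        self.on_enable = on_enable    # the normal instruction operation
        self.on_disable = on_disable  # housekeeping: propagate the old value,
                                      # free LSQ entries, etc.
        self.done = False             # Instruction Queue "Done" flag

    def resolve(self, predicate):
        # Disabled uOps still "execute", just a different action set.
        action = self.on_enable if predicate else self.on_disable
        result = action()
        self.done = True
        return result

uop = UOp(on_enable=lambda: "ADD r1, r2, r3",
          on_disable=lambda: "propagate old r1")
print(uop.resolve(False))  # → propagate old r1
```

The point the sketch captures: "canceled" is not "free", it is a cheaper
alternate action that still has to retire through the IQ.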
The one situation where an instruction could really be canceled
was if the predicate value was already resolved at Dispatch
(the hand-off from the in-order front end to the OoO back end), where the
Dispatcher could sometimes optimize the uOp into a NOP if,
for example, it had no dest register, like a ST.
In that case the Dispatcher could insert a NOP marked Done into the IQ.
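That dispatch-time shortcut can be sketched in a few lines (again illustrative,
the field names are mine):

```python
def dispatch(uop, pred_resolved, pred_value, has_dest_reg):
    """Toy model of the dispatch-time optimization: if the predicate
    is already known FALSE at Dispatch and the uOp has no dest
    register (e.g. a store), insert a NOP already marked Done into
    the IQ instead of the real uOp, so it never occupies an FU."""
    if pred_resolved and not pred_value and not has_dest_reg:
        return {"op": "NOP", "done": True}
    # Otherwise the uOp enters the IQ normally and carries its
    # OnEnable/OnDisable actions to execution.
    return {"op": uop, "done": False}
```

Note the dest-register restriction: a canceled uOp *with* a dest register
still has to propagate the old value, so it cannot be reduced to a pure NOP.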
It seems to me similar rules would apply for in-order
pipelines even if the uOp OnDisable action just diddles the scoreboard.
Dynamically grouping their threads into OnEnable and OnDisable action sets
would optimize FU utilization for enabled instructions, since a single
issued uOp wouldn't have a mixture of Onxxx actions in different lanes.
Disabled uOp sets can bypass the calculation unit altogether
leaving it free for actual calculations.
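A toy utilization count shows what the grouping buys (the model and numbers
are mine, just to illustrate the argument):

```python
import math

def issue_cycles(preds, lanes=8, grouped=False):
    """Toy model: count FU issue cycles for a window of predicated
    uOps on an N-lane SIMD unit. Ungrouped: every uOp, enabled or
    disabled, occupies a lane. Grouped: only enabled uOps claim FU
    lanes; disabled ones retire via OnDisable housekeeping and
    bypass the calculation unit."""
    work = sum(preds) if grouped else len(preds)
    return math.ceil(work / lanes)

# 16 uOps with every other one predicated off:
preds = [i % 2 == 0 for i in range(16)]
issue_cycles(preds)                # 2 cycles, lanes half idle
issue_cycles(preds, grouped=True)  # 1 cycle, lanes fully packed
```

With a 50% disable rate the grouped schedule halves the FU cycles, which is
exactly the packing-vs-shuffle-cost trade-off discussed below.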
> So, my basic plan is that all instructions in the loop are inserted, then
> as various flow control decisions are made, instruction selection is
> performed by predication (not by branching). The instructions are already
> there, they just need to be turned on and off on a per iteration basis.
> But secondarily, each lane gets its own ALU+FPU (FCU) so it can compute
> what is required without needing to know what its neighbors are doing.
All of their dynamic packing and/or lane shuffling does assume that
there is a hardware advantage to having 8 lane SIMD calculation units
as opposed to 8 independent FP units.
Though the LSQ would still gain from packing multiple operations into
a single large one so it still might have some isolated lane shuffling.
>> It needs a wide wake-up matrix that indicates to up to 8 dependents
>> and 8 register write ports which result lane to pull their result from.
> Big fast wide stuff does need a lot of register ports (or at least the
> forwarding). AND while you can increase the read ports by duplication,
> you cannot with the write ports.
>> Just speculating... the Nvidia SER looks like it might be scheduling
>> and shuffling threads between calculation lanes in a similar manner,
>> and possibly deals with similar issues for result forwarding/writeback.
> Basically, what you are "getting at" is that this kind of reshuffling adds
> latency to the data flow. The observation is correct.
Yes, I was actually thinking of forwarding as a potential critical
path if they want to be able to launch back-to-back executes.
Adding muxes into that path might cause "issues".
If they don't do back-to-back executes (maybe it is like a
barrel processor between warps), it might not be an issue.
All that dynamic lane shuffling presupposes that the gate savings
of an 8-lane SIMD unit are worth the shuffle cost of keeping all lanes
busy, vs having 8 independent FP64 and 8 ALU units.
With 78 billion transistors, having 8 full FP64 units per shader
seems unlikely to break the gate budget, but it might strain the power budget.