It has been shown that often more than 1/3 of all computations, data
movement, etc., are dynamically dead: they will never be used by the
program on its current path of execution.
There have been hardware proposals that delay computations and elide
some such dynamically dead code, something like the flip side of
[[hardware futures]].
; [[Semi-static optimization]]
Dynamic dead code elimination may be done as a [[semi-static
optimization]] in an [[optimized trace cache]].
Basically, one applies classic compiler dead code elimination
algorithms to the code to be cached, augmented by reachability
predicates.
Or, another classic way of doing hardware optimization is to examine the
post-retirement instruction stream, eliminate dead code, and install
that in the [[optimized trace cache]].
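As a concrete illustration, here is a minimal software sketch of the
classic backward liveness pass such a trace optimizer might run, in
Python for readability. The uop format and the choice of stores and
branches as always-live roots are assumptions for illustration; a real
semi-static optimizer would additionally guard each elimination with
the reachability predicate for its position in the trace.

<pre>
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uop:
    op: str             # e.g. "add", "store", "branch"
    dst: Optional[str]  # destination register, or None
    srcs: list          # source register names

def eliminate_dead(trace):
    """Backward pass: keep a uop iff it has a side effect (store, branch)
    or its destination is read before being overwritten."""
    live_regs = set()
    keep = []
    for uop in reversed(trace):
        if uop.op in ("store", "branch") or uop.dst in live_regs:
            keep.append(uop)
            if uop.dst is not None:
                live_regs.discard(uop.dst)  # this def satisfies later uses
            live_regs.update(uop.srcs)      # its sources become live
    keep.reverse()
    return keep

trace = [
    Uop("add",   "r1", ["r2", "r3"]),
    Uop("mul",   "r1", ["r4", "r5"]),  # overwrites r1: the add is dead
    Uop("store", None, ["r1"]),
]
assert [u.op for u in eliminate_dead(trace)] == ["mul", "store"]
</pre>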
; [[Rename time optimization]]
Dynamic dead code elimination may also be done in the [[renamer]]
pipestage. The basic idea is that, instead of emitting uops as they
are decoded, one assembles blocks of uops in the renamer:
essentially large expressions, potentially with multiple inputs and
outputs.
One does not emit such a block of uops from the renamer to execution
until a usage of the block is detected.
Typically, a store that may be visible to another processor constitutes
a usage that will cause uops to be emitted.
(Although certain stores can be elided, e.g. push/pop/push [[stack
optimizations]] and [[temporary variable optimizations]].)
Similarly, the evaluation of a branch condition may cause uops to be
emitted. But one need only emit the subset of the buffered uops
necessary to evaluate the branch.
Exceptions and other situations where the entire program state may be
needed can also cause uops to be emitted, as can overflow of the
renamer buffer.
If the output of an expression buffered in the renamer is overwritten,
that part of the expression can be eliminated as dead code.
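Here is a minimal sketch of this rename-time buffering, again with an
invented uop format. Uops accumulate in a renamer-side buffer; a store
forces emission of the uops it transitively depends on (the same
closure mechanism would serve a branch condition); and overwriting a
buffered destination kills the old uop as dead code.

<pre>
class RenamerBuffer:
    def __init__(self, capacity=8):
        self.buf = {}        # dst register -> (op, srcs), insertion-ordered
        self.capacity = capacity
        self.emitted = []    # uops released to execution

    def rename(self, op, dst, srcs):
        if op == "store":
            # A store that may be visible to another processor is a
            # usage: emit everything the store transitively depends on.
            self._emit_closure(srcs)
            self.emitted.append((op, dst, srcs))
            return
        # Overwriting a still-buffered result kills the old uop.
        self.buf.pop(dst, None)
        self.buf[dst] = (op, srcs)
        if len(self.buf) > self.capacity:
            self._emit_closure(list(self.buf))  # overflow: drain everything

    def _emit_closure(self, regs):
        for r in regs:
            if r in self.buf:
                op, srcs = self.buf.pop(r)
                self._emit_closure(srcs)  # inputs first, in dataflow order
                self.emitted.append((op, r, srcs))

rb = RenamerBuffer()
rb.rename("add", "r1", ["r2", "r3"])  # buffered, not emitted
rb.rename("mul", "r1", ["r4", "r5"])  # kills the buffered add
rb.rename("store", "[mem]", ["r1"])   # usage: emits the mul, then the store
assert [u[0] for u in rb.emitted] == ["mul", "store"]
</pre>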
; [[Dataflow optimization]]
[[Dynamic dead code elimination]] can also be done in the OOO dataflow
scheduling part of the machine.
Essentially, one assembles the dataflow graph, but one does not
necessarily start execution as soon as inputs are ready.
The renamer adds an additional [[WAW]] arc to the dataflow graph for a
new uop that overwrites an old uop's output.
If that old uop has no remaining [[RAW]] dependents, then it can "fire",
cancelling itself,
and signalling that cancellation to its own input dependencies in turn.
This is essentially [[reverse dataflow execution]].
It is straightforward, although it does require hardware that is not
normally present.
It is easiest to accomplish with a [[bit matrix]] scheduler, where the
[[input bitvector]] becomes the [[output bitvector]] of a cancelled uop.
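Here is a sketch of that cancellation wave, with per-uop reader counts
standing in for the real bitvectors; all the structures are
illustrative. A uop whose output is overwritten, unread, and
side-effect free cancels itself, and the cancellation propagates
backwards along its input arcs. (The sketch glosses over whether a
producer reached by propagation has its architectural destination
overwritten as well.)

<pre>
def cancel_dead(n_uops, inputs, side_effect, overwritten):
    """inputs[i]   = set of uop indices that uop i reads (RAW arcs)
       side_effect = uops that must execute (stores, branches, ...)
       overwritten = uops whose output a later uop overwrites (WAW arc)
       Returns the set of cancelled (dead) uops."""
    readers = [0] * n_uops
    for i in range(n_uops):
        for j in inputs[i]:
            readers[j] += 1
    cancelled = set()
    worklist = [i for i in overwritten
                if readers[i] == 0 and i not in side_effect]
    while worklist:
        i = worklist.pop()
        cancelled.add(i)
        for j in inputs[i]:     # the input set becomes the "output" set
            readers[j] -= 1     # to signal, i.e. reverse dataflow
            if readers[j] == 0 and j not in side_effect:
                worklist.append(j)
    return cancelled

# uop0 feeds uop1; uop1's output is overwritten by uop2 and never read.
# Cancelling uop1 propagates backwards and cancels uop0 as well.
assert cancel_dead(3, inputs=[set(), {0}, set()],
                   side_effect={2}, overwritten={1}) == {0, 1}
</pre>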
It is not at all clear how useful this form of dataflow dead code
elimination is:
it is buildable, but how much does it help?
It is probably most useful if it can be remembered in an [[optimized
trace cache]].
= [[Hardware futures]] =
[[Hardware futures]] are the dual of [[dynamic dead code elimination]].
Instead of eliminating code that does not need to be executed,
they delay execution and evaluation of code until it is known to be
needed.
This delay naturally accomplishes [[dynamic dead code elimination]].
It also provides increased scope for [[hardware CSE]],
and other optimizations.
But it potentially increases latency.
As with [[dynamic dead code elimination]],
it is fairly easy to see how to do this at the renamer:
have the renamer accumulate expressions.
Only release an expression to be evaluated when a usage that demands
evaluation is detected.
We can go further, and avoid the renamer buffering becoming a
bottleneck, by emitting such buffered instructions for a hardware
future into a sort of sidecar storage, rather like that of Akkary's
[[CFP]].
When evaluation is demanded, we issue a token that causes insertion of
those instructions.
Issue: are the buffered instructions pre- or post-rename? It is
easiest to buffer them post-rename,
using dataflow single-assignment registers and memory locations.
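A sketch of the post-rename sidecar, far simpler than [[CFP]]'s actual
slice buffer and purely illustrative: destinations are
single-assignment physical names, deferred uops accumulate in the
sidecar, and a demand token inserts and executes the whole slice.

<pre>
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

class Sidecar:
    def __init__(self, captured_values):
        self.values = dict(captured_values)  # single-assignment name -> value
        self.slice = []                      # deferred, renamed uops

    def defer(self, op, dst, srcs):
        assert dst not in self.values, "post-rename names are written once"
        self.slice.append((op, dst, srcs))

    def demand(self, name):
        # The demand token: insert and execute the buffered slice,
        # then read the requested value.
        for op, dst, srcs in self.slice:
            self.values[dst] = OPS[op](*(self.values[s] for s in srcs))
        self.slice = []
        return self.values[name]

sc = Sidecar({"p1": 3, "p2": 4})
sc.defer("add", "p3", ["p1", "p2"])  # buffered, not executed
sc.defer("mul", "p4", ["p3", "p3"])
assert sc.demand("p4") == 49         # evaluated only now, on demand
</pre>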
As above, this can also be deferred into the OOO dataflow scheduler.
But, again, that uses a fairly critical resource.
Making hardware futures less dynamic, more [[semi-static]], at the
[[optimized trace cache]]
involves dividing the future into two parts:
the first part captures any live inputs,
which must be captured lest they change;
the second part does the actual evaluation on demand.
Typically memory load values, as well as registers, must be captured
as live inputs,
and the future is constrained from doing stores that may be visible to
other threads.
I.e. the hardware future is constrained to be a chunk of code that has
no internal memory references,
or at least only sharply limited internal memory references.
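The two-part split can be sketched as an ordinary software future:
the capture step snapshots register values and load results eagerly,
lest they change, and the compute step is pure, with no memory
references. The names and values below are made up for illustration.

<pre>
import math

def make_future(capture, compute):
    """capture() runs immediately and snapshots the live inputs;
       compute(inputs) runs lazily and touches no memory."""
    inputs = capture()  # part one: at future-creation time
    state = {}
    def force():
        if "value" not in state:
            state["value"] = compute(inputs)  # part two: on demand
        return state["value"]
    return force

regs = {"r1": 0.5, "r2": 2.0}
mem = {0x100: 3.0}

fut = make_future(
    capture=lambda: (regs["r1"], regs["r2"], mem[0x100]),  # snapshot now
    compute=lambda v: math.sin(v[0]) * math.exp(v[1]) + v[2],
)
mem[0x100] = 99.0  # later stores cannot affect the captured inputs
print(fut())       # the transcendentals run only here, on first demand
</pre>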
It is unclear how useful such a future would be. It would have to be
saving considerable register-based computation
- complicated instructions such as transcendentals - or be a
complicated calculation on a small amount of memory.
Such hardware futures might be much more useful if we were released from
the constraints of the [[Von Neumann memory model]],
e.g. if we had classic dataflow style [[single assignment memory]].
But in 2011 that does not seem likely any time soon.
This is one of several articles I plan/hope to write on
http://semipublic.comp-arch.net/wiki/Take_every_optimization_in_the_Dragon_compiler_book_and_consider_it_in_hardware
Sounds to me like DDCE is a *limiting* case of hardware futures,
rather than the dual: delay forever!
> This delay naturally accomplishes [[dynamic dead code elimination]].
> It also provides increased scope for [[hardware CSE]],
> and other optimizations.
> But it potentially increases latency.
If the latency to exec an instruction is L cycles, you have to
decode ahead L cycles to see if its result will be needed.
This seems like an extension of "Dynamic Instruction
Scheduling" (Lynn Conway & others, IBM ACS, 1966). There any
instruction (in a window of N) that wasn't _blocked_ can
execute, with rules specifying what blocking meant (i.e.
operands not ready, etc.) and blocking state updated per
cycle. Here in effect you are adding a new constraint: an
instruction is only executed if its result (or side effect)
is needed. If you can't determine deadness in some T cycles,
you go ahead and execute it since you don't have sufficient
context to know it will never be needed. You may also be
forced to execute without knowing if a source operand is
about to be overwritten. [Please excuse my imprecise
language; I have only recently gotten reinterested in such
issues]
Almost like an on-the-fly construction of a Petri-net model
for a moving window of N instructions or so.
Fair enough.
The futures proposals, both hardware and software, tend to go to
greater pains, taking more work, to set up a future.
Dynamic dead code elimination, by contrast, tends to be lazy and
greedy: it eliminates what it can see within a few cycles, 100 or so
instructions.
Typically the DDCE code, as you say, "may be forced to execute
without knowing if a source operand is about to be overwritten."
Whereas the hardware futures guys may capture that operand, but still
delay the computation.
I guess they are, as usual, points along a spectrum.
The nice thing about software futures is that they use knowledge of
datastructures to allow a future to do significant computation - like
traversing a linked data structure.
It's harder for hardware to know that large in-memory data structures
are not being modified in a way that would make the future inoperative.
One optimization that is common, but not (from what I have seen)
usually mentioned in compiler books, is jump-to-jump elimination.
Instead of jumping to a jump instruction, just go to the eventual
target. A possible extension of the BTB?
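A toy sketch of that BTB extension: on a BTB fill, follow the chain of
unconditional jumps to the eventual target and cache that instead,
bounding the walk so a jump cycle cannot hang the fill. The instruction
encoding is invented for illustration.

<pre>
def resolve_target(code, pc, max_hops=8):
    """Follow a chain of unconditional jumps starting at pc; return the
    first non-jump address, bounding the walk to guard against cycles."""
    for _ in range(max_hops):
        insn = code.get(pc)
        if isinstance(insn, tuple) and insn[0] == "jmp":
            pc = insn[1]  # hop through the intermediate jump
        else:
            return pc
    return pc             # give up: cycle or overly long chain

# 0x10 jumps to 0x20, which jumps to 0x30, where real work lives.
code = {0x10: ("jmp", 0x20), 0x20: ("jmp", 0x30), 0x30: ("add",)}
btb = {0x00: resolve_target(code, 0x10)}  # branch pc -> eventual target
assert btb[0x00] == 0x30                  # fetch skips the jump chain
</pre>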
Anyway, instead of going over old ground (the Dragon book dates back
to the 70s), how about what optimizations are possible if the hardware
supported them? Some of this has been touched on (stream processors,
intelligent RAM, etc.), but it would be interesting to survey papers
in the field to see if anything interesting comes to light (yeah, in
my spare time :( ).
- Tim