I'm on a plane, and Thunderbird did not synchronize all of this thread,
so I don't know what a BGOoO is.
But: I think that an OOO machine *should* be able to take advantage
static scheduling.
The dirty little secret of OOO is that we are often not very much OOO at
all - and the places where we are are often artifacts like
compare-branch. The big benefit of OOO is memory and cache misses, but
a secondary benefit is not having to put to many things in so many
different registers - you can reuse the same logical name many times,
and OOO basically just uses it to connect producer to consumers in a
dataflow sense. It would be nice if, once having done that, we did not
ti even try to write the results to a register file.
Time and again, in a loop, the same instructions have the same schedule.
It is criminal, however - ok, it seems suboptimal - that the OOO
hardware should be recalculating the dataflow graph and the schedule
that repeats itself so many times. Surely we could calculate once and
for all, in the compiler (static scheduling). Or, calculate it in the
hardware, and reuse it (semi-static scheduling).
Current OOO hardware does the most general and expensive scheduling for
every instruction, eagerly. This is simple, because it is one mechanism
that works for everything. But, if this calculation is expensive, power
hungry...
Perhaps what we should be doing is building machines that
a) take a static or semi-static schedule
b) perform a simple, cheap, calculation to determine
b.1) if the static or semi-static schedule is correct - e.g. if it does
not try to start a dependent instruction before its inputs are ready
b.2) but also if it is excessively conservative - if there is excessive
slack - if the instruction could have started earlier. But also,
perhaps an indication that "Yes, the compiler knew this instruction
could have started earlier, but it was deliberately delayed for some
reason". Although such a reason is usually something like a funky
pipeline resource dependency - or, rather, a resource dependency that is
not uniformly pipelined, a resource that appears at different places in
different pipelines, like using the FP-adder at the last stage of an FP
divide,.
c) for that matter, verify dependencies between instructions. Perhaps
not interesting for dependencies through registers, but more interesting
if the dependency graph is computed, e.g. dependencies carried through
memory, or through conditional operators phi operators - where we are
trying to do better than depend on all inputs. (This amounts to
predicting CMOVs or predicates.)
d) if the verification of schedule and dependency passes, no need to do
anything else - go straight to execution. (Or, you may have executed
before or while verifying in which case no need to undo execution.)
d') but if the verification of schedule and dependency fails, use the
more powerful OOO dataflow scheduling mechanisms to make things better.
And optionally remember the improved schedule (semi-static).
Traditional code scheduling tries to arrange instructions so that they
don't stall. Independent operations are interleaved. The compiler tries
to deliver instructions "just in time".
But this is not necessarily the best thing to do. Perhaps it is better
to have dependent instructions arranged in chains, back to back. Saves
registers, or at least register names. I have spent a lot of time
designing schedulers for "strings" or "chains" of dependent
instructions. Perhaps the compiler should give us these chains,
and instead of trying to arrange independent instructions next to each
other, should tell us where the next independent chains begin.
I believe Mark Smotherman calls this a dependence architecture, rather
than the independence architecture that classic VLIW and RISC
superscalar scheduling for in-order architectures try to be.
The problem with such chains is cross-chain dependencies. Also, I gave
up stopped pursuing this at UWisc when it became evident that the
dependency chains would have to run through memory, requiring memory
renaming or STLF store to load forwarding prediction. on x86 and Alpha.
Sometimes I think that what we need is the smallest possible code size,
with an auxiliary datastructure indicating hints as to what the static
schedule should be. What instructions should execute when, which loads
may be dependent on which stores and which they are guaranteed
independent of. ("Static store/load coloring").
Some ISA proposals specify which instructions depend on an instruction,
rather than having such dependencies specified indirectly through registers.
http://semipublic.comp-arch.net/wiki/Instruction_Destination_Instructions_rather_than_Registers
Trouble is, while all of these things are neat, they just don't seem to
be efficient in terms of instruction encoding.
This might be an argument for caching such semi-static results, i.e.
storing them in a decoded and scheduled I$ rather than as the true ISA.
But even here the storage needs to be more efficient than recomputing
the dependency graph and schedule.
I suppose than one of the things I like about vectors and SIMD ISAs is
that the dependency graph and schedule is amortized over many copies of
the same operation.
--
The content of this message is my personal opinion only. Although I am e