On 8/8/2012 7:32 PM, Mark Thorson wrote:
> EricP wrote:
> You mean hybridizing EPIC and OoO? Wouldn't that
> be like hybridizing AC and DC?
>
To religious zealots, perhaps.
I wanted to reply to some other post with something like "One of the
things I regret about Itanium's lack of market success is that it killed
off research in VLIW for many years. It also killed off OOO research."
Here I am not just talking about combining OOO and VLIW, but also
"purist" VLIW.
I think there were many good and interesting ideas in Itanium.
For example:
==> Register renaming *is* a problem for very wide OOO machines. A
moderate degree of explicit parallelism in the instruction set - 2-wide
or 4-wide - can reduce such hardware costs.
This may not be worth thinking about when you have a trace cache that
contains pre-renamed instructions - instead of renaming all
instructions, you only need to rename live-ins and live-outs that go
outside the block - but I think that it is always better to do something
once and for all in the compiler and jitter, than to have to do and redo
it every time the trace cache misses. Can't help but save power, so
long as the extra bits needed don't waste more power.
==> Predication - although branch predictors are good, they are not
always accurate. There are many things one can do to combine
predication with prediction, so that you do not always waste time
fetching and executing instructions that your predictor accurately
predicts will not need to be executed.
==> Rotating registers - this is a very nice way of creating efficient
software pipelined loops, without code size explosion. I *often* find
that I can come much closer to saturating the machine if I heavily
software pipeline, but it does not pay off because of the code size cost
of the loop prologue and epilogue, and/or the cost of the branches that
try to predict which version of the loop you need to exit, and/or the
cost of repair when you exit early.
==> Sheer number of registers. The sheer number of registers can help
some codes. But the sheer number of registers can be painful for OOO to
deal with. I often wonder about following in the footsteps of Cray-1, and
creating 2 levels of register file: a first, smaller, RF that is easy
for OOO to deal with, and a larger L2 RF that may be less aggressively
renamed.
In some ways, some OOO hardware is already renaming into two levels
of physical register file. This might just be exposing such to compilers.
==> non-32 bit sized instructions. Other ISAs are going that way now.
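
To make the rotating registers point above concrete, here is a toy
Python sketch. It is not Itanium's actual semantics: the names
RotatingRegs and pipelined_scale, the 8-entry file, and the 3-stage
load/multiply/store split are all made up for illustration, and the
stage predicates are computed from the loop counter rather than held in
rotating predicate registers. What it tries to show is that one kernel
body, run n + 2 times, covers fill, steady state and drain, so no
separate prologue and epilogue code is needed.

```python
class RotatingRegs:
    """Toy rotating register file: logical names are offset by a rotating base."""

    def __init__(self, size):
        self.phys = [0] * size
        self.rrb = 0  # rotating register base (hypothetical name)

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # Decrementing the base makes the value written as logical r[n]
        # readable as logical r[n+1] after the rotation.
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_scale(data, factor, offset):
    """Compute [x * factor + offset for x in data] with a 3-stage software pipeline."""
    n = len(data)
    stages = 3
    regs = RotatingRegs(8)
    out = []
    # The same kernel body runs n + stages - 1 times. The stage predicates
    # switch stages on during fill and off during drain.
    for k in range(n + stages - 1):
        p_load = 0 <= k < n
        p_mul = 0 <= k - 1 < n
        p_store = 0 <= k - 2 < n
        if p_load:
            regs.write(0, data[k])                # stage 0 works on item k
        if p_mul:
            regs.write(4, regs.read(1) * factor)  # stage 1 works on item k-1
        if p_store:
            out.append(regs.read(5) + offset)     # stage 2 works on item k-2
        regs.rotate()
    return out
```

Each stage always names the same logical registers (r0, r1, r4, r5);
the rotation is what hands item k from one stage to the next, which is
the job unrolling and register copies would otherwise have to do.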
--
Mark Thorson, who seems to diss hybridizing EPIC and OOO, elsewhere
talks about adding a level of indirection to the instruction set. This
is something I have often thought about.
The GPUs have shown that for certain applications large numbers of
registers - 128 or 256, with 8 bit register fields - and really wide
instructions - 64 bit, sometimes wider - can really help. Often with
lots of specialized widget instructions that *could* be replaced by
simple sequences of instructions, but where simple combinatorial
hardware can do in one instruction cycle what would otherwise take
several instructions.
But for many applications the code size increase of using such large
instructions everywhere is unacceptable.
I wonder if we should not have a compact 32 bit wide instruction mode
(or even 16/32 bit mode) for use in most places. But for tight loops
expose a wider VLIW inspired instruction set - many registers, wide
instructions, rotating registers, predication.
We can imagine the compact instructions as being a specific subset of
the wide instructions.
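
A small Python sketch of what "compact as a subset of wide" could mean.
Everything about the encoding here is an assumption for illustration: a
16 bit compact word with a 4 bit opcode and three 4 bit register
fields, and a wide form with an 8 bit opcode, a 6 bit predicate and 8
bit register fields. The names Wide, expand and compress are
hypothetical. Expansion zero-extends each field and fills in an "always
execute" predicate; a wide instruction has a compact form only if it
stays inside that subset.

```python
from typing import NamedTuple, Optional


class Wide(NamedTuple):
    """Hypothetical wide instruction, shown as decoded fields."""
    opcode: int  # 8 bits
    pred: int    # 6 bits, 0 means "always execute"
    dst: int     # 8 bits
    src1: int    # 8 bits
    src2: int    # 8 bits


def expand(compact: int) -> Wide:
    """Expand a 16 bit compact word into the wide form by zero-extending fields."""
    if not 0 <= compact < (1 << 16):
        raise ValueError("compact word must fit in 16 bits")
    return Wide(
        opcode=(compact >> 12) & 0xF,
        pred=0,  # compact code cannot name a predicate
        dst=(compact >> 8) & 0xF,
        src1=(compact >> 4) & 0xF,
        src2=compact & 0xF,
    )


def compress(w: Wide) -> Optional[int]:
    """Return the compact word for w, or None if w is outside the compact subset."""
    fields = (w.opcode, w.dst, w.src1, w.src2)
    if w.pred != 0 or any(f < 0 or f > 0xF for f in fields):
        return None
    return (w.opcode << 12) | (w.dst << 8) | (w.src1 << 4) | w.src2
```

Anything predicated, or touching a register above 15, simply has no
compact form in this sketch and has to be emitted wide.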
We can imagine switching microarchitectures from OOO to VLIW in such
tight loops.
Heck, we can imagine using the level of indirection everywhere
- so instead of having 64 or 128 bit instructions in the tight loop
fill up a loop buffer, map smaller instructions, 16 or 32 bits, to
the wider canonical instructions of 64 bits or more.
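
One possible reading of that level of indirection, as a Python sketch:
the code stream holds small indices, and a table maps each index to a
wide canonical instruction. This is only a guess at a mechanism, not a
description of any real ISA. The names build_table, decode and
sizes_in_bytes are made up, the strings stand in for 64 bit encodings,
and the 2 byte index and 8 byte entry sizes are assumptions.

```python
def build_table(wide_stream):
    """Deduplicate wide instructions into a table and emit a stream of indices."""
    table = []
    index_of = {}
    compact = []
    for insn in wide_stream:
        if insn not in index_of:
            index_of[insn] = len(table)
            table.append(insn)
        compact.append(index_of[insn])
    return table, compact


def decode(table, compact):
    """Map each small index back to its wide canonical instruction."""
    return [table[i] for i in compact]


def sizes_in_bytes(table, compact, wide_bytes=8, compact_bytes=2):
    """Return (size if every instruction is wide, size of indices plus table)."""
    all_wide = len(compact) * wide_bytes
    indirect = len(compact) * compact_bytes + len(table) * wide_bytes
    return all_wide, indirect
```

For a stream such as ["ld", "fma", "fma", "st", "ld", "fma"] the table
holds three entries and the stream becomes six 2 byte indices. The
table itself costs space, so this only wins when the same wide
instructions are reused often enough.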