right. ok. this question makes sense.
> Shallow pipeline = much higher [than deep pipeline at the same clock frequency] voltage at marginally better throughput and much better latency.
ah no. there are three decisions to make (actually 4 in "traditional" design) and unfortunately what you are doing is conflating at least two of those decisions.
* clock: fast or slow
* voltage: high or low
* pipeline length: long or short
* [and for traditional processor design]: parallelism: more or less.
> But we already established that latency is relatively unimportant, otherwise we wouldn't build it as GPU/GPGPU. So taking throughput per as watt as main figure of merit and assuming, generously, that shallow pipeline will require 1.7x higher voltage than deep pipeline and will deliver 1.2x higher throughput (both figures extremely optimistic) shallow pipeline loses by factor of 2.4X.
you're confusing several things (which is probably why you're asking the question).
normally, these would be 2 completely separate chips:
A GPU:
* low clock rate
* lower voltage
* short pipelines
* large parallelism (to make up for the lower clock rate)
A CPU:
* high clock rate
* higher voltage (needed so that the gates switch fast enough for the high clock rate)
* longer pipelines
* less parallelism (because the applications can't cope, and the clock is faster anyway)
interestingly, the actual instruction COMPLETION time is the SAME (give-or-take), because whilst the clock rate is (for example) doubled, the pipeline length is (for example also) doubled, when comparing these two designs.
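to put rough numbers on that, here is a minimal sketch. the stage counts and clock rates are made-up figures picked only to show the "doubled clock, doubled depth" case, and it assumes one stage per clock cycle with no stalls:

```python
def completion_time_ns(stages: int, clock_hz: float) -> float:
    """Time for one instruction to travel the whole pipeline, in nanoseconds.

    Assumes one stage per clock cycle and no stalls.
    """
    return stages / clock_hz * 1e9


# hypothetical figures (not from any real chip): the CPU-like design has
# twice the clock rate AND twice the pipeline length of the GPU-like one.
gpu_like = completion_time_ns(stages=6, clock_hz=750e6)
cpu_like = completion_time_ns(stages=12, clock_hz=1.5e9)

print(f"GPU-like: {gpu_like:.1f} ns, CPU-like: {cpu_like:.1f} ns")
```

both come out at the same wall-clock completion time, which is the point: the faster clock is "spent" on the extra stages.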
now let's go to the "hybrid" design.
* the amount of actual hardware is the same.
* therefore there is no "advantage" over either the typical CPU or the typical GPU in terms of parallelism: you do not have "more parallelism".
however, what you *do* have is the option to open up the flip-flop gates, halve the pipeline length, HALVE the clock rate, and then REDUCE THE VOLTAGE, which you can do because you are running at a LOWER clock rate.
Q: is the instruction completion time reduced?
A: no. (half the stages at half the clock rate: same wall-clock time.)
Q: is the power still reduced?
A: YES.
Q: what would the performance be like when compared to a "normal" CPU,
if we took a "normal" CPU and simply halved the clock rate?
A: the throughput would be the same.
Q: would the power reduction be the same?
A: yes
Q: would we still be able to reduce the voltage?
A: yes.
Q: so what's different, then?
A: *THE INSTRUCTION COMPLETION TIME*
Q: so the instruction completion latency would be reduced, compared to that half-clocked "normal" CPU (which still has its full-length pipeline)?
A: YES
Q: okaay, so that has implications for the number of branch prediction
units needed, and so on, doesn't it?
A: YEEEEEES.
Q: and the number of Reservation Stations needed in an OoO design?
A: yeeeees
Q: and in an in-order design, the shortened pipelines would mean less
stalling, wouldn't it?
A: yeeeeees!
Q: so overall, the "standard" CPU design, because of the longer pipelines,
would have significant performance / latency penalties compared to the
"flexible" pipeline design?
A: yeeeeees.
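the Q&A above can be sketched as a three-way comparison. this is a first-order model only: it assumes dynamic power goes as C * V^2 * f with the same capacitance in every configuration, and the voltages, stage counts and clock rates are illustrative guesses, not measurements:

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    stages: int
    clock_hz: float
    volts: float  # illustrative guess, not a measured figure

    def latency_ns(self) -> float:
        # one stage per cycle, no stalls
        return self.stages / self.clock_hz * 1e9

    def relative_power(self, ref: "Config") -> float:
        # first-order dynamic power ~ C * V^2 * f, same C assumed throughout
        return (self.volts / ref.volts) ** 2 * (self.clock_hz / ref.clock_hz)


full = Config("normal CPU, full clock", stages=12, clock_hz=1.5e9, volts=1.0)
half = Config("normal CPU, half clock", stages=12, clock_hz=0.75e9, volts=0.8)
flex = Config("flexible, gates open", stages=6, clock_hz=0.75e9, volts=0.8)

for c in (full, half, flex):
    print(f"{c.name:24s} latency {c.latency_ns():5.1f} ns  "
          f"power x{c.relative_power(full):.2f}")
```

under those assumptions the half-clocked "normal" CPU and the flexible design save exactly the same power, but only the flexible one keeps the original completion latency: the half-clocked CPU's latency doubles.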
now let's compare a GPU vs this hybrid design: we're going to go the *other* way. let's start from the gates being OPEN and let's try to ramp up the clock rate and the power.
Q: the "standard" GPU with short pipelines, can we ramp up the clock rate?
A: NO.
Q: why the heck not?
A: because the pipelines are so short - the chains of gates connected
together within each stage are so long - that this sets a MAXIMUM
FREQUENCY that CANNOT BE EXCEEDED.
so the questions stop right there. the "standard" GPU design which has a low clock frequency (low MAXIMUM clock frequency as it turns out), simply cannot be utilised for CPU workloads.
the assumption (i think) that you made was that it would even be *possible* for a (fixed-length) short pipeline design to increase the clock rate.
often this is simply not the case, and in fact, in our design, we will very deliberately pick the number of stages in the "shortened" layout so that it will run successfully at only up to around 800MHz in that configuration.
we will *deliberately* design the number of combinatorial blocks per pipeline stage so that if you want to go above that number (to 1.5GHz) you *have* to close the gates, activate the flip-flops that will be present on every other stage, and double the pipeline length in the process.
*now* you can go to 1.5GHz [and have to increase the voltage to match].
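as a rough sketch of that decision: the two frequency figures come from the text above, but treating 1.5GHz as a hard ceiling, the stage counts (6 and 12), and the function and constant names are all assumptions made up for illustration:

```python
SHORT_MODE_MAX_HZ = 800e6  # from the text: short layout runs up to ~800MHz
LONG_MODE_MAX_HZ = 1.5e9   # from the text; assumed here to be the ceiling
SHORT_DEPTH = 6            # hypothetical stage count, gates open
LONG_DEPTH = 12            # hypothetical stage count, flip-flops active


def select_pipeline_mode(target_hz: float) -> tuple[str, int]:
    """Return (gate state, pipeline depth) needed to reach target_hz."""
    if target_hz <= SHORT_MODE_MAX_HZ:
        # the long gate chains per stage still meet timing: leave gates open
        return ("open", SHORT_DEPTH)
    if target_hz <= LONG_MODE_MAX_HZ:
        # must activate the flip-flops on every other stage: depth doubles
        return ("closed", LONG_DEPTH)
    raise ValueError("target clock is beyond what either layout can reach")


print(select_pipeline_mode(600e6))
print(select_pipeline_mode(1.5e9))
```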
clearly, in this configuration, the increased latency (the longer pipeline depth) would need to be compensated for.
so, actually, yes, you would need lots of branch prediction/speculation units, lots more Reservation Stations and so on, in order to "cope" with the longer pipelines.
which has some interesting implications when you go back down to the shorter depth.
the problem with an Out-of-Order design is that if you have a pipeline depth of say 12, you *MUST* have a minimum of 12 Reservation Stations (actually, 12 Reservation Station "Rows", if you use a Tomasulo Algorithm).
the reason is that you cannot - must not - have data in the pipeline that you cannot "re-associate" with its originator.
therefore, if you have say a pipeline depth of 12, and only 6 RS's (6 RS "rows" total in Tomasulo), the maximum utilisation of the pipeline is ONLY FIFTY PERCENT. as in: it must run with ONLY 6 things being processed at any one time.
if this is not obvious: if you add a 7th, what "ID" do you give it, so that you can re-associate it with the Reservation Station when it comes back out of the pipeline? there are only 6 RS's, there is only one set of operand latches per RS, there are only 6 "IDs"... so... um... how on earth can you even *put* a 7th item into the pipeline, when you have nowhere to store the result??
this is clearly a problem :)
so, you *have* to design the Dependency Matrices (or Tomasulo RS's) to cope:
* 12-long pipelines == 12-long RS's (or more).
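a toy simulation of the ID problem, to make the 50% figure concrete. it is heavily idealised and is not how a real Tomasulo implementation is structured: one operation may issue per cycle, an operation needs a free RS "ID" to enter the pipeline, and the ID is freed (and may be reused in the same cycle) when the result comes out the far end. the function name and parameters are made up for this sketch:

```python
from collections import deque


def pipeline_utilisation(depth: int, num_rs: int, cycles: int = 1000) -> float:
    """Fraction of pipeline stages occupied in steady state."""
    free_ids = deque(range(num_rs))  # RS IDs not currently in flight
    pipe = deque([None] * depth)     # pipe[0] = first stage, pipe[-1] = last
    warmup = depth                   # let the pipeline fill before measuring
    busy = 0
    for cycle in range(warmup + cycles):
        # result leaves the last stage and is re-associated with its RS,
        # which frees that ID for reuse
        done = pipe.pop()
        if done is not None:
            free_ids.append(done)
        # issue a new operation ONLY if there is an ID to tag it with,
        # otherwise a bubble enters the pipeline
        pipe.appendleft(free_ids.popleft() if free_ids else None)
        if cycle >= warmup:
            busy += sum(slot is not None for slot in pipe)
    return busy / (cycles * depth)


print(pipeline_utilisation(depth=12, num_rs=6))   # 6 RS's, 12 stages
print(pipeline_utilisation(depth=12, num_rs=12))  # 12 RS's, 12 stages
```

with 6 IDs only half of the 12 stages ever hold work; with 12 IDs the pipeline stays full.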
now let's drop the clock rate, and halve the pipeline length. now you have 12 RS's, but the pipelines are only SIX long. that means that there are now twice as many RS's as there are pipeline stages to keep filled.
so what do you do with all those yummy extra RS's?
stuff them with data of course! :)
now you can do things like:
* extra run-ahead branch-prediction on loops (yay!)
* detect more opportunities for operand-forwarding (w00t!)
* double the multi-issue execution rate from 2 to 4 (or 4 to 8). yippee!
all of which occurs *without* having any kind of major performance penalties or increase in instruction completion latency.
there's probably some more that i haven't thought of.
so it's a huge set of advantages that go well beyond what is apparent at first glance.
l.