I have been under the strong impression that predication and OOO (out of
order) execution are mortal enemies. If A, then not B, and if B, then not
A. And yet I've recently seen some suggestions that McKinley may be an OOO
machine. I'm confused. Can anyone point me towards the light at the end of
the tunnel?
That I can't help you with.
>I have been under the strong impression that predication and OOO (out of
>order) execution are mortal enemies. If A, then not B, and if B, then not
>A. And yet I've recently seen some suggestions that McKinley may be an OOO
>machine. I'm confused. Can anyone point me towards the light at the end of
>the tunnel?
This I can. Predication essentially involves putting a DAG (directed
acyclic graph) structure onto the instructions. Aka a partial order.
If you aren't a mathematician, that means a set of "A must come before
B" relationships such that there is no loop "A > B > C > D > A" or similar.
If you have such a structure, there are instruction movements you can
perform that will break the relationships and others that won't. The
allowed ones are the ones that won't.
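To make that concrete, here is a minimal Python sketch (names and the tiny dependence set are invented for illustration): a reordering of instructions is legal exactly when it respects every "A must come before B" edge of the DAG, i.e. when it is a topological order.

```python
# Sketch: legal instruction reorderings are exactly the topological
# orders of the dependence DAG. The DAG below is invented.

def is_legal_order(order, deps):
    """order: list of instruction names; deps: set of (a, b) pairs
    meaning 'a must come before b'. Legal iff no pair is violated."""
    pos = {insn: i for i, insn in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)

# A must precede B and C; B and C are unordered relative to each other.
deps = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}

print(is_legal_order(["A", "B", "C", "D"], deps))  # True
print(is_legal_order(["A", "C", "B", "D"], deps))  # True: B and C may swap
print(is_legal_order(["B", "A", "C", "D"], deps))  # False: breaks A-before-B
```

The "allowed movements" are precisely the swaps that carry one topological order into another.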
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Tel.: +44 1223 334761 Fax: +44 1223 334679
> I'm in learning mode on Itanium. For example, I'm trying to figure out why
> a chipset released to production in the year 2001 (460GX) uses PC66 DIMMs.
Because the design was specified "ages" ago when PC66 was the available
technology. After all, the original schedule had Itanium delivered in 1997...
I don't think there is any fundamental problem in combining them.
Consider: what predication does is essentially to construct larger
basic blocks[*] by converting control dependence into data dependence.
The dependences between instructions make this a dataflow graph.
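A minimal sketch of that conversion (an invented example; IA-64 would express the same thing with predicate registers): the branch version has a control dependence on the comparison, the if-converted version replaces it with a data dependence on the predicate p.

```python
# If-conversion sketch: a control dependence (which path executes)
# becomes a data dependence (p selects the result).

def with_branch(a, b, x):
    if a < b:            # control dependence on the comparison
        x = x + 1
    else:
        x = x - 1
    return x

def if_converted(a, b, x):
    p = a < b            # compute the predicate
    t1 = x + 1           # both sides execute unconditionally...
    t2 = x - 1
    return t1 if p else t2   # ...and p selects the result

assert with_branch(1, 2, 10) == if_converted(1, 2, 10) == 11
assert with_branch(3, 2, 10) == if_converted(3, 2, 10) == 9
```

The two arms plus the select form the larger "basic block" the post describes.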
A superscalar engine, whether in- or out-of-order, executes the
incoming instructions subject to the dataflow and the issue rules of
the microarch. Normally, there is prediction and speculative execution
of branches, which breaks control dependences to allow early
issue. Some recent microarchs (21264?) seem to speculatively break
memory dependences as well.
Note that a dependence on a predicate register is just another data
dependence. Superscalarity is thus not inimical to IA-64, except
perhaps philosophically. For instance, the hardware could speculate on
predicate values to accelerate some computations. (This is the moral
equivalent to speculative execution, except data dependences seem more
flexible/complex than control dependences.)
Indeed, you could do like Yale Patt et al did (or PII/III/IV/...) and
have a frontend dynamically convert the ISA into micro instructions,
predicates, instruction groups, explicit speculation and all.
(Digression: perhaps the nicest thing about IA-64 or HPL-PD is that
the architecture permits you to play around with dependences without a
lot of pain: you can convert control flow to dataflow by predication;
you can break memory dependences to reorder loads and stores[**], even
without alias analysis; you can delay exceptions to break even more
dependences.)
[*] Not quite, because the predicated 'basic block' may have multiple
exits. One compiler technique converts all premature exits (e.g.,
predicated branches) into forward jumps to a single terminating
branch. While the hardware can't rely on this being done by the
compiler, I think you could implement it that way under the hood.
[**] Except Itanium needs approx. 50 cycles to recover from
misspeculation, which is probably too expensive except for cautious
use. Hopefully it will change.
Thomas Lindgren thoma...@alteon.com
> I have been under the strong impression that predication and OOO (out of
> order) execution are mortal enemies. If A, then not B, and if B, then not A.
Predication can be fairly simply implemented in an Out of Order core.
Problem is, the most natural form requires 4 inputs:
Dest := IF cond THEN A+B ELSE Old-Dest
It is certainly possible to create 4-input 1-output microoperations,
but on some pipelines, e.g. pipelines that read all of their values
out of a register file, that's a lot of hardware.
You can split the predicated op into 2 3-input ops, etc., and that may
be worthwhile... but it gets clumsy.
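A toy sketch of that split (names invented): the 4-input micro-op becomes an unconditional compute followed by a 3-input conditional select, with the same result.

```python
# Splitting Dest := IF cond THEN A+B ELSE Old-Dest into two micro-ops.

def four_input_op(cond, a, b, old_dest):
    # one micro-op, four inputs: cond, a, b, old_dest
    return a + b if cond else old_dest

def split_ops(cond, a, b, old_dest):
    # micro-op 1: unconditional compute (2 inputs, 1 output)
    tmp = a + b
    # micro-op 2: conditional select (3 inputs: cond, tmp, old_dest)
    return tmp if cond else old_dest

assert four_input_op(True, 2, 3, 9) == split_ops(True, 2, 3, 9) == 5
assert four_input_op(False, 2, 3, 9) == split_ops(False, 2, 3, 9) == 9
```

The clumsiness is that every predicated op now occupies two issue slots and the select still needs three inputs.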
Similarly, you can predict predicates - but if predicates are predictable,
why predicate at all? (That's a rhetorical question: there are several reasons
why predicate prediction is nicer than branch prediction, but the overall
issue with predictability still arises.)
The main problem with predication on an Out of Order machine is what
I call predicate dataflow latency. Do you want to delay the consumer
of the result of the predicated instruction until the last arriving input
from the set cond/A/B/Old-Dest arrives? Obviously, if cond arrives
early, you can dispatch if cond is false and Old-Dest arrives before
the A and B operands, and vice versa. But what if the condition arrives
late, but the operands A/B/Old-Dest arrive early? Do you want to delay
until the predicate arrives?
Obviously, I have a solution in mind, but I encourage others to find
different ones. I wouldn't mind seeing my terminology "predicate dataflow
latency" in general use to describe the problem.
One of the nice things about problems like this is that the solutions
are generally useful, even to non-VLIW, non-predicated, machines.
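Here is a toy model of the "predicate dataflow latency" question (cycle numbers are invented): a naive scheduler waits for the last of all four inputs, while an eager one that sees cond early only needs the inputs of the taken side.

```python
# When is the consumer of Dest := IF cond THEN A+B ELSE Old-Dest ready?

def naive_ready(t_cond, t_a, t_b, t_old):
    # wait for the last-arriving of all four inputs
    return max(t_cond, t_a, t_b, t_old)

def eager_ready(t_cond, t_a, t_b, t_old, cond):
    # if cond is known early, only the taken side's inputs matter
    needed = [t_a, t_b] if cond else [t_old]
    return max(t_cond, *needed)

# cond arrives at cycle 1 and is false: only Old-Dest (cycle 2) matters,
# so the eager scheme is ready at cycle 2 instead of cycle 8.
print(naive_ready(1, 8, 8, 2))          # 8
print(eager_ready(1, 8, 8, 2, False))   # 2
```

The hard case the post raises is the opposite arrival order: operands early, condition late.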
What's wrong with squashing the instruction if the predicate turns out
to be false? You already do that for speculative execution and on
some CPUs for instructions depending on loads.
>It is certainly possible to create 4-input 1-output microoperations,
>but on some pipelines, e.g. pipelines that read all of their values
>out of a register file, that's a lot of hardware.
With squashing and predicates being in a separate register file (as on
IA64), it should not be that bad.
BTW, can anybody explain why the 21264's cmov needs so many resources?
>The main problem with predication on an Out of Order machine is what
>I call predicate dataflow latency.
Yes. So what's its advantage over using a branch? The advantage of
the unpredictable predicate over an unpredictable branch is that the
branch feeds back from condition computation into instruction fetch,
whereas the predicate just feeds from an earlier condition code
computation into the execution stage, and only stalls the instructions
that depend on its result.
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
A minute more of thought brought up the answer to the first question.
With squashing, predication feeds back into register fetch or somesuch
(one cycle later should be possible); only with the 4-input variant do we
see the shorter path of feeding into Execute.
Hmm, not quite: there is no difference between in-order and OOO in the
argument above; also, see below for a scheme that does not need that.
My current guess is that an OOO microarchitecture wants to rename the
destination register into something else. To get rid of the old-dest
operand, you would have to feed the predicate into the renaming logic
(long latency), or suppress renaming in that case (probably too much
complexity, not to mention the performance penalty).
Ok, here's a basic pipeline scheme for doing predicates (in a
simple pipeline without register renaming).
Consider the sequence
p=... #1: compute predicate
r1 = ... if p #2: predicated instruction
... = r1 ... #3: use result of predicated instruction
At first execute instruction 2 as if it was not predicated; i.e., no
additional ports needed. Start instruction 3 normally, reading r1
from the register file (no additional ports needed over the worst
case). Feed the predicate into the forwarding logic that decides
whether to use the result of instruction 2 or r1 (additional control
logic needed, but no additional MUX in the data path). Feed the
predicate into the commit/writeback stage to squash instruction 2 if
the predicate is false.
The control signals depending on the predicate are needed at the
earliest in the cycle after the execute stage of the predicated
instruction, so it might be possible to execute instruction 2 in the
same cycle as instruction 1 (that would require quite short paths
through the control logic, though).
Dealing with multiple predicated instructions writing to the same
register in the same cycle is possible if there are enough forwarding
busses (as there usually will be in a superscalar CPU with that many
functional units).
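A toy model of the scheme above (register names and values invented): the predicated instruction executes unconditionally, the forwarding logic picks between its result and the old r1 based on the predicate, and commit squashes the write when the predicate is false.

```python
# Toy model: instr 2 (r1 = ... if p) executes as if unpredicated;
# the predicate only steers forwarding and writeback.

def run(p, r1_old, operand):
    ex_result = operand * 2        # instr 2 executes unconditionally
    # forwarding decision for instr 3: use instr 2's result iff p is true
    r1_seen_by_consumer = ex_result if p else r1_old
    # writeback/commit: squash instr 2's write when p is false
    r1_committed = ex_result if p else r1_old
    return r1_seen_by_consumer, r1_committed

assert run(True, 5, 7) == (14, 14)    # predicate true: new value forwarded
assert run(False, 5, 7) == (5, 5)     # predicate false: old r1 survives
```

No extra register-file ports are needed; only the forwarding and commit control depend on p.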
Hmm, it should not be so hard to extend that to register renaming,
although I guess that depends on the renaming scheme used; renaming
can only happen on writeback anyway (and does not happen if the
predicate is false and the instruction is squashed), the forwarding
until then just has to take the predicate into consideration. I guess
I overlooked some nasty details, though. Or the reason for reading
old-dest is something else still:-)
What if you don't use general predication, but only a very limited form:
r1 = if p then r2 else r3
r1 = if p then r1 + r2 else r1
r1 = if p then r1 - r2 else r1
r1 = if p then constant else r1
r1 = if p then r1 + constant else r1
Limit the possibly predicated instructions so that the hardware
requirements don't grow except that the predicate is an additional
operand. The instruction is executed unconditionally, but its result
depends on the predicate. That would minimise changes in an existing
processor, and it doesn't put too much pressure on the instruction
length. No prediction is required. As an optimisation, some of these
instructions could be discarded if the processor detects a false
predicate early enough.
Of course this is more limited than a scheme that can predicate any
instruction, but I think it would work in quite a big percentage of all
cases where more general predication could be used.
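A small sketch of how far the limited forms reach (example invented): the "r1 = if p then r1 + r2 else r1" form alone is enough to build a branchless absolute value, with the predicate as just another operand.

```python
# One of the limited predicated forms, modeled as a 3-input operation.

def pred_add(p, r1, r2):
    # r1 = if p then r1 + r2 else r1; executes unconditionally,
    # the result merely depends on p
    return r1 + r2 if p else r1

def abs_value(x):
    p = x < 0
    # negate by adding -2*x, but only when x is negative
    return pred_add(p, x, -2 * x)

assert abs_value(-7) == 7
assert abs_value(7) == 7
```

The destination doubling as a source is what keeps the operand count down to three.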
A friend of mine started writing a Viterbi decoder on 4stack, and he
instantly demanded a "max" function. One where you afterwards still have
access to the carry (predicate). The way the current 4stack conditional
execution works (nullifying further instructions when p is false) would
allow doing things like "r1 = if p then r1 op r2 else r1" without
problems, including max; but then you first have to compare and then to
select.
I suggested expanding the predication logic so that you can have a
predicate setup (p := r1 comp r2), or a predicated select (r1 := if r1
comp r2 then r1 else r2), or both (r1 = if (p := r1 comp r2) then r1
else r2).
Since the 4stack is in-order, nullifying further instructions is easy
(just gate the clock of the ALU and register partition with the
predicate). Also adding a select bypass to the ALU (for max/min
operations) is not difficult.
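A sketch of the combined form proposed above, used for max (names invented): one operation both sets the predicate and selects, and the predicate is still available afterwards.

```python
# r1 := if (p := r1 comp r2) then r1 else r2, modeled as one step,
# with the predicate ('carry') surviving for later use.

def select_max(r1, r2):
    p = r1 > r2                 # predicate setup
    r1 = r1 if p else r2        # predicated select in the same operation
    return r1, p

assert select_max(3, 8) == (8, False)
assert select_max(9, 4) == (9, True)
```

This is what the Viterbi kernel wants: the max value plus the comparison outcome, without a compare-then-select pair.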
For OOO, using a generic nullify approach means a delay between p
evaluation -> register access or speculative execution of things that
afterwards are thrown away. It's also possible to have conditional
feedthrough instructions (if condition is null, feed through first
register), which would allow implementing the instructions described
above. You could also have two opcodes per instruction, and select based
on the predicate (registers then are the same).
However, IA-64 allows the predicate to be computed in the same cycle as the
predicated instruction, so the predicate can only disable writeback -
all the hard work (register access, ALU operation) has already happened and
produced heat. Not only that, the bypass logic also has to select the
right output based on the just-computed predicate.
You can make that OOO, too (just add another "hidden" instruction that
makes the conditional select), but it would increase latency, too. But
after all, I'm quite sure the way Itanic now does predicated execution
also increases latency ;-).
"If you want it done right, you have to do it yourself"
> What if you don't use general predication, but only a very limited form:
> r1 = if p then r2 else r3
> r1 = if p then r1 + r2 else r1
> r1 = if p then r1 - r2 else r1
> r1 = if p then constant else r1
> r1 = if p then r1 + constant else r1
Since you use the same predicate, the "risc" way to
do the same would be branchless and three instructions:
Squashing works on in-order machines, which write OVER their destinations.
If the operation is squashed, the old value of the destination still lives
in the register.
Squashing doesn't work so well on dataflow machines, where each
operation has a different output. If the operation is squashed,
something must copy the old value of the logical register INTO
the new physical register that has been allocated for the destination.
You can do two separately squashable ops:
IF cond THEN Dest := A+B
IF NOT cond THEN Dest := old Dest
I.e. you have two microoperations writing the same physical
destination register - but only one will ever not be squashed.
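A toy model of that scheme (structures invented): both micro-ops target the same newly allocated physical register, exactly one survives squashing, so the physical register is written exactly once.

```python
# Two complementary squashable micro-ops writing one physical register.

def rename_and_execute(cond, a, b, old_dest_value):
    phys = {}  # physical register file; 'p1' is the renamed Dest
    # micro-op 1: IF cond THEN p1 := A+B         (squashed if cond false)
    if cond:
        phys['p1'] = a + b
    # micro-op 2: IF NOT cond THEN p1 := old Dest (squashed if cond true)
    if not cond:
        phys['p1'] = old_dest_value
    return phys['p1']

assert rename_and_execute(True, 2, 3, 9) == 5
assert rename_and_execute(False, 2, 3, 9) == 9
```

The point is that renaming stays unconditional: the predicate never has to reach the rename stage, only the squash logic.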