Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

Move on 'commit' predication?

6 views

Skip to first unread message

Paul A. Clayton

unread,

Oct 21, 2010, 11:31:35 AM10/21/10

Would it make sense to assign a less likely but somewhat
probable block of instructions to a move-on-commit-style
rename mechanism? In the predicted more likely case,
the mini-ROB (this might work better in a microarchitecture
which allocates chunks of the ROB, e.g., to support flexible
allocation for multiple threads--even better might be the use
of virtual registers which could avoid the move overhead--,
though the 'predicated' results would require separate
storage) would simply be freed; in the predicted less likely
case, any executed instructions in the block (assuming
lazy but not excluded execution) would have their results
moved to the main register file while the unexecuted
instructions might have their destinations updated to the
main register file names, and the later instructions would
simply be re-executed. This would allow the avoidance of
a data dependency stall for predication and would not
require a register map table save (and would be less
expensive than eager execution down both paths).
(Obviously, the 'commit' can happen when the condition is
resolved not requiring that all previous branches be resolved.)

The main use for such a mechanism would seem to be
for simple hammocks, though loop termination prediction
might use such for loops with small bodies. If-then-else
constructs would be difficult to handle (aside: it might be
nice if one knew that all register names written to in either
path were dead [or cleared?] or written to in the other
path--this might not be an uncommon occurrence).

By the way, have there been any proposals for reducing
the register map table saving when few architected
registers are changed? E.g., a strongly predicted
not-taken 'short' hammock branch should not need to
save an entire map but also should not be predicated.
If the earlier assigned name for the architected
destination was included with the instruction payload,
each such instruction could replay as a move instruction
on a misprediction. It is not clear that saving only a
frame of the map would make sense. (If speculation
became extremely deep, it might make sense to have
an L2 set of recovery maps with allocation based on
likelihood and cost, perhaps using compression? [An
L2 return address stack seems even more attractive
since timing is more predictable as does
compression--recursive functions would tend to only
differ in a few LSbits and a stack pointer value
predictor could be used to prefetch return addresses
{something like table-based compression}])

Paul A. Clayton
just a (rambling) technophile

nedbrek

unread,

Oct 22, 2010, 6:50:46 PM10/22/10

"Paul A. Clayton" <paaron...@gmail.com> wrote in message
news:f3ea2ca0-c6c3-4014...@e14g2000yqe.googlegroups.com...

> Would it make sense to assign a less likely but somewhat
> probable block of instructions to a move-on-commit-style
> rename mechanism? In the predicted more likely case,

> the mini-ROB would simply be freed; in the predicted less likely

> case, any executed instructions in the block (assuming
> lazy but not excluded execution) would have their results
> moved to the main register file while the unexecuted
> instructions might have their destinations updated to the
> main register file names, and the later instructions would
> simply be re-executed. This would allow the avoidance of
> a data dependency stall for predication and would not
> require a register map table save (and would be less
> expensive than eager execution down both paths).
> (Obviously, the 'commit' can happen when the condition is
> resolved not requiring that all previous branches be resolved.)

Complicated new ideas are often best illustrated with a short instruction
sequence (annotated with structure updates):
cmp p1 = r1 - r2
p1.ld r3 = [r4]
!p1.add r3 = r5 + r6
ld r7 = [r3]

What happens in this sequence? At what times are instructions executed?
What names are given to the various registers? Once these names are loaded
into the scheduler, how can they be changed?

> By the way, have there been any proposals for reducing
> the register map table saving when few architected
> registers are changed?

The only schemes which require one map state per branch are checkpointed
machines - none of which have been actually built. PRF (as opposed to
RRF+ARF) machines have two maps - one speculative, one architected.

Ned

Paul A. Clayton

unread,

Oct 23, 2010, 10:47:25 AM10/23/10

On Oct 22, 6:50 pm, "nedbrek" <nedb...@yahoo.com> wrote:
> "Paul A. Clayton" <paaronclay...@gmail.com> wrote in messagenews:f3ea2ca0-c6c3-4014...@e14g2000yqe.googlegroups.com...
[snip]

> Complicated new ideas are often best illustrated with a short instruction
> sequence (annotated with structure updates):
> cmp p1 = r1 - r2
> p1.ld r3 = [r4]
> !p1.add r3 = r5 + r6
> ld r7 = [r3]
>
> What happens in this sequence? At what times are instructions executed?
> What names are given to the various registers? Once these names are loaded
> into the scheduler, how can they be changed?

Well, first--and this is certainly a lack of clarity in
my initial post--I was thinking of dynamic predication
where condition prediction and prediction confidence are
already part of the microarchitecture. If a predicated
ISA had predicate predictors, the above code might be
handled something like (curly braces indicate rename):

// predict p1 false, 80% confidence, i.e., high enough
// confidence that avoiding a data dependency would be
// desirable and the likely path should be given priority
// but not so likely that one wishes to use mechanisms
// where a misprediction recovery is expensive

Issue_Queue[x] <-
(ld Unlikely_Result_Buffer[a] = Mem[{r4}]; priority: low)
Issue_Queue[x+1] <-
(add {r3} = {r5} + {r6};
priority: normal)
Issue_Queue[x+2] <-
(ld {r7} = Mem[{r3}]; priority: normal)

issue from Issue_Queue[x+1]
issue from Issue_Queue[x+2]
issue from Issue_Queue[x] // LSU underutilized
. . .
p1 is actually true
{r3} = mov Unlikely_Result_Buffer[a]
replay from Issue_Queue[x+2]

--------------------------------------------------
// if Issue_Queue[x] does not issue before
// p1 is determined to be true

issue from Issue_Queue[x+1]
issue from Issue_Queue[x+2]
. . .
p1 is actually true
Issue_Queue[x+1] <- invalid/free
Issue_Queue[x] <-
(ld {r3} = Mem[{r4}]; priority: normal)
replay from Issue_Queue[x]

----------------------------------------------

// if prediction is correct, act as above
// with this difference

p1 is false
Issue_Queue[x] <- invalid/free
. . .
free Unlikely_Result_Buffer[a] by bumping pointer
// very much like ROB

A lot of necessary actions are not expressed above,
but they should be somewhat apparent. It is not
clear how the replay mechanism would work. My
guess would be that a copy of the decoded/renamed
instructions would be kept in the ROB (or similar
structure) and re-sent to the issue queue--i.e.,
one does not tie up the issue queues for a
relatively unlikely event. Maybe the actual
details would make this insanely complex, but
perhaps you can now perceive my intent. (Rather
than rewriting the destinations of the unissued
instructions in the unlikely path, might it be
easier to provide both destinations with a
source bit indicating which to use???)

> > By the way, have there been any proposals for reducing
> > the register map table saving when few architected
> > registers are changed?
>
> The only schemes which require one map state per branch are checkpointed
> machines - none of which have been actually built. PRF (as opposed to
> RRF+ARF) machines have two maps - one speculative, one architected.

There must be some miscommunication. The MIPS
R10000 (Kenneth C. Yeager, "The MIPS R10000
Superscalar Microprocessor" [1996]) had "a
four-entry branch stack. This contains the alternate
branch address, complete copies of the integer and
floating-point map tables, and miscellaneous control
bits." (This would seem to be extreme overkill for
certain server workloads that do not use FP and which
have a high branch density.)

Paul A. Clayton
just a technophile

nedbrek

unread,

Oct 26, 2010, 9:42:11 PM10/26/10

"Paul A. Clayton" <paaron...@gmail.com> wrote in message

news:2d72aec3-9d50-4d3e...@a36g2000yqc.googlegroups.com...

> On Oct 22, 6:50 pm, "nedbrek" <nedb...@yahoo.com> wrote:
>> cmp p1 = r1 - r2
>> p1.ld r3 = [r4]
>> !p1.add r3 = r5 + r6
>> ld r7 = [r3]
>

> Well, first I was thinking of dynamic predication

> where condition prediction and prediction confidence are
> already part of the microarchitecture. If a predicated
> ISA had predicate predictors, the above code might be
> handled something like (curly braces indicate rename):
>
> // predict p1 false, 80% confidence
>

> Issue_Queue[x] <-
> (ld Unlikely_Result_Buffer[a] = Mem[{r4}]; priority: low)
> Issue_Queue[x+1] <-
> (add {r3} = {r5} + {r6};
> priority: normal)
> Issue_Queue[x+2] <-
> (ld {r7} = Mem[{r3}]; priority: normal)

Ok, so the renamer ignores the "load r3=", and is updated with the preg for
the add... you've dedicated a chunk of pregs for unlikely names.

> issue from Issue_Queue[x+1]
> issue from Issue_Queue[x+2]
> issue from Issue_Queue[x] // LSU underutilized
> . . .
> p1 is actually true
> {r3} = mov Unlikely_Result_Buffer[a]
> replay from Issue_Queue[x+2]

The biggest danger here is that p1 may take a very long time to resolve.
You would like to dealloc the IQ in a timely fashion. Also, how does the IQ
know when to dealloc? Is there a signal from the ROB? Usually, you dealloc
either after schedule (and require replay/reinsertion) or after known good
execute (don't forget a cycle or two to drive that signal back to the
scheduler and get the entry ready for a new insert).

>> The only schemes which require one map state per branch are checkpointed

>> machines - none of which have been actually built. PRF (as opposed to
>> RRF+ARF) machines have two maps - one speculative, one architected.
>
> There must be some miscommunication. The MIPS
> R10000 (Kenneth C. Yeager, "The MIPS R10000
> Superscalar Microprocessor" [1996]) had "a
> four-entry branch stack. This contains the alternate
> branch address, complete copies of the integer and
> floating-point map tables, and miscellaneous control
> bits." (This would seem to be extreme overkill for
> certain server workloads that do not use FP and which
> have a high branch density.)

Interesting. I was never very familiar with the R10k (and I have forgotten
a lot).

The main issue is how to recover from an N+1 speculative branch. That is:
<architected state>
jcc tgt1
insts
jcc tgt2
more insts
<speculative state>

If the second branch resolves before the first, you must construct an
in-between speculative map. A stack of maps allows you to flash copy the
proper one. The two-map solution takes some time for a state machine to
reconstruct the map.

Was the R10k a shallow pipe machine? Perhaps they didn't want to take the
hit on rebuild time?

Ned

Paul A. Clayton

unread,

Oct 28, 2010, 6:24:28 PM10/28/10

On Oct 26, 9:42 pm, "nedbrek" <nedb...@yahoo.com> wrote:
[snip]

> Ok, so the renamer ignores the "load r3=", and is updated with the preg for
> the add... you've dedicated a chunk of pregs for unlikely names.

The idea is to have a design point between branch prediction
(expensive recovery on misprediction) and traditional
predication (introducing a data dependency that can
stall execution). It is certainly not free, but the cost
_might_ be justified. (Perhaps virtual registers would be
useful in this context, delaying allocation to a physical
register until it is actually used. ??)

[snip]

> The biggest danger here is that p1 may take a very long time to resolve.
> You would like to dealloc the IQ in a timely fashion.

Since the unlikely instructions are lower priority, one might
use a less expensive buffer (which might also be used for
instructions dependent on load misses?).

> Also, how does the IQ
> know when to dealloc? Is there a signal from the ROB? Usually, you dealloc
> either after schedule (and require replay/reinsertion) or after known good
> execute (don't forget a cycle or two to drive that signal back to the
> scheduler and get the entry ready for a new insert).

Hey, are you expecting me to actually _examine_ my ideas?! :-)

My guess would be that reinsertion from the ROB would be
used. (One is expecting that the predicted path will be
followed most of the time.)

(In many cases, the predicate will be computed in a somewhat
timely manner [here more traditional predication might have
been a win], the decision in these cases may be easier.)

[snip]

> Interesting. I was never very familiar with the R10k (and I have forgotten
> a lot).
>
> The main issue is how to recover from an N+1 speculative branch. That is:
> <architected state>
> jcc tgt1
> insts
> jcc tgt2
> more insts
> <speculative state>
>
> If the second branch resolves before the first, you must construct an
> in-between speculative map. A stack of maps allows you to flash copy the
> proper one. The two-map solution takes some time for a state machine to
> reconstruct the map.

You mean the second branch resolves to a misprediction?
(I think no action would be required for a correct
prediction.)

Constructing a new map seems messy and time consuming (after
a misprediction, one would want to execute down the correct
path ASAP). (Perhaps virtual registers would be useful.
One could swap in a map of names and gradually retranslate.
Later instructions would have valid dependence-linking names.)
Is the mapping information taken from the ROB or does a
separate structure provide this information? Is the
remapping hardware designed to go forward or backward depending
on the distance? With 100+ instructions in flight, a map
recovery at four renames per cycle could easily take ten
cycles. That seems crazy.

Obviously map copying has scalability issues. (By the way,
could one not use the two maps from an unused thread to
speed branch misprediction recovery in a multithreaded core?)
On the other hand, confidence estimates, path length,
previous map distance, and other factors might be used to
make allocation and free choices.

> Was the R10k a shallow pipe machine? Perhaps they didn't want to take the
> hit on rebuild time?

Yes the R10k had a four and a half stage basic pipeline (Fetch,
Decode|Map, Issue|Read, ALU, Write| ). I do not know if the
reasoning behind this design choice has been published--
probably not.

nedbrek

unread,

Oct 29, 2010, 5:53:15 PM10/29/10

"Paul A. Clayton" <paaron...@gmail.com> wrote in message

news:d4703c18-1690-42b4...@l8g2000yql.googlegroups.com...

On Oct 26, 9:42 pm, "nedbrek" <nedb...@yahoo.com> wrote:

>> Interesting. I was never very familiar with the R10k (and I have
>> forgotten
>> a lot).
>>

>> If the second branch resolves before the first, you must construct an
>> in-between speculative map. A stack of maps allows you to flash copy the
>> proper one. The two-map solution takes some time for a state machine to
>> reconstruct the map.
>
> You mean the second branch resolves to a misprediction?
> (I think no action would be required for a correct
> prediction.)

Correct.

> Is the mapping information taken from the ROB or does a
> separate structure provide this information?

Yes, you use the Arch and Phys Reg info from the ROB.

> Is the remapping hardware designed to go forward or backward depending
> on the distance?

As long as you have strict 1:1 mapping from Arch to Phys, you should be able
walk in either direction. I'm not sure what the implementations settled on.

> With 100+ instructions in flight, a map
> recovery at four renames per cycle could easily take ten
> cycles. That seems crazy.

You have time equal to the number of stages beween IPgen and Rename (plus or
minus some for various drive stages).

Think P4 - 10 stages is no problem :)

> Obviously map copying has scalability issues. (By the way,
> could one not use the two maps from an unused thread to
> speed branch misprediction recovery in a multithreaded core?)
> On the other hand, confidence estimates, path length,
> previous map distance, and other factors might be used to
> make allocation and free choices.

Yes, although I have an even better proposed use :)
http://www.google.com/patents?vid=USPAT7313676

>> Was the R10k a shallow pipe machine? Perhaps they didn't want to take the
>> hit on rebuild time?
>
> Yes the R10k had a four and a half stage basic pipeline (Fetch,
> Decode|Map, Issue|Read, ALU, Write| ). I do not know if the
> reasoning behind this design choice has been published--
> probably not.

Wow, 5 stages (they're leaving a lot of frequency on the table :). Yea, it
makes a lot of sense in that context.

Ned

Paul A. Clayton

unread,

Oct 29, 2010, 9:01:18 PM10/29/10

On Oct 29, 5:53 pm, "nedbrek" <nedb...@yahoo.com> wrote:
[snip]

> As long as you have strict 1:1 mapping from Arch to Phys, you should be able
> walk in either direction. I'm not sure what the implementations settled on.

It could be advantageous to restore from either direction
depending on where the mispredicted branch was in the ROB,
though supporting both directions might not be worth the
complexity.

[snip]

> You have time equal to the number of stages beween IPgen and Rename (plus or
> minus some for various drive stages).
>
> Think P4 - 10 stages is no problem :)

Exclusive of IPgen, I assume (since the processor should know
the address for most branches by the time a misprediction is
detected)--unless IPgen is actually IPissue. Even five stages
to fetch and recognize register identifiers seems like a long
time (for a sane instruction encoding).

[snip]

> Yes, although I have an even better proposed use :)http://www.google.com/patents?vid=USPAT7313676

I guess I will have to look at that and try to translate from
patent language to language of one somewhat familiar with the
art.

[snip]

> Wow, 5 stages (they're leaving a lot of frequency on the table :). Yea, it
> makes a lot of sense in that context.

It is not clear whether using the reads-and-writes-in-different-
phases mechanism was part of the motivation or just a side benefit.
I do seem to recall reading that some architects were really
surprised that the Alpha 21264 could be fast and wide and OoO, so
the MIPS designers might have thought fast was inconsistent with
wide and OoO.

Andy "Krazy" Glew

unread,

Oct 30, 2010, 3:06:33 PM10/30/10

On 10/29/2010 2:53 PM, nedbrek wrote:
> "Paul A. Clayton"<paaron...@gmail.com> wrote in message

>> Is the remapping hardware designed to go forward or backward depending

>> on the distance?
>
> As long as you have strict 1:1 mapping from Arch to Phys, you should be able
> walk in either direction. I'm not sure what the implementations settled on.

Unnecessary restriction: doesn't need to be 1:1.

My original HaRRM design was PRF based, with map checkpoints. The map
could map multiple architectural registers to the same physical - move
elimination, common subexpressions.

Each instruction emitted a map directive of the form
(lreg,oldpreg,newpreg). The list of such directives was stored
sequentially. An at retirement map would process the map directives,
freeing up the oldpreg (if there were no other references), and
established the new mapping.

By the way, you don't need to have both a speculative and a commit map.
You only need one, the speculative map, since if you really wanted to
be slow you could reconstruct the commit map at a fault by applying the
map directives backwards.

For that matter, you can also have several map checkpoints:
speculative, retirement, plus at several different places in the
speculative history - likely mispredict points. You can move the
intermwediate map chweckpoints either incrementally, up and down the
list of map directives, or you can throw one out and copy the
speculative map.

The big simplification from a 1:1 lreg->preg mapping is that you don't
need to reference count or garbage collect: when an lreg is written, at
retirement the preg gets freed. But you give up all sorts of neat
tricks like move elimination.

The list (I now call it a "trace" or "log") of map directives can be
optimized or compressed: multiple writes to the same lreg can be
combined, you can attach a count to pregs being freed up.

0 new messages