
CPU which executes both paths of a branch speculatively


Kli-Kla-Klawitter

unread,
Sep 18, 2020, 11:40:06 PM9/18/20
to
In another forum someone alleged that there are OoO CPUs which execute
both paths of a branch speculatively and drop the path that turns out
to be wrong. Is there really such a CPU, or more specifically: is there
an x86 CPU that works this way?

Anton Ertl

unread,
Sep 19, 2020, 6:55:39 AM9/19/20
to
No such CPU. Branch prediction works very well. And if it did not,
one probably would not build OoO CPUs.

In earlier times, IIRC the first Power CPU (1990) fetched from both
sides of the branch, but that was an in-order CPU.

Augustus K. Uht published papers about "disjoint eager execution"
(DEE) in the 1990s: "eager execution" is following all branches; DEE
means computing the likelihood that a certain path is taken, and
always choosing the most likely extension of the paths up to now.
Sounds promising, but given the prediction accuracies of branch
predictors since the late 1990s, there probably is rarely a difference
between DEE and single-path (always follow the prediction). So I have
not heard anything in that direction for two decades.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Ivan Godard

unread,
Sep 19, 2020, 8:03:15 AM9/19/20
to
The question seems to assume that there actually is a branch instruction
and an architectural choice as to how to proceed from the branch. That
is a genuine design point for dynamic architectures, both IO and OOO.

There is however an alternative approach, followed to some extent in
many architectures: static path merging, a.k.a. if-conversion. The
branch instruction implied by control flow is removed, or rather never
emitted. Instead, code is emitted for both sides of the putative control
flow, and special instructions merge the effects and results into a
single dataflow which reflects the control flow path actually taken. The
special instructions include conditional move and several variations
that in effect implement the "?:" operator of C.
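
A rough C-level sketch of the idea (the names here are only illustrative,
not tied to any particular ISA): code for both arms is emitted
unconditionally, and a single select merges the two results into one
dataflow.

/* Both arms are computed; the "?:" select picks the surviving value.
 * On x86 a compiler doing if-conversion typically emits CMOVcc here. */
int merge_example(int a, int b, int cond)
{
    int then_val = a * b;   /* work from the would-be taken side        */
    int else_val = a + b;   /* work from the would-be fall-through side */
    return cond ? then_val : else_val;   /* the merge; no branch emitted */
}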

If-conversion is most valuable in wide architectures, where the
instructions on both sides can be merged into wide bundles that take no
longer to execute than would either path by itself. The advantage of
if-conversion is that the implied branch instruction doesn't exist
(saving its code space and execution power, and its space in the branch
prediction mechanism) and a mispredict is impossible. The drawback is
the power used in executing instructions from the untaken path. In
modern hardware this excess power cost is increasingly irrelevant due to
leakage power in unused functional units.

Even in wide architectures, if-conversion is constrained in the amount
and kinds of work that can be merged. Instructions that permanently
alter machine state, such as stores, cannot be merged. And there is no
gain from merging paths that have so many instructions that the merge is
significantly longer than either single path plus branch overhead.

X86 does have a conditional move instruction, and so can be said to
"work this way". However, it will do so only if the compiler does
if-conversion as part of optimization.

Stefan Monnier

unread,
Sep 19, 2020, 9:23:53 AM9/19/20
to
Current OoO CPUs routinely have 100 or more instructions "in flight",
and given the usual proportion of branches, you can assume that these
instructions include a fair number of branches.

Let's assume a branch every 10 instructions. With the kind of eager
execution you describe, after 10 instructions we start fetching both
paths, so after 10 instructions on both paths we now have 10 + 2 * 10 = 30
instructions in flight. After the next branches we'll have
10 + 2 * (10 + 2 * 10) = 70 instructions in flight, of which
presumably only 30 will be kept. At the next level, we get 150
instructions in flight, of which only 40 will be kept.
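
A few lines of C make the arithmetic explicit (assuming, as above, one
branch every 10 instructions and that both paths fork again at every
branch):

#include <stdio.h>

int main(void)
{
    const int block = 10;      /* instructions between branches (assumed) */
    long in_flight = block;    /* everything fetched so far               */
    long kept = block;         /* instructions on the one correct path    */

    for (int level = 1; level <= 3; level++) {
        in_flight = block + 2 * in_flight;   /* both paths fork again      */
        kept += block;                       /* only one path ever retires */
        printf("level %d: %ld in flight, %ld kept\n", level, in_flight, kept);
    }
    return 0;   /* prints 30/20, then 70/30, then 150/40 */
}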

This doesn't scale well. In theory, branches don't only diverge, they
also converge, so it should be possible to do better by noticing that
some of the N paths we're executing are the same, but it adds yet
more complexity.

So instead efforts have been put into improving branch prediction so
that most of those 100 instructions aren't thrown away, and the
"execution down both paths" is only performed by way of having the
compiler do if-conversion in those places where it seems beneficial.


Stefan

John Dallman

unread,
Sep 19, 2020, 9:39:45 AM9/19/20
to
In article <rk3uij$14pu$1...@gioia.aioe.org>, kliklakl...@gmail.com wrote:
There have been: I encountered a bug with doing this on the AMD K6-2,
which was released in 1998.

What happened was that the compiler was generating an x87 reverse divide
instruction, which is rarely used. There was a test for division by zero
in the source guarding the divide. The divide was speculatively executed
and the divide-by-zero trap was noted and stored until instruction retire
time. The guard test detected the divide-by-zero condition, so the
reverse divide was dropped, but the processor failed to clean up properly
and halted. As in really halted, the whole thing, which made debugging
slow work. This was AMD's explanation after we gave them the reproduction
case.

It was worked around with an AMD-written device driver that twiddled the
speculation settings. AMD were sure that it would wreck performance, but
we were unable to detect any difference. The Microsoft compiler of the
period didn't use reverse divides /often/, but there was no way to
prevent it happening.

John

John Levine

unread,
Sep 19, 2020, 11:48:12 AM9/19/20
to
In article <rk3uij$14pu$1...@gioia.aioe.org>,
VLIW architectures can do speculative execution through a chain of
branches and then undo or discard results for paths not taken. Look up
Trace Scheduling.

Back in the 1960s, the 360/91 fetched the instructions from both
paths. It didn't do speculative execution but, lacking a cache, that
decreased pipe refill delays.

See http://american.cs.ucdavis.edu/academic/readings/papers/ibm360-91.pdf

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

unread,
Sep 19, 2020, 1:14:56 PM9/19/20
to
On Saturday, September 19, 2020 at 8:23:53 AM UTC-5, Stefan Monnier wrote:
> > In another forum someone alleged that there are OoO-CPUs which execute
> > both paths of a branch speculatively and drop the path that turns out
> > to be wrong. Is there really such a CPU, or more: is there a x86-CPU
> > that works this way ?
>
> Current OoO CPUs routinely have 100 or more instructions "in flight",
> and given the usual proportion of branches, you can assume that these
> instructions include a fair number of branches.
>
> Let's assume a branch every 10 instructions. With the kind of eager
> execution you describe, after 10 instructions, we start fetching both
> paths, so after 10 instructions on both paths we now have 10 + 2 * 10 = 30
> instructions in flight after the next branches we'll have
> 10 + 2 * (10 + 2 * 10) = 70 instructions in flight. Of those
> presumably only 30 will be kept. At the next level, we get 150
> instructions in flight of which only 40 will be kept.

I agree with the logic presented above. I disagree with the branch density
presented above.

In a machine without predicates, branches account for 19.3% of instructions
(roughly 1 in 5). Conditional/unconditional branches are 90% of
that 19.3%, CALL/RET is 4.4%, while Switch (JMP) is 2.2%.

Of the conditional branches: the 24% with backward displacements are taken
roughly 90% of the time, while the 67.5% with forward displacements are
taken 45.6% of the time.

So, one is "taking" a branch once every 8.53 instructions, where "taking" means
the Instruction Pointer is modified in a non-sequential manner.
>
> This doesn't scale well. In theory, branches don't only diverge, they
> also converge, so it should be possible to do better by noticing that
> some of the N paths we're executing are the same, but it adds yet
> more complexity.

80%+ of unconditional branches lead to a convergence point.
>
> So instead efforts have been put into improving branch prediction so
> that most of those 100 instructions aren't thrown away, and the
> "execution down both paths" is only performed by way of having the
> compiler do if-conversion in those places where it seems beneficial.

If branch prediction had remained at 1995 levels (90% accuracy) machines
with execution windows greater than 90-odd instructions would not have been
built.
>
>
> Stefan

Ivan Godard

unread,
Sep 19, 2020, 1:31:47 PM9/19/20
to
On 9/19/2020 10:14 AM, MitchAlsup wrote:

>
> If branch prediction had remained at 1995 levels (90% accuracy) machines
> with execution windows greater than 90-odd instructions would not have been
> built.

And with LLC the size of present chips they shouldn't be building them
now, except for specialized apps where the working set doesn't fit in
the LLC. JMO, YMMV.

Bonita Montero

unread,
Sep 19, 2020, 2:00:21 PM9/19/20
to
> In a machine without predicates, a branch is seen 19.3% of the instructions
> (roughly 1 in 5 instructions). Conditional/unconditional branches are 90% of
> that 19.3%, CALL/RET is 4.4%, while Switch (JMP) is 2.2%.
> Of the conditional branches:: those with backwards displacements 24% are taken
> roughly 90% of the time, while those with forward displacements 67.5% are
> taken 45.6% of the time.
...
> 80%+ of unconditional branches lead to a convergence point.
...
> If branch prediction had remained at 1995 levels (90% accuracy) machines
> with execution windows greater than 90-odd instructions would not have been
> built.


Sorry, there are no such fixed numbers. These numbers are surely an
average from someone who has analyzed a number of applications.
I think the portion of jumps is very different for different types
of code. Quoting them down to decimal fractions makes the whole thing
even more ridiculous.

Stefan Monnier

unread,
Sep 19, 2020, 2:16:46 PM9/19/20
to
> I agree with the logic presented above.

Great!

> I disagree with the branch density presented above.

That's OK, I don't either. It was just a somewhat arbitrary number to
illustrate the point. I wanted it to be round and I hoped it would be
within a factor of 2 of a non-negligible chunk of real cases.

> So, one is "taking" a branch once every 8.53 instructions.

Great, that's pretty close to 10.

> Where "taking" means the Instruction Pointer is modified in
> a non-sequential manner.

Of course for the eager execution issue, we'd care more about the
branches that are conditional (regardless of whether we actually jump or
not), but I expect it's still within the same ballpark (somewhere between
5 and 20), right?

>> This doesn't scale well. In theory, branches don't only diverge, they
>> also converge, so it should be possible to do better by noticing that
>> some of the N paths we're executing are the same, but it adds yet
>> more complexity.
> 80%+ of unconditional branches lead to a convergence point.

So, assuming the hardware is somehow able to discover and reconcile
convergent paths, how many distinct paths would be executed at the same
time on average when we have say 200 instructions in flight?


Stefan

MitchAlsup

unread,
Sep 19, 2020, 2:35:34 PM9/19/20
to
With 8.5 I/T you get 200/8.5 = 23.52, which would round up to 24 or 32.
With a 95%-accurate branch predictor: 0.95^24 = 29%. That is, when the window
is full, on average only 29% (between 1/4 and 1/3) of the instructions
inserted into execution will retire.

This looks like an excellent way to waste power.

You are ALSO having to make 2 branch predictions per cycle, but on the
good side, you can issue 8.5 I/C when you have 1 non-taken branch and
1 taken branch. You are probably going to have to be fetching from several
paths simultaneously in order to have instructions on the predicted paths
ready to enter execution.
>
>
> Stefan

MitchAlsup

unread,
Sep 19, 2020, 4:29:54 PM9/19/20
to
It is not just the size of the LLC, it is also the number of cycles to access it:

Core i7 Xeon 5500 Series Data Source Latency (approximate) [Pg. 22]

local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )

In a server arrangement about 1/2 (50%) of the LLC accesses will be in the
shared or modified states and around 2/3rds will be in a remote LLC. Crunching
the above data gives an access time of 84 cycles for the average L3 access.
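
One plausible way to crunch those numbers (the exact weights below are
assumptions on my part: remote hits at the low end of the quoted range,
and the local shared/modified half split evenly between the two states):

#include <stdio.h>

int main(void)
{
    double local_unshared = 40.0;   /* cycles, from the table above       */
    double local_shared   = 65.0;
    double local_modified = 75.0;
    double remote         = 100.0;  /* low end of the 100-300 cycle range */

    double local = 0.50 * local_unshared
                 + 0.25 * local_shared
                 + 0.25 * local_modified;            /* ~55 cycles */
    double avg = (1.0 / 3.0) * local + (2.0 / 3.0) * remote;

    printf("average L3 hit: about %.0f cycles\n", avg);  /* ~85, close to the 84 above */
    return 0;
}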

So even if you get a hit, you still have to be able to absorb 80-odd cycles of
latency; and if you take an LLC miss you eat another 300-400 cycles (which
cannot be absorbed*).

{Not to mention the TLB misses which only exacerbate the latency.}
{Nor have we taken into account any interference between CPUs, threads, or
buses along the paths.}

(*) which is why almost no one is doing vector machines of CRAY style these
days.

Anton Ertl

unread,
Sep 19, 2020, 5:18:03 PM9/19/20
to
MitchAlsup <Mitch...@aol.com> writes:
>On Saturday, September 19, 2020 at 1:16:46 PM UTC-5, Stefan Monnier wrote:
>> So, assuming the hardware is somehow able to discover and reconcile
>> convergent paths, how many distinct paths would be executed at the same
>> time on average when we have say 200 instructions in flight?
>
>With 8.5 I/T you get 200/8.5 = 23.52 which would round up to 24 or 32.
>With a 95%-ile Branch predictor:: .95^24 = 29% That is, when the window
>is full, on average, for each set of instructions inserted into
>execution, 29% of the (between 1/3 and 1/4) will retire.
>
>This looks like a excellent way to waste power.

Now look at a prediction accuracy from this century:

0.98^24=0.62
0.99^24=0.79

So Intel has now extended the instruction window to 352 instructions;
if we assume 8.5 inst/prediction, that's 41 branches:

0.98^41=0.44
0.99^41=0.66
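
The arithmetic, for anyone who wants to play with other window sizes or
accuracies (assuming independent predictions and ~8.5 instructions per
prediction, as above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double acc[] = { 0.90, 0.95, 0.98, 0.99 };
    const int branches[] = { 24, 41 };    /* ~200/8.5 and ~352/8.5 */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 2; j++)
            printf("accuracy %.2f, %2d branches in flight: %.2f of the window useful\n",
                   acc[i], branches[j], pow(acc[i], branches[j]));
    return 0;   /* reproduces the 0.62/0.79 and 0.44/0.66 figures above */
}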

MitchAlsup

unread,
Sep 19, 2020, 5:27:16 PM9/19/20
to
On Saturday, September 19, 2020 at 4:18:03 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >On Saturday, September 19, 2020 at 1:16:46 PM UTC-5, Stefan Monnier wrote:
> >> So, assuming the hardware is somehow able to discover and reconcile
> >> convergent paths, how many distinct paths would be executed at the same
> >> time on average when we have say 200 instructions in flight?
> >
> >With 8.5 I/T you get 200/8.5 = 23.52 which would round up to 24 or 32.
> >With a 95%-ile Branch predictor:: .95^24 = 29% That is, when the window
> >is full, on average, for each set of instructions inserted into
> >execution, 29% of the (between 1/3 and 1/4) will retire.
> >
> >This looks like a excellent way to waste power.
>
> Now look at a prediction accuracy from this century:
>
> 0.98^24=0.62
> 0.99^24=0.79
>
> So Intel has now extended the instruction window to 352 instructions;
> if we assume 8.5inst/prediction, that's 41 branches:
>
> 0.98^41=0.44
> 0.99^41=0.66

Which, if you think about it, is about right--you want at least 1/2 of the
work injected into execution to deliver something useful.

Or, looking at it the other way: you still throw away 1/2 of what you
injected!

Brett

unread,
Sep 19, 2020, 11:58:59 PM9/19/20
to
MitchAlsup <Mitch...@aol.com> wrote:
> On Saturday, September 19, 2020 at 8:23:53 AM UTC-5, Stefan Monnier wrote:
>>> In another forum someone alleged that there are OoO-CPUs which execute
>>> both paths of a branch speculatively and drop the path that turns out
>>> to be wrong. Is there really such a CPU, or more: is there a x86-CPU
>>> that works this way ?
>>
>> Current OoO CPUs routinely have 100 or more instructions "in flight",
>> and given the usual proportion of branches, you can assume that these
>> instructions include a fair number of branches.
>>
>> Let's assume a branch every 10 instructions. With the kind of eager
>> execution you describe, after 10 instructions, we start fetching both
>> paths, so after 10 instructions on both paths we now have 10 + 2 * 10 = 30
>> instructions in flight after the next branches we'll have
>> 10 + 2 * (10 + 2 * 10) = 70 instructions in flight. Of those
>> presumably only 30 will be kept. At the next level, we get 150
>> instructions in flight of which only 40 will be kept.
>
> I agree with the logic presented above. I disagree with the branch density
> presented above.
>
> In a machine without predicates, a branch is seen 19.3% of the instructions
> (roughly 1 in 5 instructions). Conditional/unconditional branches are 90% of
> that 19.3%, CALL/RET is 4.4%, while Switch (JMP) is 2.2%.
>
> Of the conditional branches:: those with backwards displacements 24% are taken
> roughly 90% of the time, while those with forward displacements 67.5% are
> taken 45.6% of the time.

These are the important ones to speculate both ways, short forward
branches.

You have lots of X=Y; if (Something) {X+=1} Store X;

In this case you have two X names and you need to decide which to store,
and then expire the loser. You turn short branches into predication.

You have a small rename cost and it is a power win when you look at the
branch fails, and also a performance win. It is only a matter of time until
CPUs do this, right?
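
In C terms, the transformation looks roughly like this (illustrative only;
the two names are what rename would create in hardware):

#include <stdbool.h>

void store_branchy(int *x, int y, bool something)
{
    int t = y;
    if (something)
        t += 1;                /* short forward branch around one add */
    *x = t;
}

void store_predicated(int *x, int y, bool something)
{
    int t_fallthru = y;        /* "X" renamed for the not-taken path */
    int t_taken    = y + 1;    /* "X" renamed for the taken path     */
    *x = something ? t_taken : t_fallthru;   /* keep one, expire the loser */
}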

Paul A. Clayton

unread,
Sep 20, 2020, 10:36:04 AM9/20/20
to
IBM's POWER7 used dynamic hammock predication for single-instruction
forward branches. If a branch prediction was deemed low confidence, the
single instruction was converted into a predicated instruction. This is not
the same as general hammock predication (given the limit of one instruction),
much less general eager execution.

(Dynamic predication has potential advantages over predication (without
predicate prediction), when the condition/branch is predictable, of removing
a data dependency and eliminating excess instructions. The instruction
cache can even be designed (like a trace cache) to increase spatial locality
when a predictable path is very temporally stable while allowing a change
to a different stable path or semi-stable/moderately predictable paths —
where less specializing would be used — or a highly unpredictable path —
where predication might be used. Of course, branchification of predicated
code would also be possible, but that transformation path seems more
complex. The use of predication for timing independence also makes
predicate prediction potentially problematic.)

MitchAlsup

unread,
Sep 20, 2020, 12:22:10 PM9/20/20
to
On Saturday, September 19, 2020 at 10:58:59 PM UTC-5, Brett wrote:
A good implementation of Predication covers this without any branching.

CMP Rt,Rsomething,Rk
PLT Rt,{ET}
MOV Rx,Ry
ADD Rx,Ry,#1
ST Rx,[somewhere]
>
> In this case you have two X names and you need to decide which to store,
> and then expire the loser. You turn short branches into predication.

As seen above. Good predication can cover more than a handful of instructions
and accommodate simple && and || strings.
>
> You have a small rename cost and it is a power win when you look at the
> branch fails, and also a performance win. It is only a matter of time until
> CPU’s do this right?

I think My 66000 does this right.

EricP

unread,
Sep 20, 2020, 1:02:47 PM9/20/20
to
(Google keeps insisting that "predication" is misspelt "prediction"
and redirecting my searches.)

There is something IBM calls "bc+8 conversion to predication" in POWER8.
In Power ISA a branch offset is relative to the branch instruction
so bc+8 means conditional branch over 1 instruction.

In POWER8 Processor User’s Manual for the Single-Chip Module, 2016

10.1.4.7 BC+8 Handling

In addition to the normal branch handling, bc instructions whose BO field
indicates only a CR dependency with a displacement of +8 are handled as a
special case. Such bc+8 instructions are marked as potential candidates
for conversion to predication. When the next instruction after a bc+8
is in the list of allowed to be predicated (see Table 10-3),
the bc+8 is marked as first for group formation processing.
Finally, if the branch prediction for the bc+8 instruction is predicted
not taken, the instruction pair is treated similar to an isel,
in that the branch cannot be mispredicted and the next instruction
is conditionally executed based on the bc outcome. Any time a bc+8
instruction is executed, the branch prediction algorithm writes a
value assuming the branch was not taken. This encourages all bc+8
instructions to be converted to predication. Use of a NOP can be used
to prevent bc+8 conversions in specific cases (making the bc+8 a bc+12).

Table 10-3. bc+8 Pairable Instructions
Assembler Mnemonic
addi addis add and or xor ori stb sth stw std

When a bc+8 instruction pair is converted to predication, the destination
register now is pending until the predication is resolved. This can mean
that subsequent dependent instructions are prevented from issuing.
The bc+8 conversions must be avoided for the following circumstances:
1. If the branch is expected to be very predictable, the branch
prediction hardware outperforms predication.
2. If the pairable operation is the start of a dependency chain
involving loads, sometimes it is more beneficial to have the
loads start execution even if the branch eventually flushes.
Predication delays dependent operations from executing.
Therefore, if a long latency load is delayed, the performance
is better served by allowing the branch to be predicted.
This is especially true when the resultant memory access can
be the same cache line in either outcome. Branch misprediction
can be hidden by the long latency memory reference while
the dependency delay due to predication cannot be hidden.
Heap sort algorithms show this effect.



Anton Ertl

unread,
Sep 20, 2020, 1:45:51 PM9/20/20
to
MitchAlsup <Mitch...@aol.com> writes:
>Which, if you think about it is about right--you want at least 1/2 of the
>work injected into execution to deliver something useful.

Well, of course it's better if more of the work is useful, but I doubt
that the 1/2 thing is the decisive criterion, although it does play a
role. My guess is that it's more important how big they can build the
ROB (or similar structure) and physical register files without
incurring an extra cycle or longer cycle times; or if they need an
extra cycle, how much IPC benefit the larger structure affords. And
eventually, whether there is a performance benefit while staying in
the same power envelope, and whether that benefit is worth the area
cost. How these criteria affect the useful-work-ratio limit is
unclear to me.

MitchAlsup

unread,
Sep 20, 2020, 2:18:17 PM9/20/20
to
On Sunday, September 20, 2020 at 12:45:51 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >Which, if you think about it is about right--you want at least 1/2 of the
> >work injected into execution to deliver something useful.
>
> Well, of course it's better if more of the work is useful, but I doubt
> that the 1/2 thing is the decisive criterion, although it does play a
> role.

It (1/2) is not decisive, but it is where the slope becomes slippery.

> My guess is that it's more important how big they can build the
> ROB (or similar structure) and physical register files without
> incurring an extra cycle or longer cycle times; or if they need an
> extra cycle, how much IPC benefit the larger structure affords. And
> eventually, whether there is a performance benefit while staying in
> the same power envelope, and whether that benefit is worth the area
> cost.

As these structures get bigger they increase their power consumption.

When I looked at this in the early 1990s, integer applications saturated a
6-wide × 16-deep window (96 instructions in flight), capturing all of the IPC that
was inherent in the compiled code (SPEC89); but FP applications did not
saturate until about 6×24, with a few stragglers at 6×32.

Keep in mind that the D$ was only 16KB direct mapped and main memory was
as close as 5 cycles away, and we had 0 cycle branch repair {If a conditional
branch executes in cycle k we can repair the machine state and start issuing
instructions in cycle k+1 on the alternate path--we fetched the alternate
path 1/2 cycle after fetching the packet containing the conditional branch.}

Brett

unread,
Sep 20, 2020, 5:31:55 PM9/20/20
to
Other posts in this thread say POWER7 & 8 turn one-instruction branches
with poor prediction into predication. And you can do the opposite and turn
well-predicted predication into branching to reduce critical path length.

Programmers and compilers are generally incapable of making the right
choice between predication and prediction, especially as that can change
per CPU generation and code is almost never recompiled. The CPU needs to
make the final choice based on history.

Ivan Godard

unread,
Sep 20, 2020, 5:44:59 PM9/20/20
to
True only for architecture families that follow the antiquated model of
binary compatibility. This is no longer true for dynamic systems,
whether using JITs, binary translation, or install-time code generation.

Apple clearly understands this.

Brett

unread,
Sep 21, 2020, 2:46:00 AM9/21/20
to
Yes, Apple clearly has a big bet on bitcode as a better alternative to
emulation on architecture changes, big or small. And Apple can detect code
crashes and switch, or patch.


John Levine

unread,
Sep 21, 2020, 1:39:16 PM9/21/20
to
In article <rk9i74$oh7$1...@dont-email.me>, Brett <gg...@yahoo.com> wrote:
>> True only for architecture families that follow the antiquated model of
>> binary compatibility. This is no longer true for dynamic systems,
>> whether using JITs, binary translation, or install-time code generation.
>>
>> Apple clearly understands this.
>
>Yes, Apple clearly has a big bet on bit code as a better alternative to
>emulation on architecture changes, big or small. And Apple can detect code
>crashes and switch, or patch.

They already switched from M68K to POWER to x86, that last move
requiring big- to little-endian changes. X86 to ARM is easy by
comparison.

Stefan Monnier

unread,
Sep 21, 2020, 3:26:42 PM9/21/20
to
>>> True only for architecture families that follow the antiquated model of
>>> binary compatibility. This is no longer true for dynamic systems,
>>> whether using JITs, binary translation, or install-time code generation.
>>> Apple clearly understands this.
>>Yes, Apple clearly has a big bet on bit code as a better alternative to
>>emulation on architecture changes, big or small. And Apple can detect code
>>crashes and switch, or patch.
> They already switched from M68K to POWER to x86, that last move
> requiring big- to little-endian changes. X86 to ARM is easy by
> comparison.

I can see two interpretations:
- Having had the experience, they know first-hand that recompiling is
a lot cheaper than writing a reliable emulator. And since they now
have 100% control over the software distribution they *can* get away
with recompilation this time.
- Contrary to the previous switches, a switch from X86 to ARM won't be
able to rely on the new CPUs being 2x (or more) faster than the
old ones to hide the cost of emulation.


-- Stefan

John Levine

unread,
Sep 21, 2020, 5:20:27 PM9/21/20
to
In article <jwvy2l30wc3.fsf-...@gnu.org>,
Stefan Monnier <mon...@iro.umontreal.ca> wrote:
>I can see two interpretations:
>- Having had the experience, they know first-hand that recompiling is
> a lot cheaper than writing a reliable emulator. And since they now
> have 100% control over the software distribution they *can* get away
> with recompilation this time.

No, you can install your own stuff on MacOS and it is quite common to
do so, e.g. python3 is not in the base system. There are at least two
popular port management packages, Macports and homebrew. On the other
hand, I expect them to keep up and be able to compile for both
architectures by the time it matters.

>- Contrary to the previous switches, a switch from X86 to ARM won't be
> able to rely on the new CPUs being 2x (or more) faster than the
> old ones to hide the cost of emulation.

That's true, but I think they'll do what they've done before, fat
binaries with code for both CPUs, and emulation.

Since the data formats on x86 and Arm are mostly the same, recompiling
will be straightforward other than corner cases like floating point
exception handling. Compatible data formats should also make it
possible to do JIT translation of larger chunks of object code so the
emulators will be fairly fast.

R's,
John

Niels Jørgen Kruse

unread,
Sep 22, 2020, 3:33:38 PM9/22/20
to
John Levine <jo...@taugh.com> wrote:

> In article <jwvy2l30wc3.fsf-...@gnu.org>,
> Stefan Monnier <mon...@iro.umontreal.ca> wrote:
> >I can see two interpretations:
> >- Having had the experience, they know first-hand that recompiling is
> > a lot cheaper than writing a reliable emulator. And since they now
> > have 100% control over the software distribution they *can* get away
> > with recompilation this time.
>
> No, you can install your own stuff on MacOS and it is quite common to
> do so, e.g. python3 is not in the base system. There are at least two
> popular port management packages, Macports and homebrew. On the other
> hand, I expect them to keep up and be able to compile for both
> architectures by the time it matters.
>
> >- Contrary to the previous switches, a switch from X86 to ARM won't be
> > able to rely on the new CPUs being 2x (or more) faster than the
> > old ones to hide the cost of emulation.
>
> That's true, but I think they'll do what they've done before, fat
> binaries with code for both CPUs, and emulation.
>
> Since the data formats on x86 and Arm are mostly the same, recompiling
> will be straightforward other than corner cases like floating point
> exception handling. Compatible data formats should also make it
> possible to do JIT translation of larger chunks of object code so the
> emulators will be fairly fast.

A relevant thread on RWT:
<https://www.realworldtech.com/forum/?threadid=192929>

--
Mvh./Regards, Niels Jørgen Kruse, Denmark

Bonita Montero

unread,
Sep 23, 2020, 10:49:53 AM9/23/20
to
> There is however an alternative approach, followed to some extent in
> many architectures: static path merging, a.k.a. if-conversion. The
> branch instruction implied by control flow is removed, or rather never
> emitted. Instead, code is emitted for both sides of the putative control
> flow, and special instructions merge the effects and results into a
> single dataflow which reflects the control flow path actually taken.
> The special instructions include conditional move and several variations
> that in effect implement the "?:" operator of C.

And which architecture is designed in that way ?
For me that makes no sense because the whole thing could be done by
the compiler.

Bonita Montero

unread,
Sep 23, 2020, 10:52:35 AM9/23/20
to
> What happened was that the compiler was generating an x87 reverse divide
> instruction, which is rarely used. There was a test for division by zero
> in the source guarding the divide. The divide was speculatively executed
> and the divide-by-zero trap was noted and stored until instruction retire
> time. The guard test detected the divide-by-zero condition, so the
> reverse divide was dropped, but the processor failed to clean up properly
> and halted. As in really halted, the whole thing, which made debugging
> slow work. This was AMD's explanation after we gave them the reproduction
> case.

That's an example of a processor speculatively executing an operation
and not correctly dropping the result. And not what I asked for.


MitchAlsup

unread,
Sep 23, 2020, 12:34:38 PM9/23/20
to
On Wednesday, September 23, 2020 at 9:49:53 AM UTC-5, Bonita Montero wrote:
> > There is however an alternative approach, followed to some extent in
> > many architectures: static path merging, a.k.a. if-conversion. The
> > branch instruction implied by control flow is removed, or rather never
> > emitted. Instead, code is emitted for both sides of the putative control
> > flow, and special instructions merge the effects and results into a
> > single dataflow which reflects the control flow path actually taken.
> > The special instructions include conditional move and several variations
> > that in effect implement the "?:" operator of C.
>
> And which architecture is designed in that way ?

Mill

Bonita Montero

unread,
Sep 23, 2020, 2:04:44 PM9/23/20
to
>> And which architecture is designed in that way ?

> Mill

So actually no CPU.

MitchAlsup

unread,
Sep 23, 2020, 2:41:48 PM9/23/20
to
There remains hope.

Ivan Godard

unread,
Sep 23, 2020, 4:11:37 PM9/23/20
to

Bonita Montero

unread,
Sep 24, 2020, 3:02:51 AM9/24/20
to
>> And which architecture is designed in that way ?
>> For me that makes no sense because the whole thing could be done by
>> the compiler.

> https://developer.arm.com/architectures/learn-the-architecture/armv8-a-instruction-set-architecture/program-flow-conditional-select-instructions

That's a kind of conditional move and not what I was talking about.

MitchAlsup

unread,
Sep 24, 2020, 12:50:57 PM9/24/20
to
But a conditional move allows you to execute down both paths and then later
select which path should have been run !!!

Kli-Kla-Klawitter

unread,
Sep 24, 2020, 1:18:23 PM9/24/20
to
>> That's a kind of conditional move and not what I was talking about.

> But a conditional move allows you to execute down both paths and then later
> select which path should have been run !!!

What I was thinking about was the execution of multiple distinct
instructions.

Ivan Godard

unread,
Sep 24, 2020, 1:31:40 PM9/24/20
to
exactly: execute the instructions on both paths, and then select the
actual result at the end.

The dual-path instructions must be idempotent w/r/t machine state: no
exceptions, stores, or control flow. But it's quite common for a basic
block to do nothing but compute a result.

You can even design the architecture so that a larger fraction of
instructions are idempotent; Mill's metadata is an example, as it lets
FP instructions (which would otherwise be prohibited because they alter
the flags) or instructions that can overflow (which would throw a fault)
be used. Alternatively, if you are an OOO, then you can have SELECT cause
the entire taken path to retire, exceptions and all.

MitchAlsup

unread,
Sep 24, 2020, 2:23:20 PM9/24/20
to
On Thursday, September 24, 2020 at 12:31:40 PM UTC-5, Ivan Godard wrote:
> On 9/24/2020 10:18 AM, Kli-Kla-Klawitter wrote:
> >>> That's a kind of conditional move and not what I was talking about.
> >
> >> But a conditional move allows you to execute down both paths and then
> >> later
> >> select which path should have been run !!!
> >
> > What I was thinking about was the execution of multiple distinct
> > instructions.
>
> exactly: execute the instructions on both paths, and then select the
> actual result at the end.
>
> The dual-path instructions must be idempotent w/r/t machine state: no
> exceptions, stores, or control flow. But it's quite common for a basic
> block to do nothing but compute a result.
>
> You even can design the architecture so that a larger fraction of
> instructions are idempotent; Mill's metadata is an example, as it lets
> FP instruction (which would otherwise be prohibited because they alter
> the flags) or instructions that can overflow (which would throw a fault)
> be used.

Since programs rarely look at the flags, if you can detect that nobody is
looking, then you can allow those FP calculations in the dual path.

Kli-Kla-Klawitter

unread,
Sep 25, 2020, 12:59:06 AM9/25/20
to
> The dual-path instructions must be idempotent w/r/t machine state: no
> exceptions, stores, or control flow. But it's quite common for a basic
> block to do nothing but compute a result.

These "dual-path instructions" are single instructions.

Ivan Godard

unread,
Sep 25, 2020, 3:21:58 AM9/25/20
to
Terminology issue: the instructions are each individual single
instructions, but they are drawn from two or more paths and then
interleaved. They must be idempotent because eventually all instructions
from all paths will be executed before the correct path is resolved.

Execution would not reflect program semantics if an instruction drawn
from a not-taken path caused a change to the machine state that was
visible after the path was resolved, so such changes must either be held
back until the path is resolved, or rolled back after resolution.
Generally OOO architectures do rollback and IO (especially static IO)
ones hold changes or avoid instructions that would make them, but many
use a mix of methods and the details vary widely.

Kli-Kla-Klawitter

unread,
Sep 25, 2020, 3:23:15 AM9/25/20
to
> Terminology issue: the instructions are each individual single
> instructions, but they are drawn from two or more paths and then
> interleaved. They must be idempotent because eventually all instructions
> from all paths will be executed before the correct path is resolved.

Sorry, I asked for execution of multiple code-paths and not for
a variant of a conditional move.

Ivan Godard

unread,
Sep 25, 2020, 3:42:46 AM9/25/20
to
Are you looking for micro-threads? That is, are you expecting to spawn a
separate thread (on a separate core) for each path after the branch?

Kli-Kla-Klawitter

unread,
Sep 25, 2020, 9:33:22 AM9/25/20
to
>> Sorry, I asked for execution of multiple code-paths and not for
>> a variant of a conditional move.

> Are you looking for micro-threads? That is, are you expecting to spawn
> a separate thread (on a separate core) for each path after the branch?

No, I've explained what I asked for.

EricP

unread,
Sep 25, 2020, 9:36:52 AM9/25/20
to
Try this as a starting point
(I haven't read it but the abstract sounds on-topic)
See Google for citations and references.

Multipath execution: opportunities and limits, 1998
https://collaborate.princeton.edu/en/publications/multipath-execution-opportunities-and-limits
http://mrmgroup.cs.princeton.edu/papers/ics_multipath.pdf

There is also

Threaded multiple path execution, 1998
https://ieeexplore.ieee.org/abstract/document/694778/
https://core.ac.uk/download/pdf/185236193.pdf

You might also look for papers on Itanium predication.

The impact of if-conversion and branch prediction on
program execution on the intel itanium processor, 2001
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.4114&rep=rep1&type=pdf

There are also tons of stuff on tangentially related topics
like "strands", "nano threads", "decoupled architectures"
which seem concerned mostly with cache misses rather than
speculating down alternate paths.


Kli-Kla-Klawitter

unread,
Sep 25, 2020, 9:48:16 AM9/25/20
to
> There are also tons of stuff on tangentially related topics
> like "strands", "nano threads", "decoupled architectures"
> which seem concerned mostly with cache misses rather than
> speculating down alternate paths.

But what I asked for is if there is actually a OoO-CPU that executes
both parts beyond a branch speculatively.

Ivan Godard

unread,
Sep 25, 2020, 11:24:46 AM9/25/20
to
And the answer is yes - but the way they do that you reject as being
"just a conditional move". So you must mean something else by "both
parts speculatively", and it's not clear what you are asking.

Perhaps you could describe how the machine you have in mind would work,
and we can tell you whether it's been done?

Stephen Fuld

unread,
Sep 25, 2020, 11:32:44 AM9/25/20
to
While the if conversion/conditional move/select/pick can sometimes be
the logical equivalent, I believe he is asking about an actual
conditional branch instruction (not even predication), where there are
two possible flows of control, taken and not taken, and the CPU
conditionally executes both paths before the conditional branch is
resolved and you know which path is correct.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP

unread,
Sep 25, 2020, 12:02:08 PM9/25/20
to
For current commercial cpus, as far as I know the answer is no
because of the resource cost. There have been some research machines.

In searching the references of those papers I mentioned, I found
a reference to a chapter in the book
"Speculative Execution in High Performance Computer Architectures" 2005

chapter 6 titled "Multipath Execution"
http://www.ece.neu.edu/groups/nucar/CRCBook/Drafts/gus.pdf

which discusses the various different approaches (there are many)
and their problems. It also lists the following as examples:

Hardware: Classically-Based
IBM 360/91,
Y-Pipe machine,
Polypath machine,
PrincePath Princeton multiPath

Non-classical hardware:
Condel-2 -based static DEE tree
Levo - Current DEE
ABT Adaptive Branch Trees machine
Dynamic Conditional Execution (DCE)

For Chip MultiProcessors (CMP) there are fine-grain and coarse-grain approaches.

The paper contains citations to the above that you can chase with Google.


MitchAlsup

unread,
Sep 25, 2020, 12:24:17 PM9/25/20
to
The Mc 88120 would fetch and issue instructions on the predicted path and
it would fetch and store in a buffer instructions on the alternate path.
If a branch was mispredicted the instructions on the alternate path were
available for issue the cycle after the branch decided it was mispredicted.

So, we did fetch both paths, but we only executed one of them, holding the
other in reserve "just in case".

Terje Mathisen

unread,
Sep 25, 2020, 3:30:53 PM9/25/20
to
When you as the programmer or user cannot detect if it does this or not,
does it really matter?

What matters, and what cpu architects target, is to run the program to
completion in the minimum time and with minimum power usage.

What Mitch and others here have told you is that at this point in time,
due to the performance of branch predictors it really doesn't make sense
to do what you're asking for.

Back in 1995 I met some of the Intel PentiumPro architects in Santa
Clara, about to get an NDA disclosure about the coming new cpu. Before
that disclosure started I spent 10-15 min on the whiteboard telling
them what I had already figured out from reasoning about what info was
available from open sources. It turned out I got pretty much every
detail correct, _except_ for predicting that it would have eager
execution (i.e. executing both paths of a branch until the condition was
known).

The reality was that the huge BP improvements relative to the Pentium
meant that dual-path execution really didn't make any sense.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

EricP

unread,
Sep 25, 2020, 5:32:16 PM9/25/20
to
This web page discusses eager execution on various IBM processors,
which he says were _mistakenly_ thought to do it. However, he says they
fetch but do not decode or execute the alternate path.

Eager Execution / Dual Path / Multiple Path, Mark Smotherman July 2010
https://people.cs.clemson.edu/~mark/eager.html


Kli-Kla-Klawitter

unread,
Sep 25, 2020, 7:38:05 PM9/25/20
to
>> But what I asked for is if there is actually a OoO-CPU that executes
>> both parts beyond a branch speculatively.

> And the answer is yes - but the way they do that you reject as being
> "just a conditional move". ...

That's not what I asked for.
The conditional move is a single instruction.

MitchAlsup

unread,
Sep 25, 2020, 8:12:32 PM9/25/20
to
My 66000 architecture supports Predicated Execution where the PRED instruction
casts a shadow over up to 8 instructions, the instructions in the shadow can be
from the Then-clause or from the Else-clause.

All instructions in the shadow are inserted into execute and are dependent
on the resolution of the PRED instruction. Some implementations can execute
instructions from both clauses but no implementation can deliver any results
from the shadow until PRED has resolved.

The shadow was described in this manner in order to avoid disrupting the
sequential flow of instructions from fetch to execute whenever possible.
When more than 8 instructions are involved, one reverts to std conditional
branching.

It is possible to PRED multiple times and implement && and || logic trees
without branching and still have up to 8 instructions in the shadow of the
multiple PREDs.

So, are 8 instructions good enough to answer your question?

Stephen Fuld

unread,
Sep 25, 2020, 11:43:35 PM9/25/20
to
On 9/25/2020 12:30 PM, Terje Mathisen wrote:
> Kli-Kla-Klawitter wrote:
>>> There are also tons of stuff on tangentially related topics
>>> like "strands", "nano threads", "decoupled architectures"
>>> which seem concerned mostly with cache misses rather than
>>> speculating down alternate paths.
>>
>> But what I asked for is if there is actually a OoO-CPU that executes
>> both parts beyond a branch speculatively.
>
> When you as the programmer or user cannot detect if it does this or not,
> does it really matter?
>
> What matters, and what cpu architects target, is to run the program to
> completion in the minimum time and with minimum power usage.
>
> What Mitch and others here have told you is that at this point in time,
> due to the performance of branch predictors it really doesn't make sense
> to do what you're asking for.

Exactly right. While it perhaps made sense many years ago, it doesn't
now. I would only add that the implementation of things like CMOV and
friends, and of predication, are also contributing factors to making hardware
that executes both sides of a conditional branch a bad idea.

EricP

unread,
Sep 26, 2020, 9:57:18 AM9/26/20
to
I'm not so sure about that as the problems may be due to CMOV's limitations.

On Alpha CMOVcc is either
if (condition(rs1)) rd = rs2
if (condition(rs1)) rd = uimm8

That means that it executes all instructions on both sides of a branch
and only the final value merge (phi function) is conditional.
It can move an immediate, but that is only an 8-bit zero extended value.
It does have a floating FCMOV but just between registers.
On the plus side, since condition(rs1) uses a register as the predicate
it can have multiple conditions.

x86 CMOVcc reg, r/m is slightly better as the source r/m can be [mem]
so it can do a conditional memory load, which should skip the whole
memory cycle if the predicate is false.
It does have a floating FCMOV but only for the now deprecated x87.
On the con side it is dependent on flags for the predicate so
there can only be one condition at a time - for multiple predicate
conditions you have to keep TEST'ing to set flags first.
Same flags limitation for FCMOV.

Neither CMOV can eliminate expensive operations like MUL or DIV,
and neither can do a conditional store memory.

ARM in its original 32-bit ISA had a 4-bit predicate field on most
instructions including MUL, LD and ST but again based on its
single flags register. This feature was dropped for ARMv7 16-bit
instructions and replaced with the IT (If-Then) predicate instruction
with a 4-instruction shadow.

Because CMOV's are so limited, and make one pay the full cost for
both paths of execution, and neither can do a conditional store,
I don't think that is a fair test of predication. The more general
predication on Itanium or maybe older ARMv6 would be a better guide
as to what is potentially possible.


Anton Ertl

unread,
Sep 26, 2020, 11:11:48 AM9/26/20
to
EricP <ThatWould...@thevillage.com> writes:
>On Alpha CMOVcc is either
> if (condition(rs1)) rd = rs2
> if (condition(rs1)) rd = uimm8
>
>That means that it executes all instructions on both side of a branch
>and only the final value merge (phi function) is conditional.
...
>x86 CMOVcc reg, r/m is slightly better as the source r/m can be [mem]
>so it can do a conditional memory load, which should skip the whole
>memory cycle if the predicate is false.

I doubt that any implementation works that way.

>Neither CMOV can eliminate expensive operations like MUL or DIV,
>and neither can do a conditional store memory.

Yes, the conditional store is one thing that is really missing.

And a conditional DIV on IA-32 and AMD64, too, because on these
architectures DIV can trap (no DIV on Alpha anyway).

But apart from that, the instruction is fine even for the purpose you
have in mind, which is not the purpose of the OP: If the CMOV is
implemented in a way that waits for the condition and then waits for
the selected input (and not the other input), the latency of the
unselected input does not pay a role. They will still consume
resources, so if you want to avoid that, predicated execution is
probably a better approach.

In actual implementations, AFAIK CMOV waits for all inputs, and
predicated execution performs the operation unconditionally, only
canceling the result if the predicate is false.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Terje Mathisen

unread,
Sep 26, 2020, 11:22:29 AM9/26/20
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> On Alpha CMOVcc is either
>> if (condition(rs1)) rd = rs2
>> if (condition(rs1)) rd = uimm8
>>
>> That means that it executes all instructions on both side of a branch
>> and only the final value merge (phi function) is conditional.
> ...
>> x86 CMOVcc reg, r/m is slightly better as the source r/m can be [mem]
>> so it can do a conditional memory load, which should skip the whole
>> memory cycle if the predicate is false.
>
> I that any implementation works that way.
>
>> Neither CMOV can eliminate expensive operations like MUL or DIV,
>> and neither can do a conditional store memory.
>
> Yes, the conditional store is one thing that is really missing.

On the SIMD side there is the very useful (but slightly slow?) masked
move which can store an arbitrary set of bytes from a 16/32/64 byte SIMD
register. This is perfect for handling the final block of a variable
length operation like strcpy(), but can also be used as a predicated
store of any other data type.
>
> And a conditional DIV on IA-32 and AMD64, too, because on these
> architectures DIV can trap (no DIV on Alpha anyway).
>
> But apart from that, the instruction is fine even for the purpose you
> have in mind, which is not the purpose of the OP: If the CMOV is
> implemented in a way that waits for the condition and then waits for
> the selected input (and not the other input), the latency of the
> unselected input does not pay a role. They will still consume
> resources, so if you want to avoid that, predicated execution is
> probably a better approach.
>
> In actual implementations, AFAIK CMOV waits for all inputs, and
> predicated execution performs the operation unconditionally, only
> canceling the result if the predicate is false.

Afaik CMOVcc RAX,RBX is implemented as RAX = cc ? RBX : RAX, so both
inputs must be available and the cc acts as a mux control signal.

EricP

unread,
Sep 26, 2020, 12:13:24 PM9/26/20
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> On Alpha CMOVcc is either
>> if (condition(rs1)) rd = rs2
>> if (condition(rs1)) rd = uimm8
>>
>> That means that it executes all instructions on both side of a branch
>> and only the final value merge (phi function) is conditional.
> ....
>> x86 CMOVcc reg, r/m is slightly better as the source r/m can be [mem]
>> so it can do a conditional memory load, which should skip the whole
>> memory cycle if the predicate is false.
>
> I that any implementation works that way.
>
>> Neither CMOV can eliminate expensive operations like MUL or DIV,
>> and neither can do a conditional store memory.
>
> Yes, the conditional store is one thing that is really missing.
>
> And a conditional DIV on IA-32 and AMD64, too, because on these
> architectures DIV can trap (no DIV on Alpha anyway).
>
> But apart from that, the instruction is fine even for the purpose you
> have in mind, which is not the purpose of the OP: If the CMOV is
> implemented in a way that waits for the condition and then waits for
> the selected input (and not the other input), the latency of the
> unselected input does not pay a role. They will still consume
> resources, so if you want to avoid that, predicated execution is
> probably a better approach.

Not quite what I am looking for as this does not eliminate the
alternate path cost, or at least prune it, after the condition resolves.
Also the x86 and ARMv6 use of flags limits the use.

Pruning would be difficult with CMOV because the cpu doesn't
know which of the earlier instructions are going to be conditional
until it decodes a CMOV by which time they are already gone.
A PRED instruction however explicitly tells the decoder which
instructions are dependent on it so it can start to build and
track a dependency chain dynamically.

It's the predication with execution pruning that achieves the same effect
as the multi-path of the original poster's question.
The big difference is that with multi-path the cpu has to guess
what it should do, with a PRED instruction there is no guessing.

> In actual implementations, AFAIK CMOV waits for all inputs, and
> predicated execution performs the operation unconditionally, only
> canceling the result if the predicate is false.
>
> - anton

There are 2 ways to implement OoO CMOV rd,rs1,rs2.
The first uses a single uOp and requires 3 source and 1 dest
registers - the extra source is to copy the original value of rd
from its old physical register if the condition is false
because rename always assigns a new physical dest.
The second way is to emit 2 uOps, one for the True condition
and the one for False to do an ordinary move old_rd to new_rd.

The above method can be applied generally to many predicated opcodes
not just CMOV. The price for the 2 uOp approach is obviously double
the number of uOps in the queue, half of which are effectively NOPs.
The price for the 1 uOp approach is distribution of the
predicate dependencies to all uOps in the ROB or Res Stns
and propagating as each predicate value resolves.

But it is not quite as simple as disabled uOps being NOPs as there
can be cleanup or other actions associated with the NOP result.
Rather each potentially predicated uOp can be Enable_Always,
EnableIf_True, or EnableIf_False. If EN_True or EN_False
then there is a predicate condition flag dependency to track.
When the pred flag resolves, each uOp has OnEnable or OnDisable
actions to perform, with the simplest OnDisable action being to
break the dependency of the OnEnable source operands,
copy the old_phy_dest to new_phy_dest and mark uOp as Done.
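
A toy C sketch of that bookkeeping (all names and fields here are made up,
loosely following the terms above; real schedulers track far more state):

#include <stdbool.h>

enum pred_mode { EN_ALWAYS, EN_IF_TRUE, EN_IF_FALSE };

struct uop {
    enum pred_mode mode;
    int  old_phy_dest;      /* physical register holding the prior value */
    int  new_phy_dest;      /* physical register allocated by rename     */
    bool done;
};

/* Called when the predicate flag this uOp depends on resolves. */
void on_predicate_resolved(struct uop *u, bool pred, int phy_regs[])
{
    bool enabled = (u->mode == EN_ALWAYS) ||
                   (u->mode == EN_IF_TRUE  &&  pred) ||
                   (u->mode == EN_IF_FALSE && !pred);
    if (!enabled) {
        /* OnDisable: drop the source-operand dependencies (not modeled),
         * copy old_phy_dest to new_phy_dest, and mark the uOp Done. */
        phy_regs[u->new_phy_dest] = phy_regs[u->old_phy_dest];
        u->done = true;
    }
    /* OnEnable: the uOp simply proceeds to execute as usual. */
}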


MitchAlsup

unread,
Sep 26, 2020, 1:04:48 PM9/26/20
to
On Saturday, September 26, 2020 at 10:11:48 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >On Alpha CMOVcc is either
> > if (condition(rs1)) rd = rs2
> > if (condition(rs1)) rd = uimm8
> >
> >That means that it executes all instructions on both side of a branch
> >and only the final value merge (phi function) is conditional.
> ...
> >x86 CMOVcc reg, r/m is slightly better as the source r/m can be [mem]
> >so it can do a conditional memory load, which should skip the whole
> >memory cycle if the predicate is false.
>
> I that any implementation works that way.
>
> >Neither CMOV can eliminate expensive operations like MUL or DIV,
> >and neither can do a conditional store memory.
>
> Yes, the conditional store is one thing that is really missing.

This is one reason the PREDication of My 66000 can put essentially
Any instruction in the shadow of a predicate (or several && or ||).
>
> And a conditional DIV on IA-32 and AMD64, too, because on these
> architectures DIV can trap (no DIV on Alpha anyway).
>
> But apart from that, the instruction is fine even for the purpose you
> have in mind, which is not the purpose of the OP: If the CMOV is
> implemented in a way that waits for the condition and then waits for
> the selected input (and not the other input), the latency of the
> unselected input does not pay a role. They will still consume
> resources, so if you want to avoid that, predicated execution is
> probably a better approach.

The reason one performs if-conversion is to avoid branching and especially
high mispredict rate conditional branching. Here it is useful to waste the
resource in order to accrue savings elsewhere (typically fetch momentum).
That is: it is cheaper to perform if-conversion and pay for the resources
than to avoid the resources and pay for branch recovery!
>
> In actual implementations, AFAIK CMOV waits for all inputs, and
> predicated execution performs the operation unconditionally, only
> canceling the result if the predicate is false.

With Predication shadows, one can throw out the non-executing instructions
the instant the predicate becomes known. So, both sides wait, and then one
side disappears.
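
For readers following along, a minimal C-level picture of that trade-off
(a hand-written illustration, not taken from any poster's compiler): the
branchy form executes only one arm but risks a mispredict; the if-converted
form does the work of both arms and merges with a select.

/* Branchy form: only one arm executes, but a hard-to-predict condition
   costs a pipeline flush on every mispredict. */
int abs_branch(int x)
{
    if (x < 0)
        x = -x;
    return x;
}

/* If-converted form: both "arms" are computed unconditionally and a
   select (CMOV / pick / predicated move) merges them, so there is no
   branch to mispredict -- at the cost of always doing both. */
int abs_select(int x)
{
    int neg = -x;               /* work for the "then" arm                 */
    return (x < 0) ? neg : x;   /* typically compiles to a select or CMOV  */
}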

Ivan Godard

Sep 26, 2020, 2:10:35 PM
Mill is something of an extreme case here, having both predication and
selection. Because of Mill's metadata, the only instructions that need
predicated forms are those that explicitly alter machine state: memory
and control flow. There's no need to predicate an add, even an FP one.

Because predication is sparse, and because the compiler jams both sides
of branches very often and the jammed paths execute interleaved, it's
relatively rare for two instructions under the same predicate to appear
in the same instruction bundle. Consequently a shadow-casting approach
like Mitch's is unnecessary; you rarely need to predicate two ops in the
same shadow.

Selection is done by the pick() instruction, a classic "?:" operation
that can be thought of as a pair of CMOVs. I agree with Mitch's approach
of doing the work in the data path rather than the FUs, so pick is
implemented as a mux control, not as an operation, and as a result has
zero latency.

Mill predication and selection have turned out to be quite effective:
over a sample of code from our test suite, we see a collapse ratio of ~4
from input basic blocks to executed hyperblocks (a.k.a. ebb or extended
basic block). That is equivalent to executing both ways on branches to a
depth of two.

It is easy to find code examples where block collapse has produced a
hyperblock with a longer latency than the actual trace of execution
under perfect prediction, especially in Copper and smaller
configurations that lack the width to blend multiple paths well. In real
predictors, however, the freedom from misses in prediction and in the
icache outweighs the increased latency.

Anton Ertl

Sep 27, 2020, 8:34:33 AM
EricP <ThatWould...@thevillage.com> writes:
>Pruning would be difficult with CMOV because the cpu doesn't
>know which of the earlier instructions are going to be conditional
>until it decodes a CMOV by which time they are already gone.

In a modern OoO CPU, they usually are still waiting in some dispatch
unit by the time the CMOV is decoded.

However, on a register machine you only know that the instructions may
be conditional when the target register is overwritten after being
used only by an instruction that is CMOV or is conditional in the same
sense. Such things are much easier to track on stack machines.

>A PRED instruction however explicitly tells the decoder which
>instructions are dependent on it so it can start to build and
>track a dependency chain dynamically.
>
>Its the predication with execution pruning that effects the same
>as the multi-path of the original posters' question.
>The big difference is that with multi-path the cpu has to guess
>what it should do, with a PRED instruction there is no guessing.

If you wait until the condition is known, you pay for the good
utilization with more execution time latency.

Also, instruction fetching, decoding and register renaming cost
resources, too, probably more than execution.

So AFAIK all implementations of predication actually execute the
instruction; they only don't write back the result (if there is no
register renaming), or they select the other input (with register
renaming).

If you really want to maximize the proportion of architectural work,
the way to go is to use an in-order CPU that stalls on branches until
the target is known, and known to be executed. The resulting CPU will
be a lot slower than current high-end CPUs, though.

JimBrakefield

Sep 27, 2020, 12:41:14 PM
Keep wondering about making stack machines more compatible with OoO: use the stack as write-once until the return, which discards all the intermediate values, leaving the final result (before subroutine entry, stack space is reserved for the results and then the operands). I.e., don't economize on stack usage. One ISA implementation is for the instructions to merge the stack operation with a pick of one of the operands.
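
One way to read that idea, as a minimal C sketch (the structure, sizes and
names are invented for illustration, not a worked-out proposal):

#include <stdint.h>

/* Write-once evaluation stack: every operation appends a new slot and
   nothing is reused until the frame is discarded, so each slot behaves
   like a single-assignment value, which is friendly to renaming. */
typedef struct {
    uint64_t slot[256];
    int      top;                     /* next free slot */
} WriteOnceStack;

static int push(WriteOnceStack *s, uint64_t v)
{
    s->slot[s->top] = v;              /* never overwrites an older slot */
    return s->top++;
}

/* Before a call: reserve the result slot(s) first, then push operands. */
static int call_enter(WriteOnceStack *s, int nresults)
{
    int result_base = s->top;
    s->top += nresults;
    return result_base;
}

/* At return: write the final value(s) into the reserved result slot(s)
   and discard all operands and intermediates in one step. */
static void call_return(WriteOnceStack *s, int result_base,
                        int final_slot, int nresults)
{
    for (int i = 0; i < nresults; i++)
        s->slot[result_base + i] = s->slot[final_slot + i];
    s->top = result_base + nresults;
}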

EricP

Sep 27, 2020, 3:07:10 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> Pruning would be difficult with CMOV because the cpu doesn't
>> know which of the earlier instructions are going to be conditional
>> until it decodes a CMOV by which time they are already gone.
>
> In a modern OoO CPU, they usually are still waiting in some dispatch
> unit by the time the CMOV is decoded.

They _may_ be waiting, or they may have completed.
Decode doesn't know.

And the dependency matrix is currently designed for forwarding
not backtracking. And it is built after-the-fact when we know
that r0 is an input to produce a new r1.

A CMOV that conditionally moves r0 to r1 tells us that the
old r1 is dead but says nothing about r0, so we might prune
r1's source tree but not r0's. It's only when r0 is overwritten
that we know it was dead and could have been pruned.

Hmmm... a comes-from scoreboard matrix for physical registers might do.

r2 = r1 + r0 sets FF flags r1,r0 in row r2, with the + inst #.
r5 = r4 - r3 sets FF flags r4,r3 in row r5, with the - inst #.

Walking the matrix backwards tells us which instruction # to try
to shoot down, if not already done. But that backwards trace is
going to expand ~2^n source physical registers per iteration.

We can only kill paths to physical registers where there
are no other readers of that register.
And only ones where the physical register is no longer the
architectural register (the architectural reg has been overwritten).
And it's only useful if the instruction that writes the physical reg
is still pending.

So it's iterative with 1 reg per step, not parallel.
Also needs some way to shoot down individual instructions.

I'll call it maybe possible, but I've probably missed something important,
like: does this phy reg * phy reg (say 64*64 = 4096 bits) comes-from matrix
have to be checkpointed for every conditional branch? (I suspect so.)

Sounds slow and expensive anyway.
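
To make the shape of that structure concrete, a minimal C sketch of a
comes-from matrix and one backward pruning step (names, sizes, and the
missing other-reader/architectural checks are all simplifications for
illustration):

#include <stdbool.h>
#include <string.h>

#define NPREGS 64

/* Comes-from matrix: row = physical dest register, column bits = the
   physical sources that produced it, plus the instruction # that wrote
   the row.  E.g. r2 = r1 + r0 sets from[r2][r1] and from[r2][r0]. */
typedef struct {
    bool from[NPREGS][NPREGS];
    int  writer[NPREGS];              /* instruction # that wrote dst, or -1 */
} ComesFrom;

static void record(ComesFrom *cf, int dst, const int *srcs, int nsrc, int inst)
{
    memset(cf->from[dst], 0, sizeof cf->from[dst]);
    for (int i = 0; i < nsrc; i++)
        cf->from[dst][srcs[i]] = true;
    cf->writer[dst] = inst;
}

/* One backward step of the pruning walk: given a physical register known
   to be dead, try to shoot down the still-pending producers of its
   sources.  A real design would also check for other readers and that
   the source is no longer the architectural register -- omitted here. */
static void prune_step(const ComesFrom *cf, int dead_dst,
                       void (*try_kill)(int inst))
{
    for (int src = 0; src < NPREGS; src++)
        if (cf->from[dead_dst][src] && cf->writer[src] >= 0)
            try_kill(cf->writer[src]);
}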

> However, on a register machine you only know that the instructions may
> be conditional when the target register is overwritten after being
> used only by an instruction that is CMOV or is conditional in the same
> sense. Such things are much easier to track on stack machines.
>
>> A PRED instruction however explicitly tells the decoder which
>> instructions are dependent on it so it can start to build and
>> track a dependency chain dynamically.
>>
>> Its the predication with execution pruning that effects the same
>> as the multi-path of the original posters' question.
>> The big difference is that with multi-path the cpu has to guess
>> what it should do, with a PRED instruction there is no guessing.
>
> If you wait until the condition is known, you pay for the good
> utilization with more execution time latency.

It doesn't _have to_ wait for the PRED condition to resolve.
It could speculatively execute dependents and forward the results.
If the predicate is unresolved I would give them lowest issue priority
so they only contend for resources when nothing else would use them.

MitchAlsup

Sep 27, 2020, 3:42:19 PM
Which, by the way, is what the WAR (result delivery) part of a Scoreboard
does.

> And only ones where the physical register is no longer the
> architectural register (the architectural reg has been overwritten).
> And its only useful if the instruction that writes the physical reg
> is still pending.
>
> So its iterative with 1 reg per step, not parallel.

You can compute the transitive closure of this at issue time, in rename
space, making the search no worse than logarithmic.

> Also needs some way to shoot down individual instructions.

Scoreboards make this part easy.
>
> I'll call it maybe possible but I've probably missed something important,
> like does this phy reg * phy reg (say 64*64=4096 bits) comes-from matrix
> have to be checkpointed for every conditional branch (I suspect so).

The matrix does not have to be checkpointed! The constraint is that the
matrix has to be built in a known order, and backtracked to a known point.
So, you do this by instruction number not by phys reg number, with the
instruction number having a path to the phys reg number.
>
> Sounds slow and expensive anyway.

A reservation station machine with 64 dynamic instructions in the execution
window will have {3+2+2+2}*16*{64+8+3+7} = 11520 bits (in comparison) or
nearly 3× more.
>
> > However, on a register machine you only know that the instructions may
> > be conditional when the target register is overwritten after being
> > used only by an instruction that is CMOV or is conditional in the same
> > sense. Such things are much easier to track on stack machines.
> >
> >> A PRED instruction however explicitly tells the decoder which
> >> instructions are dependent on it so it can start to build and
> >> track a dependency chain dynamically.
> >>
> >> Its the predication with execution pruning that effects the same
> >> as the multi-path of the original posters' question.
> >> The big difference is that with multi-path the cpu has to guess
> >> what it should do, with a PRED instruction there is no guessing.
> >
> > If you wait until the condition is known, you pay for the good
> > utilization with more execution time latency.
>
> It doesn't _have to_ wait for the PRED condition to resolve.
> It could speculatively execute dependents and forward the results.

If your execution window supports "instruction replay" you can predict
one (side) of CMOV and replay the other side if necessary. LDs and STs
which miss the L1 are often replayed.

Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
so we delivered the result at 12 and then delivered the result again
if precise rounding changed the bit pattern. This resulted in a pretty
good speedup since when performing FDIV almost every instruction
sequence becomes dependent on that FDIV and its latency.

Terje Mathisen

Sep 28, 2020, 2:43:26 AM
Anton Ertl wrote:
> So AFAIK all implementations of predication actually execute the
> instruction; they only don't write back the result (if there is no
> register renaming), or they select the other input (with register
> renaming).
>
> If you really want to maximize the proportion of architectural work,
> the way to go is to use an in-order CPU that stalls on branches until
> the target is known, and known to be executed. The resulting CPU will
> be a lot slower than current high-end CPUs, though.

The canonical way to do this is to use wide multi-threading, so that
each thread only uses the core every N cycles, at which time branches
have been resolved, all reg-reg operations (including FMAC) have
finished. Include load-op as well with zero penalty as long as you have
a $L1 hit.

AFAIR SPARC had a server chip like this; the original Larrabee chip ran
4 such threads, and there have probably been many others tempted by the
obvious throughput target. LRB also had per-SIMD-channel predicates on
everything in order to make it far easier to handle the first/last
partial block in any vector op.
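
A toy C model of the round-robin scheme described above (the thread count
and the stand-in issue function are purely illustrative):

#include <stdint.h>

#define NTHREADS 4     /* e.g. 4-way round robin, as on Larrabee */

typedef struct {
    uint64_t pc;       /* plus the rest of the per-thread state */
} ThreadCtx;

static void step_thread(ThreadCtx *t)
{
    t->pc += 4;        /* stand-in for "issue one instruction for this thread" */
}

/* Barrel / fine-grained multithreading: each cycle issues from the next
   thread in strict rotation, so by the time a thread comes around again
   (NTHREADS cycles later) its branch, FMAC, or L1-hit load has already
   resolved and no interlock or branch predictor is needed. */
void barrel_run(ThreadCtx ctx[], long cycles)
{
    for (long c = 0; c < cycles; c++)
        step_thread(&ctx[c % NTHREADS]);
}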

Terje Mathisen

Sep 28, 2020, 2:52:12 AM
MitchAlsup wrote:
> If your execution window supports "instruction replay" you can predict
> one (side) of CMOV and replay the other side if necessary. LDs and STs
> which miss the L1 are often replayed.
>
> Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
> error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
> so we delivered the result at 12 and then delivered the result again
> if precise rounding changed the bit pattern. This resulted in a pretty
> good speedup since when performing FDIV almost every instruction
> sequence becomes dependent on that FDIV and its latency.

That's an interesting optimization! Sounds like you had an NR-style
divider, so you needed an additional iteration to get to the 2N+4 bits
that would guarantee correct rounding, but on the previous iteration you
had N+2 bits for a max error of 1/4 ULP, meaning that statistically you had
a much better than 75% (95+%?) chance of being exactly right?

Ivan Godard

Sep 28, 2020, 10:22:35 AM
That's a "barrel processor". The CDC Peripheral Processors worked like that.

MitchAlsup

Sep 28, 2020, 10:50:11 AM
On Monday, September 28, 2020 at 1:43:26 AM UTC-5, Terje Mathisen wrote:
> Anton Ertl wrote:
> > So AFAIK all implementations of predication actually execute the
> > instruction; they only don't write back the result (if there is no
> > register renaming), or they select the other input (with register
> > renaming).
> >
> > If you really want to maximize the proportion of architectural work,
> > the way to go is to use an in-order CPU that stalls on branches until
> > the target is known, and known to be executed. The resulting CPU will
> > be a lot slower than current high-end CPUs, though.
>
> The canonical way to do this is to use wide multi-threading, so that
> each thread only uses the core every N cycles,

If N is fixed, you have a barrel processor.

> at which time branches
> have been resolved, all reg-reg operations (including FMAC) have
> finished. Include load-op as well with zero penalty as long as you have
> a $L1 hit.

Denelcor HEP (1-wide) had multiple threads contending for execution,
but only after the previous instruction was known to have completed.
This included not only long instructions like FDIV, but also LDs,...
>
> Afair SPARC had a server chip like this,

Niagara

> the original Larrabee chip ran
> 4 such threads and there have probably been many others tempted by the
> obvious throughput target. LRB also had per-SIMD-channel predicates on
> everything in order to make it far easier to handle the first/last
> partial block in any vector op.

It is all dependent on how one structures the ports on the register file(s).

MitchAlsup

Sep 28, 2020, 10:52:06 AM
On Monday, September 28, 2020 at 1:52:12 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > If your execution window supports "instruction replay" you can predict
> > one (side) of CMOV and replay the other side if necessary. LDs and STs
> > which miss the L1 are often replayed.
> >
> > Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
> > error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
> > so we delivered the result at 12 and then delivered the result again
> > if precise rounding changed the bit pattern. This resulted in a pretty
> > good speedup since when performing FDIV almost every instruction
> > sequence becomes dependent on that FDIV and its latency.
>
> That's an interesting optimization! Sounds like you had an NR style
> divider, so you needed an additional iteration to get to the 2N+4 bits
> that would guarantee correct rounding, but on the previous iteration you
> had a N+2 for a max error of 1/4 ULP meaning that statistically you had
> much better than 75% (95+%?) chance of being exactly right?

First several iterations were Goldschmidt, with an NR correction step at the end.

We measured 80%-ile correct.
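
For readers unfamiliar with the scheme, roughly what Goldschmidt plus a
final NR-style correction looks like, sketched in double-precision C (seed
accuracy and iteration count are illustrative, not the Mc 88120's, and real
hardware works on significands with extra guard bits):

#include <stdio.h>

/* Goldschmidt division a/b with a remainder-based correction step. */
static double goldschmidt_div(double a, double b)
{
    double recip = (double)(1.0f / (float)b);  /* crude reciprocal seed    */
    double n = a * recip;                      /* converges toward a/b     */
    double d = b * recip;                      /* converges toward 1.0     */

    for (int i = 0; i < 2; i++) {              /* Goldschmidt iterations   */
        double f = 2.0 - d;
        n     *= f;
        d     *= f;
        recip *= f;                            /* running 1/b estimate     */
    }
    double q_early = n;               /* fast result, may be mis-rounded   */

    /* Final correction from the remainder; this is the extra step that
       turns the early result into the (usually) correctly rounded one. */
    double r = a - q_early * b;
    return q_early + r * recip;
}

int main(void)
{
    printf("%.17g vs %.17g\n", goldschmidt_div(355.0, 113.0), 355.0 / 113.0);
    return 0;
}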

Anton Ertl

Sep 28, 2020, 12:44:39 PM
Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> If you really want to maximize the proportion of architectural work,
>> the way to go is to use an in-order CPU that stalls on branches until
>> the target is known, and known to be executed. The resulting CPU will
>> be a lot slower than current high-end CPUs, though.
>
>The canonical way to do this is to use wide multi-threading, so that
>each thread only uses the core every N cycles, at which time branches
>have been resolved, all reg-reg operations (including FMAC) have
>finished. Include load-op as well with zero penalty as long as you have
>a $L1 hit.

SMT optimizes pipeline throughput for parallel code at the cost of
lower single-thread performance. I wonder if this really pays off
these days. ARM does not do SMT on the Cortex A55.

>Afair SPARC had a server chip like this, the original Larrabee chip ran
>4 such threads and there have probably been many others tempted by the
>obvious throughput target.

The Tera MTA is the most extreme, with 128 threads per CPU, intended
to cover main memory latency, without using caches.

This kind of design is attractive for embarrassingly parallel loads if
functional units are expensive, but somehow that is no longer the
case. OTOH, having 128 simple single-issue CPUs connected to a Tera
MTA-like memory controller probably results in complexity in
arbitrating access to the memory controller.

Anyway, the mainstream has moved elsewhere. Thanks to Amdahl's law,
you also want to have good single-thread performance. This includes
Sun/Oracle. The last few revisions of Oracle SPARC were high-clocked
OoO implementations. They kept having 8 threads/core, however.

MitchAlsup

Sep 28, 2020, 1:00:12 PM
On Monday, September 28, 2020 at 11:44:39 AM UTC-5, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
> >Anton Ertl wrote:
> >> If you really want to maximize the proportion of architectural work,
> >> the way to go is to use an in-order CPU that stalls on branches until
> >> the target is known, and known to be executed. The resulting CPU will
> >> be a lot slower than current high-end CPUs, though.
> >
> >The canonical way to do this is to use wide multi-threading, so that
> >each thread only uses the core every N cycles, at which time branches
> >have been resolved, all reg-reg operations (including FMAC) have
> >finished. Include load-op as well with zero penalty as long as you have
> >a $L1 hit.
>
> SMT optimizes pipeline throughput for parallel code at the cost of
> lower single-thread performance. I wonder if this really pays off
> these days. ARM does not do SMT on the Cortex A55.

Can you change "optimizes" into "improves"? Even after conversion to SIMD,
the code is not "optimal", and this is why they widen the SIMD width
every other generation. At best it is a local optimum which changes over
generations.

Terje Mathisen

Sep 28, 2020, 2:18:43 PM
So as long as the penalty for having to resubmit the result was less
than about 15-20 cycles (5 cycles saved 4/5 of the time, against a 1/5
chance of having to restart), this would be a win, and from everything
else you've written I'm guessing the resubmit cost was just a cycle or two?

MitchAlsup

Sep 28, 2020, 3:17:41 PM
On Monday, September 28, 2020 at 1:18:43 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, September 28, 2020 at 1:52:12 AM UTC-5, Terje Mathisen wrote:
> >> MitchAlsup wrote:
> >>> If your execution window supports "instruction replay" you can predict
> >>> one (side) of CMOV and replay the other side if necessary. LDs and STs
> >>> which miss the L1 are often replayed.
> >>>
> >>> Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
> >>> error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
> >>> so we delivered the result at 12 and then delivered the result again
> >>> if precise rounding changed the bit pattern. This resulted in a pretty
> >>> good speedup since when performing FDIV almost every instruction
> >>> sequence becomes dependent on that FDIV and its latency.
> >>
> >> That's an interesting optimization! Sounds like you had an NR style
> >> divider, so you needed an additional iteration to get to the 2N+4 bits
> >> that would guarantee correct rounding, but on the previous iteration you
> >> had a N+2 for a max error of 1/4 ULP meaning that statistically you had
> >> much better than 75% (95+%?) chance of being exactly right?
> >
> > First several iterations were Goldschmidt, with a NR correction step at the end.
> >
> > We measured 80%-ile correct.
>
> So as long as the penalty for having to resubmit the result was less
> than about 15-20 cycles (5 cycles saved on 4/5, with a 1/5 chance of
> having to restart) this would be a win, and from everything else you've
> written I'm guessing sure the resubmit cost was just a cycle or two?

I don't remember exactly, but it causes everything in the dependence chain
to rerun (get fired into execution at the reservation station entry) as well.
I suspect it was no different than the first delivery.

The first delivery has the RS entry fire into execution, capturing operands
via forwarding.

I suspect that the second result delivery would cause the RS entry to become
fireable again, and the picker would likely choose this one due to its age. So the
dependence chain fires off with the same timing as the improperly delivered
result's dependency chain.

So the precise answer to your question is likely 0 cycles.

Paul A. Clayton

Sep 29, 2020, 11:59:37 AM
On Sunday, September 27, 2020 at 3:42:19 PM UTC-4, MitchAlsup wrote:
[snip]
> Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
> error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
> so we delivered the result at 12 and then delivered the result again
> if precise rounding changed the bit pattern. This resulted in a pretty
> good speedup since when performing FDIV almost every instruction
> sequence becomes dependent on that FDIV and its latency.
> > If the predicate is unresolved I would give them lowest issue priority
> > so they only contend for resources when nothing else would use them.

That is even more aggressive than what IBM did with one of its POWER
implementations (POWER5??), where the rounding was hoisted into the
start of a dependent operation.

It seems that if the latency of multiply was higher than 5 cycles (unlikely),
a final correction of a multiplication could be included. Presumably it
was much cleaner (less correctness risk, less development effort/schedule
risk, less risk to overall performance/size) to use replay than to try to
implement late correction on dependent operations. (Some dependent
operations might not even have changed results under different rounding
and that could, theoretically, be detected. In my opinion exact rounding is
not really that useful, but I am not a programmer much less a numerical
analyst.)

BGB

Sep 30, 2020, 4:58:17 PM
On 9/20/2020 12:01 PM, EricP wrote:
> Paul A. Clayton wrote:
>> On Friday, September 18, 2020 at 11:40:06 PM UTC-4, Kli-Kla-Klawitter
>> wrote:
>>> In another forum someone alleged that there are OoO-CPUs which
>>> execute both paths of a branch speculatively and drop the path that
>>> turns out to be wrong. Is there really such a CPU, or more: is there
>>> a x86-CPU that works this way ?
>>
>> IBM's POWER7 used dynamic hammock predication for single instruction
>> forward branches. If a branch prediction was deemed low confidence,
>> the single instruction was converted into a predicated instruction.
>> This is not the same as general hammock predication (the limit to one
>> instruction) much less general eager execution.
>>
>> (Dynamic predictation has a potential advantages over predication
>> (without predicate prediction), when the condition/branch is
>> predictable, of removing a data dependency and eliminating excess
>> instructions. The instruction cache can even be designed (like a trace
>> cache) to increase spatial locality when a predictable path is very
>> temporally stable while allowing a change to a different stable path
>> or semi-stable/moderately predictable paths — where less specializing
>> would be used — or a highly unpredictable path — where predication
>> might be used. Of course, branchification of predicated code would
>> also be possible, but that transformation path seems more complex. The
>> use of predication for timing independence also makes predicate
>> prediction potentially problematic.)
>
> (Google keeps insisting that "predication" is misspelt "prediction"
> and redirecting my searches.)
>

For these reasons, Google keeps getting less effective at searching for
more or less anything...

Like, maybe, we don't need search that keeps replacing words with
"synonyms" (in some often completely unrelated context), or assumes that
words were misspelt, or tries to "correct" the search when one does
"thing -unrelated_thing" with "Did you mean 'unrelated_thing'?"...


> There is something IBM calls  "bc+8 conversion to predication" in POWER8.
> In Power ISA a branch offset is relative to the branch instruction
> so bc+8 means conditional branch over 1 instruction.
>
> In POWER8 Processor User’s Manual for the Single-Chip Module, 2016
>
> 10.1.4.7 BC+8 Handling
>
> In addition to the normal branch handling, bc instructions whose BO field
> indicates only a CR dependency with a displacement of +8 are handled as a
> special case. Such bc+8 instructions are marked as potential candidates
> for conversion to predication. When the next instruction after a bc+8
> is in the list of allowed to be predicated (see Table 10-3),
> the bc+8 is marked as first for group formation processing.
> Finally, if the branch prediction for the bc+8 instruction is predicted
> not taken, the instruction pair is treated similar to an isel,
> in that the branch cannot be mispredicted and the next instruction
> is conditionally executed based on the bc outcome. Any time a bc+8
> instruction is executed, the branch prediction algorithm writes a
> value assuming the branch was not taken. This encourages all bc+8
> instructions to be converted to predication. Use of a NOP can be used
> to prevent bc+8 conversions in specific cases (making the bc+8 a bc+12).
>
> Table 10-3. bc+8 Pairable Instructions
> Assembler Mnemonic
> addi addis add and or xor ori stb sth stw std
>
> When a bc+8 instruction pair is converted to predication, the destination
> register now is pending until the predication is resolved. This can mean
> that subsequent dependent instructions are prevented from issuing.
> The bc+8 conversions must be avoided for the following circumstances:
> 1. If the branch is expected to be very predictable, the branch
>    prediction hardware outperforms predication.
> 2. If the pairable operation is the start of a dependency chain
>    involving loads, sometimes it is more beneficial to have the
>    loads start execution even if the branch eventually flushes.
>    Predication delays dependent operations from executing.
>    Therefore, if a long latency load is delayed, the performance
>    is better served by allowing the branch to be predicted.
>    This is especially true when the resultant memory access can
>    be the same cache line in either outcome. Branch misprediction
>    can be    hidden by the long latency memory reference while
>    the dependency delay due to predication cannot be hidden.
>    Heap sort algorithms show this effect.
>


Though, admittedly, to me treating a short-forward-branch as a
predication prefix seems kinda like a poor-man's solution (vs including
it in the encoding).

Or, OTOH, having it directly in the encoding avoids need for having
logic to infer that a short forward branch implies predicating the
following instructions...


Granted, there is always the risk that the assumed tradeoffs made by the
compiler don't match those which are most appropriate for the core (eg:
by predicating either too many or too few instructions and leading to
less-than-optimal results).

And, admittedly, off-hand I don't have a much better solution than
checking whether the operations in the block match what is allowed for
predicated expressions, and deciding based on the operation count.
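
One possible shape for such a check, as a hypothetical C sketch (the IR
node, the predicable flag, and the threshold of 8 are all invented for
illustration, not anyone's actual compiler):

#include <stdbool.h>

typedef struct Op {
    bool predicable;          /* target has a predicated form for this op */
    struct Op *next;
} Op;

/* If-convert a then/else hammock only if every op in both arms has a
   predicated form and the combined count is small enough that always
   executing both arms beats the expected mispredict cost. */
static bool should_if_convert(const Op *then_arm, const Op *else_arm)
{
    int count = 0;
    for (const Op *o = then_arm; o; o = o->next, count++)
        if (!o->predicable)
            return false;
    for (const Op *o = else_arm; o; o = o->next, count++)
        if (!o->predicable)
            return false;
    return count <= 8;        /* arbitrary illustrative threshold */
}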

Ivan Godard

Sep 30, 2020, 5:45:14 PM
My Google searches (via Firefox) always give me a "did you really mean"
link with the original spelling when it does a substitution. Don't you
get that?

MitchAlsup

Sep 30, 2020, 6:06:31 PM
Put a + in front of predication.
You are running under the assumption that instructions are PREDicated
one by one. My 66000 casts an 8-instruction shadow and groups instructions
under the shadow as being part of the Then-clause or part of the Else-clause.
>
>
> Granted, there is always the risk that the assumed tradeoffs made by the
> compiler don't match those which are most appropriate for the core (eg:
> by predicating either too many or too few instructions and leading to
> less-than-optimal results).

This is why in certain conversations I use the term depessimizing instead of
optimizing. If code was really "optimized" then further development of the
compiler would not improve performance.

MitchAlsup

Sep 30, 2020, 6:10:16 PM
On Tuesday, September 29, 2020 at 10:59:37 AM UTC-5, Paul A. Clayton wrote:
> On Sunday, September 27, 2020 at 3:42:19 PM UTC-4, MitchAlsup wrote:
> [snip]
> > Side note: In Mc 88120 we had a FP DIV instruction that was 1/4 ULP
> > error at 12 cycles and it took until 17 cycles to have IEEE accuracy,
> > so we delivered the result at 12 and then delivered the result again
> > if precise rounding changed the bit pattern. This resulted in a pretty
> > good speedup since when performing FDIV almost every instruction
> > sequence becomes dependent on that FDIV and its latency.
> > > If the predicate is unresolved I would give them lowest issue priority
> > > so they only contend for resources when nothing else would use them.
>
> That is even more aggressive than what IBM did with one of its POWER
> implementations (POWER5??), where the rounding was hoisted into the
> start of a dependent operation.
>
> It seems that if the latency of multiply was higher than 5 cycles (unlikely),
> a final correction of a multiplication could be included.

You must remember I work INSIDE function units and generally try to speak
in gates of delay not cycles of delay. I have seen no multiplier tree take
more than 2-cycles since about 1984. The rest of the FMUL (or FMAC) are
more stages.

> Presumably it
> was much cleaner (less correctness risk, less development effort/schedule
> risk, less risk to overall performance/size) to use replay than to try to
> implement late correction on dependent operations. (Some dependent
> operations might not even have changed results under different rounding
> and that could, theoretically, be detected. In my opinion exact rounding is
> not really that useful, but I am not a programmer much less a numerical
> analyst.)

We did not even KNOW we were broadcasting bad data until the rounding was
finished and did not match the broadcast result. That correction step was
a complete Newton-Raphson iteration (4-cycles) plus a compare cycle.

Jan Coombs

Oct 1, 2020, 5:27:04 AM
On Wed, 30 Sep 2020 15:58:10 -0500
BGB <cr8...@gmail.com> wrote:

> For these reasons, Google keeps getting less effective at searching for
> more or less anything...
>
> Like, maybe, we don't need search that keeps replacing words with
> "synonyms" (in some often completely unrelated context), or assumes that
> words were mispelt, or tries to "correct" the search when one does
> "thing -unrelated_thing" with "Did you mean 'unrelated_thing'?"...

"predication" - 9th hit is computing related:
https://duckduckgo.com/?q=predication&t=ffnt&atb=v102-1&ia=web

"predication computer architecture" - a sprinkle of hits:
https://duckduckgo.com/?q=predication+computer+architecture&t=ffnt&atb=v102-1&ia=web

Jan Coombs
--


Terje Mathisen

Oct 1, 2020, 4:38:59 PM
MitchAlsup wrote:
> On Wednesday, September 30, 2020 at 3:58:17 PM UTC-5, BGB wrote:
>> Though, admittedly, to me treating a short-forward-branch as a
>> predication prefix seems kinda like a poor-man's solution (vs including
>> it in the encoding).
>>
>> Or, OTOH, having it directly in the encoding avoids need for having
>> logic to infer that a short forward branch implies predicating the
>> following instructions...

The advantage of doing it dynamically is that the cpu can use previous
prediction history to determine if predication makes sense?
>
> You are running under the assumption that instructions are PREDicated
> one by one. My 66000 casts an 8-instruction shadow and groups instruction
> under the shadow as being part of the Then-clause or part of the Else-clause.
>>
>>
>> Granted, there is always the risk that the assumed tradeoffs made by the
>> compiler don't match those which are most appropriate for the core (eg:
>> by predicating either too many or too few instructions and leading to
>> less-than-optimal results).
>
> This is why in certain conversations I use the term depessimizing instead of
> optimizing. If code was really "optimized" then further development of the
> compiler would not improve performance.

We have all probably seen far too many examples of pessimal code. :-(

Paul A. Clayton

Oct 1, 2020, 5:43:04 PM
On Thursday, October 1, 2020 at 4:38:59 PM UTC-4, Terje Mathisen wrote:
> MitchAlsup wrote:
[snip]
>> If code was really "optimized" then further development of the
> > compiler would not improve performance.

Unless the workload (data set, task frequency, etc.), microarchitecture,
system hardware, system occupancy, or other such factor changes.

While most of the improvements exploiting such differences would
be at a level higher than usually allowed for a compiler (partially
because compilers are generally not provided with such information
and target binaries "portable" across systems and uses).

> We have all probably seen far to many examples of pessimal code. :-(

If it eventually completes with a usefully correct result, it is not (in a
strict sense) pessimal. Even downloading GiBs of potential software
dependencies (and polling for Internet access if not available at
start-up) is not *strictly* pessimal. (One must provide availability
to every conceivable feature, and obviously one should not force
users to pay for such bloat. The option to turn off dependency
downloads is clearly documented, and any user unwilling to make
the effort to read the documentation (Beware of the Leopard) is
clearly not that interested in program run time. The documentation
is even downloaded every time the program is run, so it is not like
one has to travel to the home office in Alpha Centauri to read it.)

BGB

Oct 1, 2020, 5:48:01 PM
I have seen this one as well; I think it may depend on what one is
searching for, and whether it thinks it is a misspelt term, or whether
it thinks the terms are synonymous and it wants to use the more
"popular" term (or something to this effect).

Sometimes, it would be convenient to have an older-style search engine
(without any of this "cleverness").

MitchAlsup

Oct 1, 2020, 9:09:29 PM
On Thursday, October 1, 2020 at 4:43:04 PM UTC-5, Paul A. Clayton wrote:
> On Thursday, October 1, 2020 at 4:38:59 PM UTC-4, Terje Mathisen wrote:
> > MitchAlsup wrote:
> [snip]
> >> If code was really "optimized" then further development of the
> > > compiler would not improve performance.
>
> Unless the workload (data set, task frequency, etc.), microarchitecture,
> system hardware, system occupancy, or other such factor changes.

I am going to disagree, a bit here::

When we did the GBOoO Mc 88120 we learned that your wide reservation
station architectures wanted exactly one thing for top performance::
minimum instruction count--and the HW could pretty much absorb the
variations in latency.

Loop unrolling:: only worked when we were trying to compress the
loop overhead (ADD/CMP/BLT) into what smelled like one loop per cycle.
Unrolling by 3 looked near optimal, but 4 had fewer odd boundary
conditions.
>
> While most of the improvements exploiting such differences would
> be at a level higher than usually allowed for a compiler (partially
> because compilers are generally not provided with such information
> and target binaries "portable" across systems and uses).
>
> > We have all probably seen far to many examples of pessimal code. :-(
>
> If it eventually completes with a usefully correct result, it is not (in a
> strict sense) pessimal. Even downloading GiBs of potential software
> dependencies (and polling for Internet access if not available at
> start-up) is not *strictly* pessimal. (One must provide availability
> to every conceivable feature, and obviously one should not force
> users to pay for such bloat. The option to turn off dependency
> downloads is clearly documented, and any user unwilling to make
> the effort to read the documentation (Beware of the Leopard) is
> clearly not that interested in program run time. The documentation
> is even downloaded every time the program is run, so it is not like
> one has to travel to the home office in Alpha Centauri to read it.)

That is why I used the term depessimization not pessimal code.

BGB

Oct 1, 2020, 10:08:19 PM
Possibly.

As I can note, in my case they are predicated one-by-one. As a result it
is also possible to bundle instructions from both the "then" and "else"
branches together and rely on them being mutually exclusive (or, in
effect, instructions with opposite predication can never alias).


>>
>>
>> Granted, there is always the risk that the assumed tradeoffs made by the
>> compiler don't match those which are most appropriate for the core (eg:
>> by predicating either too many or too few instructions and leading to
>> less-than-optimal results).
>
> This is why in certain conversations I use the term depessimizing instead of
> optimizing. If code was really "optimized" then further development of the
> compiler would not improve performance.

Probably true...


In my recent effort, I have found one bug in the CPU core (one of the
encodings which was supposed to do an unsigned multiply did a signed
multiply instead); several in the C runtime; and several issues in my C
compiler (some bugs in the type-system logic were causing some necessary
type conversions to not be performed).

C Runtime issues:
"malloc" was making use of a dangling pointer;
The recorded alloc size was incorrect;
The incorrect size could break things after free/reuse.
"memcmp" was effectively broken;
Was effectively only comparing the first and last 8 bytes.
"memcpy" could potentially miss tailing bytes in some cases.
...


Eg, the compiler type-system issue:
unsigned int ui;
int i;
long long li;
i=ui; //simply moved value (assumed high bits didn't matter)
li=i; //simply moved value (assumed high bits were correct)

The result of this was ending up with values not being properly sign or
zero extended if moved along certain paths.
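
A concrete instance of the effect (a small stand-alone test, with values
chosen to expose the missing conversions; the expected output assumes the
usual two's-complement behavior):

#include <stdio.h>

int main(void)
{
    unsigned int ui = 0x80000000u;  /* high bit set                        */
    int i = ui;                     /* wraps to -2147483648 on typical
                                       two's-complement targets            */
    long long li = i;               /* must sign-extend to
                                       0xFFFFFFFF80000000                  */

    /* A correct compiler prints "-2147483648 -2147483648"; a compiler
       that just moves the 64-bit register without sign-extending would
       print 2147483648 for the second value. */
    printf("%d %lld\n", i, li);
    return 0;
}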

I seem to still be getting some weird behavior in these areas (after
poking at some things in the type promotion logic), implying there may
still be some more bugs in this area.


Still haven't resolved the issues I was looking for.


There appears to be a case where there is a behavioral difference
between the CPU core and my emulator, but I have yet to isolate it.

Its main effect is that things can skew/warp weirdly when drawing the
floor and ceiling in Doom in the Verilog implementation.

There also seems to be an issue where "somewhere or another" things are
behaving differently on BJX2 than they do on a normal PC, but given this
is only showing up indirectly (via emergent behaviors), it is being
rather difficult to nail down.

Terje Mathisen

Oct 2, 2020, 2:47:13 AM
MitchAlsup wrote:
> On Thursday, October 1, 2020 at 4:43:04 PM UTC-5, Paul A. Clayton wrote:
>> On Thursday, October 1, 2020 at 4:38:59 PM UTC-4, Terje Mathisen wrote:
>>> MitchAlsup wrote:
>> [snip]
>>>> If code was really "optimized" then further development of the
>>>> compiler would not improve performance.
>>
>> Unless the workload (data set, task frequency, etc.), microarchitecture,
>> system hardware, system occupancy, or other such factor changes.
>
> I am going to disagree, a bit here::
>
> When We did the GBOoO Mc 88120 we learned that your wide reservation
> station architectures wanted exactly one thing for top performance::
> minimum instruction count--and the HW could pretty much absorb the
> variations in latency.
>
> Loop unrolling:: only worked when we were trying to compress the
> loop overhead (AD/CMP/BLT) into what smelled like one loop per cycle.
> Unrolling by 3 looked near optimal, but 4 had fewer odd boundary
> conditions.

Very simple unrolling provides near-optimal performance with low overhead.

I.e. in my tcpip copy_with_checksum code I really fell in love with the,
in several senses of the word, optimal Pentium code:

next_dword:
;; First cycle
mov [edi],edx ; U-pipe: Store previous dword
mov edx,[esi+edi] ; V-pipe: Get the next
;; Second cycle
adc eax,edx ; U: accumulate with carry from previous
lea edi,[edi+4] ; V: Update dest (&source!) ptr
;; Third cycle
dec ecx ; U: Loop counter, carry is saved
jnz next_dword ; V: Predicted as taken


So this is 3 perfect cycles per pair of 16-bit words, or 1.5 cycles/word

In the linux kernel this same code simply used unrolling, something like

next_16:
adc eax,[esi]
adc eax,[esi+4]
adc eax,[esi+8]
adc eax,[esi+12]
lea esi,[esi+16]
adc eax,0
cmp esi,ecx ;; More 16-byte blocks?
jb next_16

Back-to-back ADC from memory ($L1) would take 2 cycles each, so even
with the extra ADC EAX,0 as part of the loop overhead, this code would
on average run slightly faster than my "perfect" version. :-)
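
For comparison, the same accumulate-and-fold idea sketched in portable C
(an illustration of the technique, not Terje's or the kernel's actual code;
byte order and the copy half are ignored, and len is assumed to be a
multiple of 4):

#include <stddef.h>
#include <stdint.h>

/* Internet-checksum style accumulation: sum 32-bit words into a 64-bit
   accumulator (the C counterpart of the ADC chains above), then fold the
   carries back down to 16 bits. */
static uint16_t csum32(const uint32_t *p, size_t len_bytes)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len_bytes / 4; i++)
        sum += p[i];                         /* carries pile up in the top bits */

    while (sum >> 32)                        /* fold 64 -> 32                   */
        sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);     /* fold 32 -> 16                   */
    sum = (sum & 0xffffu) + (sum >> 16);     /* absorb the last carry           */
    return (uint16_t)~sum;                   /* final one's-complement          */
}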

BGB

Oct 2, 2020, 3:07:34 PM
There were, it seems:
Some other cases were, rather than promoting to the correct type for the
operands, promoting to another type instead, and then leading to an
incorrect result.

Eg:
unsigned int x, y;
int z;
...
z=x/y;

Where the bug led to doing the '/' operation as 'int' rather than as
'unsigned int', essentially:
z=((int)x)/((int)y);

This had already been partly addressed, but on another code-path (which broke
with tweaking some other type-system rules, causing it to fall back to a
path which had this bug).

The workaround in this case was to tweak the logic to do, essentially:
unsigned int T; //temporary
T=x/y;
z=(int)T; //needed for correct sign-extension.

The original code-path would do the operation as "unsigned int", but
then the result value would have been zero-extended to 64-bits despite
its type being 'int'.

...

This was mostly an IR level issue (in the RPN -> 3AC conversion stage).


>
> Still haven't resolved the issues I was looking for.
>
>
> There appears to be a case where there is a behavioral difference
> between the CPU core and my emulator, but I have yet to isolate it.
>
> Its main effect is that things can skew/warp weirdly when drawing the
> floor and ceiling in Doom in the Verilog implementation.
>

Some of my other (prior) tweaks appear to have maybe caused this issue
to go away. Still didn't entirely nail it down (the running simulation
is from around the time I fixed a bug with the builtin-memcpy logic, and
worked some on the builtin-memset).

However, some other evidence implies it may have been due to
out-of-range angles (causing accesses to go outside of the sine/cosine
lookup tables), possibly due to one of the previous bugs (previously,
range-masking the angles did partially improve the situation, in that
the floor/ceiling textures looked sensible but were not at the correct
orientation).


> There also seems to be an issue where "somewhere or another" things are
> behaving differently on BJX2 than they do on a normal PC, but given this
> is only showing up indirectly (via emergent behaviors), it is being
> rather difficult to nail down.
>

Still ongoing.

After fixing the type-promotion bugs, the behaviors changed here, but
still aren't really "correct".

Though, this is depending on demo playback in ROTT, which has a lot of
"butterfly effect" properties, such as events which happen can influence
the state of the RNG, which changes other things which happen, ...

Similarly, the player's position depends on knockback, which means when
and how the player gets shot/... influences where they end up, and this
is in turn dependent on the positions and behaviors of the enemies,
which are in-turn subject to the RNG (it tries to get repeatable
playback by reseeding the RNG with 0 whenever playing a demo).

Contrast, Doom engine games seem a little more stable here, as the
enemies do not depend on a shared RNG state (so, a slight variation here
or there is less prone to wreck things; and small errors often take a
little longer to build up to point of breaking the demo; though at least
luckily it supplies its own RNG rather than depending on
"rand()"/"srand()" or similar, ...).


The Quake system of just recording where everything was, ..., was
admittedly a lot simpler, but I guess less subject to point out the
existence of minor behavioral issues.

Side note:
Partly as a test, I did add a more Quake-like demo playback system to
ROTT, but this was partly in my (thus far futile) attempt to locate the
initial cause of the divergence (does seem to occur very early on, as
divergences start to pop up pretty much immediately as soon as playback
starts).


Kinda half wondering if mainstream CPUs do something similar for their
POST tests (eg, using butterfly effect to test a bunch of different
instructions); as testing instructions individually gets pretty bulky
(and a POST test probably doesn't need to care which instruction with
which inputs caused the error; only that an error has taken place).
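
Whether real POST code does this is speculation, but the general idea is
easy to sketch: run a fixed, deliberately chaotic mix of operations and
compare one final signature against a golden value recorded on known-good
hardware (everything below is an invented illustration):

#include <stdint.h>
#include <stdio.h>

/* Any single wrong result anywhere in the mix propagates into the final
   signature, so one compare covers a large set of instructions. */
static uint64_t selftest_signature(void)
{
    uint64_t x = 0x243F6A8885A308D3ull;            /* arbitrary seed      */
    uint64_t sig = 0;
    for (int i = 1; i <= 1000; i++) {
        int r = i % 63 + 1;                        /* rotate count 1..63  */
        x ^= x << 13;  x ^= x >> 7;  x ^= x << 17; /* shift/xor mixing    */
        x += (uint64_t)i * 0x9E3779B97F4A7C15ull;  /* multiply and add    */
        x = (x >> r) | (x << (64 - r));            /* variable rotate     */
        sig ^= x;                                  /* fold into signature */
    }
    return sig;
}

int main(void)
{
    /* On known-good hardware this value would be recorded once and then
       compared against at every power-on self test. */
    printf("signature: %016llx\n", (unsigned long long)selftest_signature());
    return 0;
}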

MitchAlsup

Oct 2, 2020, 3:42:20 PM
The unrolling example I point to comes from the Mc 88120 and in particular
MATRIX300, where we have the inner loop::

DO 90 j = 1,n
IF (beta.EQ.zero) THEN
DO 50 i = 1,m
c(i,j) = zero
50 CONTINUE
ELSE IF (beta.NE.one) THEN
DO 60 i = 1,m
c(i,j) = beta*c(i,j)
60 CONTINUE
END IF
DO 80 l = 1,k
temp = alpha*b(l,j)
DO 70 i = 1,m
c(i,j) = c(i,j) + temp*a(i,l)
70 CONTINUE
80 CONTINUE
90 CONTINUE

And the inner loop does LD a[i,l]; LD c[i,j]; FMUL; FADD; ST c[i,j] --
5 instructions doing the work, along with ADD/CMP/BLT managing the
loop.

If we unroll this 3 times, we get:

packet0:
LDD Ra0,[Ral+Ri<<3]
LDD Rc0,[Rcj+Ri<<3]
FMUL Rt0,Rtemp,Ra0
FADD Rc0,Rt0,Rc0
ST Rc0,[Rxj+Ri<<3]
ADD Ri,Ri,#24
packet1:
LDD Ra0,[Ral+Ri<<3-16]
LDD Rc0,[Rcj+Ri<<3-16]
FMUL Rt0,Rtemp,Ra0
FADD Rc0,Rt0,Rc0
ST Rc0,[Rxj+Ri<<3-16]
CMP Rt,Ri,Rk
packet2:
LDD Ra0,[Ral+Ri<<3-8]
LDD Rc0,[Rcj+Ri<<3-8]
FMUL Rt0,Rtemp,Ra0
FADD Rc0,Rt0,Rc0
ST Rc0,[Rxj+Ri<<3-8]
BLT Rt,packet1

Remember this was on a machine without an FMAC (circa 1991): 6 IPC if the
execution window can absorb L1 miss (which Mc 88120 could). The only reason
to unroll the loop was to amortize the ADD/CMP/BLT so they appeared to take
only the odd leftover space on a per packet basis.

Today, with FMAC, the loop "work" only needs 4 instructions (3 of them memory
refs), and this is one of the reasons I invented the LOOP instructions, so
the loop overhead could be compressed into a single instruction and this
(and many more) loops would need no unrolling to remain optimal.

So, if you are contemplating real vector workloads, you need high bandwidth
low latency memory ports much more than you need several parallel lanes of FP
calculation capability.

George Neuner

Oct 3, 2020, 12:20:49 AM
On Wed, 30 Sep 2020 14:45:11 -0700, Ivan Godard
<iv...@millcomputing.com> wrote:

>My Google searches (via Firefox) always give me a "did you really mean"
>linkwith the original spelling when it does a substitution. Don't you
>get that?

I get that only if the "misspelled" word was entered in quotes: i.e.
searching for the "exact word or phrase".


Without quoting the search term, I get:

Showing results for <suggested correction(s)>
Search instead for <original>?


Which browser doesn't matter ... by default, Google shows results for
what *it thinks* I meant to type.


YMMV, but I don't think users should have to quote search terms. IMO
the default should be to show results for the search that was entered
and to suggest alternative searches - exactly the reverse of what
Google does.


For comparison, DuckDuckGo does not suggest alternatives for an exact
quoted search term. For an unquoted term, it produces results for
BOTH the original and the suggested alternative, and then offers to
show results only for the original term.

Including results for <suggested correction>
Search only for <original>?


IMO, DuckDuckGo's approach is the more useful, but I would prefer that
the search engine by default look for the exact search terms. I would
like it to suggest alternatives (if possible) but not to show results
for them unless I ask it to.

Again, YMMV.
George

Paul A. Clayton

Oct 3, 2020, 9:26:34 AM
On Thursday, October 1, 2020 at 9:09:29 PM UTC-4, MitchAlsup wrote:
> On Thursday, October 1, 2020 at 4:43:04 PM UTC-5, Paul A. Clayton wrote:
>>> MitchAlsup wrote:
>> [snip]
>>>> If code was really "optimized" then further development of the
>>>> compiler would not improve performance.
>>
>> Unless the workload (data set, task frequency, etc.), microarchitecture,
>> system hardware, system occupancy, or other such factor changes.
>
> I am going to disagree, a bit here::
>
> When We did the GBOoO Mc 88120 we learned that your wide reservation
> station architectures wanted exactly one thing for top performance::
> minimum instruction count--and the HW could pretty much absorb the
> variations in latency.

Presumably different code sequences would have different energy
use; with thermal/power-informed dynamic voltage/frequency
adjustment, such effects might have an impact on performance.
(The granularity of monitoring and control would typically not keep
such from happening, but it seems possible.)

Given that OS page placement can have a detectable impact on
performance, I would not rule out other "unimportant" factors from
changing performance. This is one reason that I use "optimize" in
the loose sense rather than the strict sense. In the strict sense, the
word does not seem useful (one would have to specify so many
conditions that the application is severely limited; given the jitter in
performance measurement and the complexity of modeling reality,
I doubt one could even be certain that a particular program is
strictly optimal for a specific machine with a specific workload).

Given the possibility of reducing resource consumption via
instruction fusion and other techniques dependent on instruction
type and relative placement, even instruction count might not be
a perfect measure of performance.

I realize you wrote "a bit" (and it was educational to state that
out-of-order execution tends to average scheduling effects and
otherwise reduce opportunities for performance improvement),
but when using optimize in the strictest sense it is not even clear
Heisenberg uncertainty applies (or how it applies if one only cares
about time and not energy or any other metric) or if one should
compare some "median" of theoretical quantum probabilities. (I
only know enough about physics to be dangerous.)

Of course, for extremely specific workloads, a compiler can
introduce absurd modifications; this is usually called rigging the
benchmark.