Anybody got a number for average number of iterations in loops in general?

Ivan Godard

unread,

Apr 2, 2013, 6:17:37 PM4/2/13

to

I'm trying to get a sense of the actual quality of branch predictors. If
we assume that loops will be perfectly predicted, then the 90+% or so
reported quality of branch prediction algorithms breaks down into some
number of loops with 100% prediction, and some amount of open code with
much poorer prediction quality.

I expected that a quick Google would give me an average iteration count,
which with numbers for the percent of time spent in loops and a few bald
assumptions would give a sense of how well predictors actually do in
open code. However, all I find are counts for particular algorithms,
reported as part of big-O estimates, whereas I'm looking for a number,
even a WAG, for programs in general.

???

Ivan

MitchAlsup

unread,

Apr 2, 2013, 11:24:12 PM4/2/13

to

On Tuesday, April 2, 2013 5:17:37 PM UTC-5, Ivan Godard wrote:
> I'm trying to get a sense of the actual quality of branch predictors.

Branch predictors should, in my opinion, be measured by how many predictions
survice execution and retire instructions. Most branch statistics remove
this kind of information (multiple branches in the shadow of a mispredicted
branch are not counted.)

Thus, with a 90% prediction rate and 10 cycles to branch resolution, only
about 50% of issued instructions retire!

It is like the hit rate of caches has to be considered in the light of the
miss multiplier.

Mitch

Ivan Godard

unread,

Apr 3, 2013, 12:27:25 AM4/3/13

to

Yes, although I'm asking in the context of an in-order so there's no
shadowing. But whether shadows are removed or don't exist, I'm still
trying to get a sense of predictor accuracy for open code, outside of
loops. Any number?

nm...@cam.ac.uk

unread,

Apr 3, 2013, 3:21:13 AM4/3/13

to

In article <kjgave$tbc$1...@dont-email.me>,

Well, sort-of, though I have only rarely seen figures and they were
decades ago. It's far too dependent on the code to simplify that
far - figures of between 80% and 99.95% have been touted.

Let's ignore the often dominant unconditional branches and, since
you say, loops. Though high-count loops are different from low-
count ones. And some compilers are very profligate at adding
unconditional or predictable branches.

In some codes, most branches are rare exceptional cases. Prediction
can be sky high. You have probably seen papers quoting such figures,
when people have been touting their latest gimmick.

However, in a few, there are a lot of nearly equiprobable data-
dependent branches, which are unpredictable. Sorting is a classic.

Regards,
Nick Maclaren.

Andy (Super) Glew

unread,

Apr 3, 2013, 3:33:33 AM4/3/13

to

http://www.jilp.org/jwac-2/program/JWAC-2-program.htm
results of the 2011 championship branch predictor contest

however, they use a metric more like what mitch suggests than what you
want. mppki - misprediction penalty per kilo instructions.

the 2004 contest http://www.jilp.org/cbp/Agenda-and-Results.htm
reported mppi, mispredict per 1000 instructions.
2.5-3 mppi.

more like what ivan is asking for - although i think you may want
mispredicts per branch. I could guess, knowing typical instructions per
branch, but better to get the results.

(I hate it that page count limits force people to report only one
metric, rather than all 3: mpbi, mpi, and mppi)

unfortunately, these mix several codes together. I don't know of purely
open, non-loop, code - but I tend to look at gcc and your favorite
database prediction rates as the closest. (gcc = prediction for the gcc
benchmark).

anastassia ailamaki has made a bit of a specialty of architectural
studies of database workloads, e.g. this 2001
http://www.cs.cmu.edu/~natassa/aapubs/conference/walking%20four%20machines%20by%20the%20shore.pdf

- reporting 1% to 3% branch misprediction rates (mispredicts per branch)

but
http://www.cs.toronto.edu/~ryanjohn/teaching/csc2531-f11/slides/Presentation_Bogdan_CSC2531.pdf
shows high rates circa 20% for, e.g. index scans (which are loopy, but
linky)

it would be nice to see a survey - this just some random googling.

at a company like Intel there are archives of performance data with
stats like these.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies, which has been acquired by
Imagination Technologies; in the past of companies such as Intellectual
Ventures and QIPS, Intel, AMD, Motorola, and Gould), I reveal this only
so that the reader may account for any possible bias I may have towards
my employer's products. The statements I make here in no way represent
my employers' positions on the issue, nor am I authorized to speak on
behalf of my employers, past or present.

nm...@cam.ac.uk

unread,

Apr 3, 2013, 4:49:36 AM4/3/13

to

In article <515BDB4D...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>
>unfortunately, these mix several codes together. I don't know of purely
>open, non-loop, code - but I tend to look at gcc and your favorite
>database prediction rates as the closest. (gcc = prediction for the gcc
>benchmark).

I don't know enough about databases to judge, but I used figures from
the gcc benchmark even in HPC, as a reasonable indication of what a
middle-of-the-range C/C++ program might do.

>it would be nice to see a survey - this just some random googling.

The trouble is that the results are just SO application-dependent.
I can't remember the exact figures, but there was one program I
needed to look at the performance counters, and the misprediction
rate was miniscule (0.1%?) Surprise, surprise - it was a 30+K
loop-count vectorisable program, with an Amdahl's Law figure of 99.5%.

>at a company like Intel there are archives of performance data with
>stats like these.

A few of which might even contain useful information :-)

Regards,
Nick Maclaren.

Terje Mathisen

unread,

Apr 3, 2013, 7:21:45 AM4/3/13

to

nm...@cam.ac.uk wrote:
> In article <kjgave$tbc$1...@dont-email.me>,
> Ivan Godard <iv...@ootbcomp.com> wrote:
>> Yes, although I'm asking in the context of an in-order so there's no
>> shadowing. But whether shadows are removed or don't exist, I'm still
>> trying to get a sense of predictor accuracy for open code, outside of
>> loops. Any number?
>
> Well, sort-of, though I have only rarely seen figures and they were
> decades ago. It's far too dependent on the code to simplify that
> far - figures of between 80% and 99.95% have been touted.

I would say that anything over 80% is _very_ good. :-)

>
> Let's ignore the often dominant unconditional branches and, since
> you say, loops. Though high-count loops are different from low-
> count ones. And some compilers are very profligate at adding
> unconditional or predictable branches.
>
> In some codes, most branches are rare exceptional cases. Prediction
> can be sky high. You have probably seen papers quoting such figures,
> when people have been touting their latest gimmick.
>
> However, in a few, there are a lot of nearly equiprobable data-
> dependent branches, which are unpredictable. Sorting is a classic.

I have tried many, many approaches to handle (i.e. remove) branches like
this, and they almost always end up taking longer than the naive/branchy
version:

Using CMOVcc is only a win when the branch miss rate is significant and
the pipeline cost is very significant, i.e. Pentium4 style long & fast
pipeline.

For sorting (quicksort pivoting stage) I have tried both CMOVcc on the
data items and on the data addresses, assuming that since both data
items have to be in L1, so the cost of writing back the same (unswapped)
data should be close to zero, right?

With random input data (very rarely true!) this should give an
unpredictable 50% swap/noswap branch and a big win for branch removal code.

Turns out this has been wrong every time I have measured it.

My only real success with branch removal has been in SIMD code, where I
could combine 4x parallel processing with straight-line code, this gave
a very significant speedup.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

nm...@cam.ac.uk

unread,

Apr 3, 2013, 8:17:59 AM4/3/13

to

In article <avft2a-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>
>> In some codes, most branches are rare exceptional cases. Prediction
>> can be sky high. You have probably seen papers quoting such figures,
>> when people have been touting their latest gimmick.
>>
>> However, in a few, there are a lot of nearly equiprobable data-
>> dependent branches, which are unpredictable. Sorting is a classic.
>
>I have tried many, many approaches to handle (i.e. remove) branches like
>this, and they almost always end up taking longer than the naive/branchy
>version:

Yup. I have done so once, and it was almost exactly the same speed,
but that was using the all powers of 2 and 3 Shellsort on a true vector
supercomputer :-)

>My only real success with branch removal has been in SIMD code, where I
>could combine 4x parallel processing with straight-line code, this gave
>a very significant speedup.

I have got a factor of 4 on serial code by simple unconditional
branch removal, but that was back in the days before modern branch
predictors!

My more recent experience is similar to yours.

Regards,
Nick Maclaren.

Paul A. Clayton

unread,

Apr 3, 2013, 10:46:41 AM4/3/13

to

If *old* data is acceptable, Table 2 (page 7 of the PDF
I have) in "Branch Prediction for Free" (Thomas Ball and
James R. Larus, TR#1137, 1993) gives the misprediction
rates for loop branches with perfect static prediction.
I think the multiplicative inverse of such would give
the average count for such branches within a loop.

E.g., gcc (from SPEC89) was 15% (c. 6.7 iterations),
xlisp was 19% (c. 5.3 iterations) , awk was 3% (c. 30
iterations), fpppp was 34% (c. 2.9 iterations--surprising
for an FP program--with 86% of branches being non-loop).

Paul A. Clayton

unread,

Apr 3, 2013, 11:05:08 AM4/3/13

to

On Apr 3, 3:33 am, "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> (I hate it that page count limits force people to report only one
> metric, rather than all 3: mpbi, mpi, and mppi)

There is not as much excuse for JILP papers since
they are published online as PDFs. For something
like the branch prediction contest, it might even
be practical to have such tables generated
automatically.

NOTE: I am not complaining about JILP! I think it
is wonderful to have a free online journal. JILP
is/was not perfect (and seems to have stopped
publishing in 2011), but it published some
interesting content.

Ivan Godard

unread,

Apr 3, 2013, 12:04:54 PM4/3/13

to

Did you ever figure out *why* if-conversion (conditional-move in yor
case) was ineffective? Did the microcode turn it back into a branch?

Ivan

Terje Mathisen

unread,

Apr 3, 2013, 3:44:42 PM4/3/13

to

No, I did not, and this has been frustrating. :-(

One example was like this:

; Swap if [ESI] > [EDI]

mov eax,[esi]
mov ebx,[edi]
cmp eax,ebx
mov edx,ebx
cmova ebx,eax
cmova eax,edx
mov [esi],eax
mov [edi],ebx

where it was faster to do a "JA skip" after the CMP and then write the
data back (in opposite order) only if needed.

Another case used a dummy write target so that I could safely write data
instead of skipping the stores to the destination, also not a win.

I've also tried to unroll the qsort scan for values to swap, doing N
iterations with the address update nulled out after finding a candidate.

I have of course used branchless code for things like max() and min(),
with SBB to spread the carry bit across a register that was used to mask
one or more operations.

Branch Predictors are simply too good! :-)

MitchAlsup

unread,

Apr 4, 2013, 12:16:38 AM4/4/13

to

On Wednesday, April 3, 2013 2:44:42 PM UTC-5, Terje Mathisen wrote:
> I have of course used branchless code for things like max() and min(),

Max, min, and saturate SHOULD be instructions these days.

But I think it would be highly instructive to compare the performance of
branch predictors while running excursions through the OS and ignore the
performance of the same predictors while running applications. The best
performance on the excursions through the OS wins.

The results would probably separate the men from the boys.

Mitch

Ivan Godard

unread,

Apr 4, 2013, 12:33:54 AM4/4/13

to

On 4/3/2013 9:16 PM, MitchAlsup wrote:
> On Wednesday, April 3, 2013 2:44:42 PM UTC-5, Terje Mathisen wrote:
>> I have of course used branchless code for things like max() and min(),
>
> Max, min, and saturate SHOULD be instructions these days.
>

Mill does saturate, but it uses our zero-latency equivalent of
conditional move for max and min; the timing is the same as the
relational op. People have occasionally argued that max and min should
be native to avoid the relational result adding to operand clutter, but
have never yet convinced the rest that it was worth the ISA entropy.

> But I think it would be highly instructive to compare the performance of
> branch predictors while running excursions through the OS and ignore the
> performance of the same predictors while running applications. The best
> performance on the excursions through the OS wins.
>

I recall an estimate that 25% of OS cycles were memcpy and memset. Those
should predict pretty well :-)

Ivan

Andy (Super) Glew

unread,

Apr 4, 2013, 5:32:18 AM4/4/13

to

> Ivan Godard <iv...@ootbcomp.com> wrote:
>> On 4/2/2013 8:24 PM, MitchAlsup wrote:
>>> It is like the hit rate of caches has to be considered in the light of the
>>> miss multiplier.
>>
>> Yes, although I'm asking in the context of an in-order so there's no
>> shadowing. But whether shadows are removed or don't exist, I'm still
>> trying to get a sense of predictor accuracy for open code, outside of
>> loops. Any number?

I hope that the pointers I provided were useful. I expect
scholar.google.com will provide more, although you will probably have to
grab them from multiple papers. Survey papers are nice, but rare.

The outside world doesn't necessarily have the luxury or archived
performance results that a company like Intel has. Although a modest
academic research group may well have. Hmmm.... I may have archived
result from back when I was in gradschool that I can share, but I think
that you are better off with more recent.

Or... some nice person might run vtune on some code, and send you the
numbers.

Anyway, I feel impelled to jump at your statement

>>I'm asking in the context of an in-order so there's no
>> shadowing.

This is only true for the simplest possible in-order machine:

an in-oder machine that issues an instruction, and then waits until it
is finished before starting the next.

Many in-order machines issue and instruction, and then issue the next as
soon as the previous cannot fault, and so long as the next does not have
any RAW, WAW, or WAR hazards with the previous. Heck, you can do
register renaming on an in-ordrer processor, and only stall at the first
RAW hazard.

Many in-order machines also have branch prediction.

So in-order machines may also execute a large fraction of instructions
that will never actually be used.

It's the accuracy of prediction,
* the number of instructions on the wrong-path,
before the misprediction is corrected for a better-path.

Andy (Super) Glew

unread,

Apr 4, 2013, 5:36:31 AM4/4/13

to

On 4/3/2013 4:21 AM, Terje Mathisen wrote:

> nm...@cam.ac.uk wrote:
>> In article <kjgave$tbc$1...@dont-email.me>,
>> Ivan Godard <iv...@ootbcomp.com> wrote:
>>> Yes, although I'm asking in the context of an in-order so there's no
>>> shadowing. But whether shadows are removed or don't exist, I'm still
>>> trying to get a sense of predictor accuracy for open code, outside of
>>> loops. Any number?
>>
>> Well, sort-of, though I have only rarely seen figures and they were
>> decades ago. It's far too dependent on the code to simplify that
>> far - figures of between 80% and 99.95% have been touted.
>
> I would say that anything over 80% is _very_ good. :-)

Y'know, we in the biz talk about a 1% improvemet in the prediction rate,
from 95% to 94%, being a 20% reduction in the misprediction rate.
And 94%->93% being 25%...

Basically, predictors can be very accurate.

But less so for workloads I consider interesting.

nm...@cam.ac.uk

unread,

Apr 4, 2013, 5:39:15 AM4/4/13

to

In article <515D48A2...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:

>> Ivan Godard <iv...@ootbcomp.com> wrote:
>
>Anyway, I feel impelled to jump at your statement
>
> >>I'm asking in the context of an in-order so there's no
> >> shadowing.
>
>This is only true for the simplest possible in-order machine:
>
>an in-oder machine that issues an instruction, and then waits until it
>is finished before starting the next.
>
>Many in-order machines issue and instruction, and then issue the next as
>soon as the previous cannot fault, and so long as the next does not have
>any RAW, WAW, or WAR hazards with the previous. Heck, you can do
>register renaming on an in-ordrer processor, and only stall at the first
>RAW hazard.

Out of curiosity, what would you call a traditional pipelined
machine that issues instructions in strict order as soon as
the previous one has moved one position along the pipeline?
It can fault while up to pipeline instructions are on the go,
but nothing is ever reordered.

Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,

Apr 4, 2013, 5:45:44 AM4/4/13

to

In article <515D499F...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:

>On 4/3/2013 4:21 AM, Terje Mathisen wrote:
>>>>
>>>> Yes, although I'm asking in the context of an in-order so there's no
>>>> shadowing. But whether shadows are removed or don't exist, I'm still
>>>> trying to get a sense of predictor accuracy for open code, outside of
>>>> loops. Any number?
>>>
>>> Well, sort-of, though I have only rarely seen figures and they were
>>> decades ago. It's far too dependent on the code to simplify that
>>> far - figures of between 80% and 99.95% have been touted.
>>
>> I would say that anything over 80% is _very_ good. :-)
>
>Y'know, we in the biz talk about a 1% improvemet in the prediction rate,
>from 95% to 94%, being a 20% reduction in the misprediction rate.
>And 94%->93% being 25%...

You mean 95%->96%.

>Basically, predictors can be very accurate.

Especially on the sort of codes that are fairly straightforward
to analyse, optimise, vectorise and parallelise. All of the
predictability properties are loosely associated.

>But less so for workloads I consider interesting.

Yes. I have posted before how I think they could be improved
for the object-oriented access patterns, but it's provably
impossible to do much about the trickier codes.

Regards,
Nick Maclaren.

Andy (Super) Glew

unread,

Apr 4, 2013, 6:04:13 AM4/4/13

to

I wrote up an explanation for the Intel compiler optimization guide.

Part is an artifact of the implementation:

Until recently, Intel OOO machines were fundamentally 2-input dataflow
machines.

CMOV or SELECT is a 3 input, 1 output, operation:

CMOV dest := if condition then src else old-value-of-dest

When I say "dataflow" I mean that they are required to read both the src
and the old-value of the dest.

Whichever arrives last delays the whole instruction.

Furthermore, 2-inputs meant that the microcode looks like:

tmp := merge(cond,src)
dest := cond_select_from_high_bits( tmp,old-value-of-dest)

I.e. CMOV is 2 uops.

But even if CMOV is 1 uop, it costs as many inputs as 1.5 uops.
Sometimes 2, since many uops have only one register input, the other
being constant. And often the true superscalar limit is not number of
instructions, but number of register reads.

Anyway, so something like

...
IF a[i] > amax THEN
amax = a[i]
amaxindex = i
FI
...

looks like, in CMOV uops

flags := CMP a[i], amax
CMOV amax := flags.gt, a[i], amax
CMOV amaxindex := flags.gt i, amaxindex

each CMOV uop costing 2 2-input uops.

whereas branchy code looks like

flags := CMP a[i], amax
IF flags.gt GO around
max := a[i]
amaxindex := i
around:

i.e. the branchful code costs
branch taken
1 input, 1 instruction
branch not taken
3 inputs, 3 instructions
less 1 instruction with CMP-JMP fusion.

whereas the CMOV code costs
4 uops at all times, but 6 RF read ports.

Cost of the branch is weighted by probability
Ptaken * 1inst + (1-Ptaken)*3 instructions + Pmispredect*dependent

Say you have 95% prediction (and, for a max scanning loop, that is way
too low - more like 99%)
and the code is arranged properly
Let the dependent penalty be 20 "units"

3-2*Ptaken + 20/20 compared to 4

Well, you get the idea - for these numbers branchless ties.

You can create similar estimate formulae for the RF port bottleneck,
and for power.

Now, there are ways to make branchless more efficient. But simply
throwing hardware at it, in the form of extra register ports, is not
necessarily the best way. And for many codes the branchless version is
almost never better.

Nevertheless, I like branchless code and instructions thereof - because
when branch misprediction goes wrong, it goes very, very wrong. So
branchless often prevents really bad glass jaws.

But branchless is often not as much of a win as people would like.
And while some of the reason is an artifact of the implementation, some
is inherent, no matter what hardware.

As for predicting branchless code:

That was one of the big performance IMPROVEMENTS of the later Itanium.

This can sometimes be better than either branchless or branchful.
The predicate misprediction repair mechanism can be more efficient than
branch misprediction repair.

But, in any case: although branchless code can be ,made betterm it is
far from clear that it is worth doing.

In fact, there has been work on hardware reverse IF conversion - getting
rid of the least efficient examples of branchless code, converting them
to branchy code more efficiently than putting a branch in every
predicate or CMOV.

Andy (Super) Glew

unread,

Apr 4, 2013, 6:10:03 AM4/4/13

to

FPPP has (had) a 56KB long inner loop.

Oh, by the way: until specialized loop predictors came into play, loops
much longer than 8 or 10 iterations were guaranteed to have a branch
misprediction at the end.

E.g. a 50 iteration loop would have a branch misprediction rate of 2%.

Which is often twice the average prediction accuracy of well behaved
programs.

This is for counted loops. WHILE loops are worse.

I.e. loops are often NOT the best behaved branches in programs.

nm...@cam.ac.uk

unread,

Apr 4, 2013, 6:31:00 AM4/4/13

to

In article <515D517B...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>

>Oh, by the way: until specialized loop predictors came into play, loops
>much longer than 8 or 10 iterations were guaranteed to have a branch
>misprediction at the end.
>
>E.g. a 50 iteration loop would have a branch misprediction rate of 2%.
>
>Which is often twice the average prediction accuracy of well behaved
>programs.
>
>This is for counted loops. WHILE loops are worse.
>
>I.e. loops are often NOT the best behaved branches in programs.

For counted loops (though not the C/C++ horrors), that is true
only because the hardware lacks the capability to be told about
them. Many architectures had (have?) instructions like decrement
and branch if positive, which can be predicted perfectly.

Regards,
Nick Maclaren.

Terje Mathisen

unread,

Apr 4, 2013, 7:35:41 AM4/4/13

to

Way, way back when (1982?) I was told by the lecturer that later went on
to start what became FAST, the search company, that IBM mainframes
tended to spend up to 30% of their CPU inside the (COBOL) sorting routine.

He used this as the lead-in to presenting IBM's patended (but never
produced) "zero-cycle" sorting chip:

It was basically a regular array of storage and comparators that you
could stream data into, then immediately afterwards stream sorted data
out. The problem was of course that as a co-processor it would be
limited by bus transfer rates, and the maximum amount you could handle
was whatever would fit inside the chip.

Paul A. Clayton

unread,

Apr 4, 2013, 10:09:39 AM4/4/13

to

On Apr 4, 6:31 am, n...@cam.ac.uk wrote:
[snip]

> For counted loops (though not the C/C++ horrors), that is true
> only because the hardware lacks the capability to be told about
> them. Many architectures had (have?) instructions like decrement
> and branch if positive, which can be predicted perfectly.

Both Power (POWER/PowerPC) and Itanium have a single
count register intended for this purpose. Both of
those ISAs also have multiple condition values, so--
in theory--nested count-based loops could have the
conditions evaluated in the front end by setting a
condition value before entry into the inner loop
for whether the outer loop counter is greater than
one.

Unfortunately, using such in a deep and wide OoO
implementation is difficult. The PowerPC750
evaluated branches in the front end, but even in
that case the branch instruction needed to be
decoded and evaluated.

In theory, one could have a branch identifier
predictor component that would select a counter
predictor (evaluator if the branch ID came from
predecode on Icache fill) and enough information
about the count could be forwarded to such a
predictor/evaluator in _some_ cases early enough
to allow perfect branch prediction/evaluation
for such branches. However, even for Power with
its special register and instructions, the
complexity of managing the count in the front-end
would be significant.

Consider that x86 only somewhat recently started
optimizing PUSH and POP to avoid excessive data
dependency. A front-end loop count evaluator
would be more complex (especially if a special
register and/or instructions are not used--then
relatively complex idiom detection would be
required and likely more sophisticated hardware
for idiom misrecognition correction) and likely
less broadly useful.

I agree that it is sad that branch conditions
which can be known are instead predicted, but it
does not seem likely that this will change in
the near future (especially when loop count
predictors work fairly well and might catch some
cases that a front-end counter would miss due to
delay between the front end and when an explicit
count is not used for the branch [loop predictors
can work for any branch with a regular transition
rate for one direction and a 100% transition rate
for the other direction]).

nm...@cam.ac.uk

unread,

Apr 4, 2013, 11:13:46 AM4/4/13

to

In article <ae076d39-8df0-4f55...@v8g2000yqe.googlegroups.com>,

Paul A. Clayton <paaron...@gmail.com> wrote:
>[snip]
>> For counted loops (though not the C/C++ horrors), that is true
>> only because the hardware lacks the capability to be told about

>> them. =A0Many architectures had (have?) instructions like decrement

>> and branch if positive, which can be predicted perfectly.
>
>Both Power (POWER/PowerPC) and Itanium have a single
>count register intended for this purpose. Both of
>those ISAs also have multiple condition values, so--
>in theory--nested count-based loops could have the
>conditions evaluated in the front end by setting a
>condition value before entry into the inner loop
>for whether the outer loop counter is greater than
>one.
>
>Unfortunately, using such in a deep and wide OoO
>implementation is difficult. The PowerPC750
>evaluated branches in the front end, but even in
>that case the branch instruction needed to be
>decoded and evaluated.

There's no need to go bonkers. Experience on the System/370
was that the simple BCT instruction was as useful to compilers
as the fancy BXL and BXLE ones. And I really do mean just
decrement and branch if non-zero - which scarcely needs a lot
of logic which isn't already there!

Regards,
Nick Maclaren.

Ivan Godard

unread,

Apr 4, 2013, 12:56:47 PM4/4/13

to

On 4/4/2013 2:32 AM, Andy (Super) Glew wrote:
>
>> Ivan Godard <iv...@ootbcomp.com> wrote:
>>> On 4/2/2013 8:24 PM, MitchAlsup wrote:
>>>> It is like the hit rate of caches has to be considered in the light
>>>> of the
>>>> miss multiplier.
>>>
>>> Yes, although I'm asking in the context of an in-order so there's no
>>> shadowing. But whether shadows are removed or don't exist, I'm still
>>> trying to get a sense of predictor accuracy for open code, outside of
>>> loops. Any number?

>

> This is only true for the simplest possible in-order machine:
>
> an in-oder machine that issues an instruction, and then waits until it
> is finished before starting the next.
>
> Many in-order machines issue and instruction, and then issue the next as
> soon as the previous cannot fault, and so long as the next does not have
> any RAW, WAW, or WAR hazards with the previous. Heck, you can do
> register renaming on an in-ordrer processor, and only stall at the first
> RAW hazard.
>
> Many in-order machines also have branch prediction.
>
> So in-order machines may also execute a large fraction of instructions
> that will never actually be used.
>
> It's the accuracy of prediction,
> * the number of instructions on the wrong-path,
> before the misprediction is corrected for a better-path.

Well, sort of. Granted, I tend to use "out of order" unfairly as a
rubric for everything the Mill doesn't do :-) There is certainly an
important class that is in-order and dynamically scheduled. In the
context with Paul it's the lack of dynamic scheduling that mattered.

On a static schedule exposed pipe, like the Mill and some VLIWs and
DSPs, there are only two cases where you execute instructions you don't
need: deliberate compiler-inserted speculation, and skid buffer issue
overrun because the machine takes more than a cycle to stop once it gets
going. There are no register hazards of any kind (and no need for
register renaming - in the Mill case there aren't any registers to
rename), and memory hazards are part of the programming model and cannot
fairly be called "hazards". Hence there are no shadows.

Except for those two cases, there's no power wasted on obviated
execution. Deliberate speculation is a trade-off of power vs. time that
is independent of and irrelevant to misprediction. The skid-buffer
problem is real; speed of light says you can't stop a core on a dime. In
his current work on the Mill, Art can detect a mispredict and shut down
issue in one cycle, so there's one instruction worth of skid needed,
although to get the rest of the core properly stopped takes three cycles
that overlaps with mispredict recovery. If we brought the clock up to
what other cores do we would certainly need two cycles of skid and maybe
three, and the retire timing of non-speculable ops would need to be
rethought. Fortunately there's only a handful of non-speculable opcodes
out of the thousand or so in total.

In any case, the "accuracy*number of instructions down wrong path"
product is for us the same as accuracy because we only go one issue down
the wrong path. DSPs with smaller cores and lower clocks don't need a
skid buffer at all because they can detect the miss and stop without
overrun, so for them the A*N is a fat zero.

My concern with misses is not so much the power used in wasted
execution, but by the latency until properly restarted on the right
path. Stalls hurt, and a wide-issue machine is very vulnerable to
stalls. While there's only one instruction of skid, there are also four
cycles of stall, minimum, before the first "right" instruction executes.
That's a half to a third the cost of a typical OOO, but still makes
mispredict the second biggest cause of stall, after L2 miss.

Mispredict comes in two flavors: taken misses, where execution takes a
transfer that it didn't expect; and untaken misses, where execution
falls-through where it expected to branch instead. These have different
recovery problems. For example, the taken miss is much more likely to
lead to a miss in the i$1 than is the untaken miss, especially if the
miss is a call - calling a function you never called before is an entry
into cold code that the predictor knows nothing about, and you can
expect to spend a while waiting to warm up. By coincidence I'm working
on the patent filing for our stuff dealing with prediction right now. Or
rather I would be if I weren't writing this :-) Patents are sheer
drudgery.

Ivan

Ivan Godard

unread,

Apr 4, 2013, 12:58:41 PM4/4/13

to

I call it in-order exposed-pipe.

Ivan

Ivan Godard

unread,

Apr 4, 2013, 3:39:44 PM4/4/13

to

On 4/4/2013 3:04 AM, Andy (Super) Glew wrote:
> On 4/3/2013 9:04 AM, Ivan Godard wrote:

<snip>

>>
>> Did you ever figure out *why* if-conversion (conditional-move in yor
>> case) was ineffective? Did the microcode turn it back into a branch?
>
> I wrote up an explanation for the Intel compiler optimization guide.
>

<Andy's elegant and comprehensive description of cmov problems omitted>

We faced exactly the same issue, except that a mispredict is much more
painful for a wide-issue machine like the Mill than for an OOO, for any
given pipeline length.

The biggest fix was to cut the pipe length. The length is a
multiplicative penalty, so a pipe one third as long can stand a 3X
greater miss rate for the same overall cost.

Then we wanted an if-converter operator that was extremely high
performance and low cost, because we wanted the compiler to be doing
conversions all over everywhere. We almost went for predication instead,
and I can see why the ARM and Itanium did so, but single predication
doesn't work well if you expect to be speculating a lot of different
paths concurrently, while Itanium-style multi-predication costs more
encoding bits than we wanted to spend, and mucked up the compiler to boot.

The three-input problem Andy describes is real; almost any machine, Mill
included, is optimized for two-input ops. We had to add special
facilities in our instruction bundle format to deal with the triadic
encoding. The Mill has no general registers, and so no
register-to-register move operations (conditional or otherwise) with
their associated register file port problems, so we could not use a CMOV.

Instead we use a "select" operation (asm mnemonic is "pick" in our case)
that implements the C/C++ "?:" operation. Pick is zero-latency. That is,
"a = b < c ? d : e" has the same latency as "a = b < c", or "a = b + c"
for that matter.

With this, the balance between selection and branch tips hard in favor
of selection, compared with what Andy had to deal with. The Mill, like
the Itanium although done very differently, is designed for heavy and
wide speculation. Many codes that you'd think require branches turn out
not to do so.

For example, the inner per-character loop of "wc" contains a switch on
the input character that does different things for white space, EOL, EOF
and other characters. On a Mill this is a software pipelined
one-instruction loop with ~16 ops in the loop body (including four
"pick"s) when on a big enough Mill family member to have that many FUs.
Otherwise it's a two-cycles-per-character loop body. Either way, the
only branch is the loop itself, which of course predicts perfectly
except for the final iteration that exits to the print-the-counts code.

Incidentally, this is for a naive just-compile-the-code-ma'am
implementation, that doesn't do any vectorization. The naive code
contains a switch on each character. That gets converted to the
equivalent set of parallel "if"s (a common step in compilers) and then
the "if"s get if-converted to "pick".

We used to have a "switch" control flow operation that code like "wc"
would use, but took it back out; prediction was lousy, it mucked up the
instruction flow from the memory hierarchy, and for those cases where
exploding a switch into if's wasn't practical (inner loop of yacc for
example) a data-side jump-address table was as good. The alternative of
a code-side jump-instruction table cluttered the predictor, as I recall
someone noting here, and makes it hard to blind-disassemble code.

Ivan

Ivan Godard

unread,

Apr 4, 2013, 4:15:52 PM4/4/13

to

We've been back and forth on the subject of predicting loops. It's clear
that a decrement-and-branch-if-zero or similar doesn't work for us; it
doesn't encode well, and offers no gain over the equivalent primitives.
What we wanted was a way to tie the execution and prediction engines
together so that the exit-condition testing was done in the predictor,
not the executor. We have something toward that end, but are not that
happy with it and it's not in the current Mill pending some bright
ideas. The problems include nested loops (even if all counted); embedded
calls and interrupts; and mispredicts on intra-loop control flow.

nm...@cam.ac.uk

unread,

Apr 4, 2013, 4:35:25 PM4/4/13

to

In article <kjkmuj$4j0$1...@dont-email.me>,

Ivan Godard <iv...@ootbcomp.com> wrote:
>
>We've been back and forth on the subject of predicting loops. It's clear
>that a decrement-and-branch-if-zero or similar doesn't work for us; it
>doesn't encode well, and offers no gain over the equivalent primitives.

No, it wouldn't. The main gain would be for OoO, because the actual
direction can be detected as soon as the instruction is decoded.

>What we wanted was a way to tie the execution and prediction engines
>together so that the exit-condition testing was done in the predictor,
>not the executor. We have something toward that end, but are not that
>happy with it and it's not in the current Mill pending some bright
>ideas. The problems include nested loops (even if all counted); embedded
>calls and interrupts; and mispredicts on intra-loop control flow.

Good luck. I have never seen it done well, and it's not from
general incompetence. It's not obvious that there is a good
solution.

Regards,
Nick Maclaren.

Andy (Super) Glew

unread,

Apr 5, 2013, 12:47:04 AM4/5/13

to

On 4/4/2013 2:39 AM, nm...@cam.ac.uk wrote:

In-order, for sure.

If it unconditionally issues the next instruction, one per cycle, I
would call it a processor without interlocked pipeline stages.

If the previous instruction can stall, what I would call it would be the
nature of the stall. E.g. it might stall on WAW, WAR, RAW hazards.
Myself I might have called that a scoreboard machine at one time,
although Mitch I think knows the history better, and says that a
scoreboard is something else entirely. RAW hazards only, and I would
call it an in-order machine with register renaming - and probably also a
scoreboard.

If the result of a later instruction is constrained to be written after
all earlier instructions, in-order dispatch in-order completion. Else
in-order dispatch, out-of-order completion.

>
>
> Regards,
> Nick Maclaren.

Ivan Godard

unread,

Apr 5, 2013, 2:42:39 AM4/5/13

to

So what do you call us?
Issue and dispatch the same thing, in-order, stalls if data not
available (no hazard stalls; hazard not meaningful)
Retires statically out of order.
In-flights complete under stall or trap, result replay at unstall/return
preserves retire order and timing (no re-execute)

Perhaps "strange"? :-)

Ivan

Andy (Super) Glew

unread,

Apr 5, 2013, 3:08:52 AM4/5/13

to

Not that strange.

Wen-mei Hwu was my advisor, Prof. VLIW compiler. Your machine sounbds
very much like the sort of machine our research group was working with,
except for the "no registers" part.

How about:
in-order issue, out-of-order complete & commit,
imprecise, draining stall and trap with provision for treplay & recovery

But, at a certain level, one just says "The Mill",
see your paper...

Terje Mathisen

unread,

Apr 5, 2013, 6:01:02 AM4/5/13

to

Ivan Godard wrote:
> Instead we use a "select" operation (asm mnemonic is "pick" in our case)
> that implements the C/C++ "?:" operation. Pick is zero-latency. That is,
> "a = b < c ? d : e" has the same latency as "a = b < c", or "a = b + c"
> for that matter.

Nice!

>
> With this, the balance between selection and branch tips hard in favor

I can see that it would, as I stated my wide (SIMD) code is the one big
win I've seen (lately) from branch elemination.

> of selection, compared with what Andy had to deal with. The Mill, like
> the Itanium although done very differently, is designed for heavy and
> wide speculation. Many codes that you'd think require branches turn out
> not to do so.
>
> For example, the inner per-character loop of "wc" contains a switch on
> the input character that does different things for white space, EOL, EOF

wc? Yes!!! :-)

> and other characters. On a Mill this is a software pipelined
> one-instruction loop with ~16 ops in the loop body (including four
> "pick"s) when on a big enough Mill family member to have that many FUs.
> Otherwise it's a two-cycles-per-character loop body. Either way, the
> only branch is the loop itself, which of course predicts perfectly
> except for the final iteration that exits to the print-the-counts code.

I have probably spent more time optimizing wc than any other programmer,
it was my test target for any new x86 micro-architecture.

I've even done a conference presentation (in the Monterey area, I know
some of you have been there) about cpu micro-architecture where I used
this code to illustrate the code changes needed over the years. :-)

The 16-bit in-order Pentium version suffered quite a bit from the need
to use multiple segments, the extra ES: segment override added 33 (or
50?)% overhead, but it still ran in 1.5 clock cycles/byte afair, while
counting lines, words and chars just like the original wc.

> Incidentally, this is for a naive just-compile-the-code-ma'am
> implementation, that doesn't do any vectorization. The naive code
> contains a switch on each character. That gets converted to the
> equivalent set of parallel "if"s (a common step in compilers) and then
> the "if"s get if-converted to "pick".
>
> We used to have a "switch" control flow operation that code like "wc"
> would use, but took it back out; prediction was lousy, it mucked up the
> instruction flow from the memory hierarchy, and for those cases where
> exploding a switch into if's wasn't practical (inner loop of yacc for
> example) a data-side jump-address table was as good. The alternative of
> a code-side jump-instruction table cluttered the predictor, as I recall
> someone noting here, and makes it hard to blind-disassemble code.

My wc code started out as (naive) unrolled but branchy 8088 asm, then
used a 256-entry lookup table to allow arbitrary sets of word/non-word
characters.

I had state machines in threaded code, then I switched to data
dependencies instead of code-driven state machines. I.e. from a directly
threaded state machine to a totally branchless (well, one loop branch
every 256 input chars) version.

Since the P6 era I have stopped updating it, the last version has 5 or 6
different algorithms, it runs all of them on the first block of data
then picks the fastest for the rest of the processing.

Terje

PS. For a really wide vector machine the fastest approach is probably to
(ab)use the gather hw in order to do parallel lookups of all the current
input bytes, classifying them with a couple of bits each
(CR,LF,other_separator,word-char), pack those bits together into blocks
of 8 or so, then use groups of 10 bits in a second gather operation to
convert this into line and word increments?

Michael S

unread,

Apr 5, 2013, 9:08:52 AM4/5/13

to

On Apr 5, 9:42 am, Ivan Godard <i...@ootbcomp.com> wrote:
> On 4/4/2013 9:47 PM, Andy (Super) Glew wrote:
>
>
>
>
>
>
>
>
>

> > On 4/4/2013 2:39 AM, n...@cam.ac.uk wrote:
> >> In article <515D48A2.2070...@SPAM.comp-arch.net>,

> >> Andy (Super) Glew <a...@SPAM.comp-arch.net> wrote:

From 10,000 miles looks a lot like Alpha EV5. Last major attempt at
this style was IBM Power6. Which was *not* a great success. Actually,
in lights of its successor Power6 appears very bleak.

MitchAlsup

unread,

Apr 5, 2013, 11:56:23 AM4/5/13

to an...@spam.comp-arch.net

On Thursday, April 4, 2013 11:47:04 PM UTC-5, Andy (Super) Glew wrote:
> In-order, for sure.

> If it unconditionally issues the next instruction, one per cycle, I
> would call it a processor without interlocked pipeline stages.

> If the previous instruction can stall, what I would call it would be the
> nature of the stall. E.g. it might stall on WAW, WAR, RAW hazards.
> Myself I might have called that a scoreboard machine at one time,
> although Mitch I think knows the history better, and says that a
> scoreboard is something else entirely. RAW hazards only, and I would
> call it an in-order machine with register renaming - and probably also a
> scoreboard.

I think that the word scoreboard should be reserved for those kinds fo control
logics where instructions are issued in the face of hazzards but instructions
do not begin execution nor do they deliver results while any hazzard is
pending on that instruction. That is a scoreboard manages both the starting
and ending of instructions bases solely on hazzard prevention.

99% of what people call a scoreboard is a simple interlock mechanism.

Only the CDC 6600 and a couple of heavily banked memory controlers that I
have run into meet the scoreboard definition above.

> If the result of a later instruction is constrained to be written after
> all earlier instructions, in-order dispatch in-order completion. Else
> in-order dispatch, out-of-order completion.

Mitch

Ivan Godard

unread,

Apr 5, 2013, 12:37:12 PM4/5/13

to

Yes, as far as pipeline structure goes we are very much like what DSPs
do, especially VLIW DSPs. Not really "imprecise" though; depends on what
you mean by "precise".

Legacy machines model computation as a sequence of isolated operations
executed one by one; a trap is precise if there exists a point at which
everything issued prior in the sequence has completed and is reflected
in global state, and nothing after is. Think traffic on a one-lane road.

On an exposed-pipe parallel machine, the model is of a fleet of many
operations each extending for a defined extent in time and overlapping
in a specified way. Think traffic on an eight lane highway carrying
everything from motorcycles to duplex big rigs; its only coincidence if
there is a point when all eight lanes are empty.

For such a multi-lane CPU, the usual definition of "precise" is not
meaningful; it's precise enough, but overlapping. Or rather, it's
precise w/r/t retire. You can shoot a laser across that highway, and
split the traffic into three groups: that which is completely past the
laser, that which hasn't got to the laser yet, and that which the laser
put a hole in.

At trap, all the holed traffic is set aside, and the set aside and all
that hadn't got to the laser yet have no effect on the global state.
That's quite precise, in the ordinary breakfast table meaning of
precise. It's certainly not "imprecise" in the sense of an unspecified
or unknowable order.

But it is true that we can't partition by issue time the way we
partition by retire time. Consider a motorcycle and a big rig both
approaching the laser point. If the rig is slightly ahead it will issue
before the bike, but the bike is shorter and can clear through before
the laser fires, while the rig, though issued earlier, gets caught by
the laser.

I'd not thought of this traffic analogy before, but it's quite apt.
Perhaps a good word would be to say we (and other exposed pipe machines)
are "retire precise" whereas the garden-variety OOO is "issue precise".

For practical considerations, like dealing with interrupts, either way
of doing precision is quite satisfactory. I just object to calling the
Mill "imprecise", because that word has a bad rep from old machines in
which interrupts left the machine in undefined state, which the Mill
doesn't do at all.

Ivan

Ivan Godard

unread,

Apr 5, 2013, 12:54:15 PM4/5/13

to

I think I have read some of your wc papers and being blown away by the
perverse ingenuity of it all :-)

I'm a language/compiler guy, and a design requirement from the beginning
for the Mill is that it be a compiler target. Not just a compiler
target, but a dumb compiler target. And not just a dumb compiler target,
but a fast, simple, dumb compiler target.

We simply haven't don't have hardware features that would brush your
teeth but can't be found find in random code as-is off the net by a
compiler. I think that the current desperate effort to find a way to
automagically rewrite serial code into parallel is fundamentally
wrong-headed.

In this I have followed the philosophy of the B6500 that was the target
of my first compiler: get the architecture so that a one-pass
non-optimizing naive compiler produces perfect code.

And leave the asm heroics to Terje :-)

Ivan

Ivan Godard

unread,

Apr 5, 2013, 1:03:47 PM4/5/13

to

On 4/5/2013 6:08 AM, Michael S wrote:
> On Apr 5, 9:42 am, Ivan Godard <i...@ootbcomp.com> wrote:

<snip>

>> Perhaps "strange"? :-)

> From 10,000 miles looks a lot like Alpha EV5. Last major attempt at
> this style was IBM Power6. Which was *not* a great success. Actually,
> in lights of its successor Power6 appears very bleak.
>

Very very far from either of those.

Some resemblance to the TI VLIWs, the pipeline especially. Some to the
KDF9. Surprisingly, some to Stretch. A fair amount to multifunction I/O
processor ASICs. Quite a bit to Cisco-style packet engines.

Ivan

Bakul Shah

unread,

Apr 5, 2013, 3:48:22 PM4/5/13

to

On 4/5/13 9:54 AM, Ivan Godard wrote:
> I'm a language/compiler guy, and a design requirement from the beginning for the Mill is that it be
> a compiler target. Not just a compiler target, but a dumb compiler target. And not just a dumb
> compiler target, but a fast, simple, dumb compiler target.
>
> We simply haven't don't have hardware features that would brush your teeth but can't be found find
> in random code as-is off the net by a compiler. I think that the current desperate effort to find a
> way to automagically rewrite serial code into parallel is fundamentally wrong-headed.

What choice do you have if you want to make existing code run
faster? May be you need both static and dynamic analysis to
speed things up as you have access to different information
at compile, link and runtime. And may be you need to gather
historical data as well.

> In this I have followed the philosophy of the B6500 that was the target of my first compiler: get
> the architecture so that a one-pass non-optimizing naive compiler produces perfect code.

Is "perfect" achievable? There are so many optimization
tricks. For instance, how can you do runtime strength
reduction or move loop invariant code out of the loop or
interprocedural (or Stalin style whole program) optimization?
Nevertheless this sounds very interesting....

Ivan Godard

unread,

Apr 5, 2013, 5:15:00 PM4/5/13

to

On 4/5/2013 12:48 PM, Bakul Shah wrote:
> On 4/5/13 9:54 AM, Ivan Godard wrote:
>> I'm a language/compiler guy, and a design requirement from the
>> beginning for the Mill is that it be
>> a compiler target. Not just a compiler target, but a dumb compiler
>> target. And not just a dumb
>> compiler target, but a fast, simple, dumb compiler target.
>>
>> We simply haven't don't have hardware features that would brush your
>> teeth but can't be found find
>> in random code as-is off the net by a compiler. I think that the
>> current desperate effort to find a
>> way to automagically rewrite serial code into parallel is
>> fundamentally wrong-headed.
>
> What choice do you have if you want to make existing code run
> faster? May be you need both static and dynamic analysis to
> speed things up as you have access to different information
> at compile, link and runtime. And may be you need to gather
> historical data as well.

The two approaches are somewhat incompatible. Clearly a statically
scheduled machine cannot use dynamic information other than by using a
JIT to produce rewritten code at runtime. However, a dynamically
scheduled machine cannot usually make much use of static scheduling
either, except in carefully hand-tuned code. The problem is partly that
the compiler usually does not know what the target machine really is
(out of several models or versions), and even if the model is available
and reliable the potential impact of static scheduling choices for that
model is not publicly known. In addition, what constitutes the optimal
schedule on a dynamic machine commonly depends on dynamic and transient
considerations that will be unknowable to a compiler.

That said, compilers for dynamic machines can certainly do a better
static job than they usually do. The question is whether the compiler
work is worth the trouble, and the externality problem: who pays and who
gains.

>> In this I have followed the philosophy of the B6500 that was the
>> target of my first compiler: get
>> the architecture so that a one-pass non-optimizing naive compiler
>> produces perfect code.
>
> Is "perfect" achievable? There are so many optimization
> tricks. For instance, how can you do runtime strength
> reduction or move loop invariant code out of the loop or
> interprocedural (or Stalin style whole program) optimization?
> Nevertheless this sounds very interesting....

"Perfect" is available if the instruction set is not duplicative. If
there is only one way (on the low-level hardware) to do some high-level
language thing, then code is "perfect" if the compiler generates that
code and no more.

"Perfect" is impossible when there are a host of hardware ways to do
something but no good way (or too expensive) to choose which is "best".
The VAX was notorious for this; there were tons of addressing modes,
each with tiny advantages over the others in special corner cases. As a
result instruction selection was a compiler pain, and rarely done well.

I'll display my biases here: instruction sets designed by compiler
writers tend to be semantically simple and easy to generate "perfect"
code for, while instruction sets designed by hardware engineers tend to
be mechanically simple and very hard to get code "perfect" for. Thus
stack machines like the KDF9 and its spiritual relatives have easy
compilers and perfect code, while the DEC Alpha tended to the reverse,
and in addition dumped hard problems like memory consistency past the
compiler onto the user. Who promptly dropped the ball of course.

I suspect Nick programmed the KDF9 at some point in his career, and
probably shares my admiration for the design given the state of the art
in the day.

Note that semantic simplicity in the instruction set is sideways to the
CISC/RISC question, which has to do more with simplicity in the hardware
implementation. There have been semantically complex RISC designs and
semantically simple CISC designs, the latter including the Mill. Some of
the Mill operations perform very high level actions that would take
dozens of operations on a legacy instruction set. Yet these operations
map directly to the semantics of programming language primitives; "call
a function with arguments" for example, or "conditionally exit an inner
loop". More, these hardware operations are the only way to implement the
semantic primitive; you couldn't really compose a function call out of
other Mill operation if you wanted to, short of writing a simulator for
some other machine.

So perfect code is pretty much a given, and the compiler does almost no
optimization. We do want to know data lifetimes so we can do things like
overwrite stack frame locations with variable from disjoint blocks, and
we do want vectorization so we can use the SIMD hardware. But the former
has been in compilers since before 1960, and the latter dates to the
MassComp Fortran compilers of the 1970s, and both are routine in all
compilers these days.

Ivan

Joe keane

unread,

Apr 5, 2013, 8:39:44 PM4/5/13

to

In article <kjjkp4$h54$1...@needham.csi.cam.ac.uk>, <nm...@cam.ac.uk> wrote:
>Many architectures had (have?) instructions like decrement and branch
>if positive, which can be predicted perfectly.

x86 has both

MitchAlsup

unread,

Apr 6, 2013, 12:03:21 AM4/6/13

to

On Friday, April 5, 2013 4:15:00 PM UTC-5, Ivan Godard wrote:
> The two approaches are somewhat incompatible. Clearly a statically
> scheduled machine cannot use dynamic information other than by using a
> JIT to produce rewritten code at runtime.

disagree, one of the benefits of a BG OoO microarchitecture is that
almost all programes that are optimal when scheduled by a good
o[ptimizing compiler for a given microarchitecture can be run "pretty"
effiently" on a BGOoO. Can you do better, sure; but at a certrain point
most applications become "good enough".

For example, we did a BGOoO machine in the late 80s and early 90s and got*
about 2.2 Inst/cycle running GCC SPEC 89 with code scheduled for the Mc88100.
We got almost 5.99 Inst/cy running matrix 300.

(*) cycle accurat system simulator (including stuff like DRAM refresh cycles and pin turnarounds) running Mc88100 binaries unmodified.

All of this was done with 16KB direct mapped caches banked 4 ways and an
ECL logic level DIDO** buss architecture to circa 1990 DRAM technology.

(**) data-in,Data-out a bussing scheme by which the electrical wires are always run in the same direction. In the 88120 we had 4 busses, read-out (AO), write-out (DO), read-in (DI), Address-in(AI). This interface coult deliver vector data rates on Matrix 300. We planned on needing a double word in per cycle to keep the FP units "continuously occupied" and maintaining a corresponding "doubel word every other clock" write bandwidth.

After we got the conditional cache working well, we needed only 1/2 the planned read bandwidth.

The address in bus was for a protoal we were testing for "paranoid" users (like banks for example). Under this system the write buffer resource was released when the memory system send an ack that the data arrived without bit error. Comparing performance of the optimistic mode with paranoid mode we saw about a 2% overall degredation (given the excess BW of the aforementioned memory interconnect.)

> However, a dynamically
> scheduled machine cannot usually make much use of static scheduling
> either, except in carefully hand-tuned code.

This is the key point, given a static schedule that IS optimal on a smaller
less wide almost identical instruction set runs at least as well and often
better in the face of BGOoO. It still may not be as good as great code for the BGOoO microarchitecture. But BGOoOedness tends to be a great equalizer.

Mitch

Andy (Super) Glew

unread,

Apr 6, 2013, 12:26:37 AM4/6/13

to

My personal definition of "scoreboard" - by personal, I mean the
definition I carried in my head, and used to somewhat contrast with the
sort of out-of-order design that I have done - and I adm

Terje Mathisen

unread,

Apr 6, 2013, 10:15:38 AM4/6/13

to

:-)

On P6 the combination of OoO and a very fast L1 cache tended to speed up
compiled code to the point where it ran neck&neck with pretty good
hand-optimized Pentium-targeted asm.

A few extra spills to stack locations never caused any significant
increase in memory bus activity, so as long as the compiler managed to
keep the most critical items in registers (mostly those used for
addressing) the OoO hw would (almost/nearly) mask the fact that the
compiled code would have taken 1.5 to 2x as long as asm on the in-order P5.

Terje

William Clodius

unread,

Apr 6, 2013, 4:12:52 PM4/6/13

to

MitchAlsup <Mitch...@aol.com> wrote:

> On Friday, April 5, 2013 4:15:00 PM UTC-5, Ivan Godard wrote:
> > The two approaches are somewhat incompatible. Clearly a statically
> > scheduled machine cannot use dynamic information other than by using a
> > JIT to produce rewritten code at runtime.
>
> disagree, one of the benefits of a BG OoO microarchitecture is that
> almost all programes that are optimal when scheduled by a good
> o[ptimizing compiler for a given microarchitecture can be run "pretty"
> effiently" on a BGOoO. Can you do better, sure; but at a certrain point
> most applications become "good enough".

> <snip>

I am frequently acronym challenged. What does BG stand for?

MitchAlsup

unread,

Apr 6, 2013, 5:27:46 PM4/6/13

to

On Saturday, April 6, 2013 3:12:52 PM UTC-5, William Clodius wrote:
> What does BG stand for?

It was supposed to be GB OoO as in Great Big Out of Order.
I must have been having a dyslexic day at the keyboard.

Mitch

Terje Mathisen

unread,

Apr 7, 2013, 4:59:31 AM4/7/13

to

Heh, I read it as Big Great, which sort of emphasizes how 'great' OoO hw
is. :-)

Terje

MitchAlsup

unread,

Apr 7, 2013, 2:14:13 PM4/7/13

to

On Sunday, April 7, 2013 3:59:31 AM UTC-5, Terje Mathisen wrote:
> Heh, I read it as Big Great, which sort of emphasizes how 'great' OoO hw is.

Now that IS funny!

Mitch

Andy (Super) Glew

unread,

Apr 7, 2013, 4:10:44 PM4/7/13

to

I'm on a plane, and Thunderbird did not synchronize all of this thread,
so I don't know what a BGOoO is.

But: I think that an OOO machine *should* be able to take advantage
static scheduling.

The dirty little secret of OOO is that we are often not very much OOO at
all - and the places where we are are often artifacts like
compare-branch. The big benefit of OOO is memory and cache misses, but
a secondary benefit is not having to put to many things in so many
different registers - you can reuse the same logical name many times,
and OOO basically just uses it to connect producer to consumers in a
dataflow sense. It would be nice if, once having done that, we did not
ti even try to write the results to a register file.

Time and again, in a loop, the same instructions have the same schedule.

It is criminal, however - ok, it seems suboptimal - that the OOO
hardware should be recalculating the dataflow graph and the schedule
that repeats itself so many times. Surely we could calculate once and
for all, in the compiler (static scheduling). Or, calculate it in the
hardware, and reuse it (semi-static scheduling).

Current OOO hardware does the most general and expensive scheduling for
every instruction, eagerly. This is simple, because it is one mechanism
that works for everything. But, if this calculation is expensive, power
hungry...

Perhaps what we should be doing is building machines that

a) take a static or semi-static schedule

b) perform a simple, cheap, calculation to determine

b.1) if the static or semi-static schedule is correct - e.g. if it does
not try to start a dependent instruction before its inputs are ready

b.2) but also if it is excessively conservative - if there is excessive
slack - if the instruction could have started earlier. But also,
perhaps an indication that "Yes, the compiler knew this instruction
could have started earlier, but it was deliberately delayed for some
reason". Although such a reason is usually something like a funky
pipeline resource dependency - or, rather, a resource dependency that is
not uniformly pipelined, a resource that appears at different places in
different pipelines, like using the FP-adder at the last stage of an FP
divide,.

c) for that matter, verify dependencies between instructions. Perhaps
not interesting for dependencies through registers, but more interesting
if the dependency graph is computed, e.g. dependencies carried through
memory, or through conditional operators phi operators - where we are
trying to do better than depend on all inputs. (This amounts to
predicting CMOVs or predicates.)

d) if the verification of schedule and dependency passes, no need to do
anything else - go straight to execution. (Or, you may have executed
before or while verifying in which case no need to undo execution.)

d') but if the verification of schedule and dependency fails, use the
more powerful OOO dataflow scheduling mechanisms to make things better.
And optionally remember the improved schedule (semi-static).

Traditional code scheduling tries to arrange instructions so that they
don't stall. Independent operations are interleaved. The compiler tries
to deliver instructions "just in time".

But this is not necessarily the best thing to do. Perhaps it is better
to have dependent instructions arranged in chains, back to back. Saves
registers, or at least register names. I have spent a lot of time
designing schedulers for "strings" or "chains" of dependent
instructions. Perhaps the compiler should give us these chains,
and instead of trying to arrange independent instructions next to each
other, should tell us where the next independent chains begin.

I believe Mark Smotherman calls this a dependence architecture, rather
than the independence architecture that classic VLIW and RISC
superscalar scheduling for in-order architectures try to be.

The problem with such chains is cross-chain dependencies. Also, I gave
up stopped pursuing this at UWisc when it became evident that the
dependency chains would have to run through memory, requiring memory
renaming or STLF store to load forwarding prediction. on x86 and Alpha.

Sometimes I think that what we need is the smallest possible code size,
with an auxiliary datastructure indicating hints as to what the static
schedule should be. What instructions should execute when, which loads
may be dependent on which stores and which they are guaranteed
independent of. ("Static store/load coloring").

Some ISA proposals specify which instructions depend on an instruction,
rather than having such dependencies specified indirectly through registers.
http://semipublic.comp-arch.net/wiki/Instruction_Destination_Instructions_rather_than_Registers

Trouble is, while all of these things are neat, they just don't seem to
be efficient in terms of instruction encoding.

This might be an argument for caching such semi-static results, i.e.
storing them in a decoded and scheduled I$ rather than as the true ISA.
But even here the storage needs to be more efficient than recomputing
the dependency graph and schedule.

I suppose than one of the things I like about vectors and SIMD ISAs is
that the dependency graph and schedule is amortized over many copies of
the same operation.

--
The content of this message is my personal opinion only. Although I am e

Andy (Super) Glew

unread,

Apr 7, 2013, 4:26:34 PM4/7/13

to

On 4/5/2013 9:37 AM, Ivan Godard wrote:
> On 4/5/2013 12:08 AM, Andy (Super) Glew wrote:
>> On 4/4/2013 11:42 PM, Ivan Godard wrote:
>>> On 4/4/2013 9:47 PM, Andy (Super) Glew wrote:
>>>> On 4/4/2013 2:39 AM, nm...@cam.ac.uk wrote:
>>>>> In article <515D48A2...@SPAM.comp-arch.net>,
>>>>> Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>>>>>>> Ivan Godard <iv...@ootbcomp.com> wrote:
>>>>>>
>>> So what do you call us?
>>> Issue and dispatch the same thing, in-order, stalls if data not
>>> available (no hazard stalls; hazard not meaningful)
>>> Retires statically out of order.
>>> In-flights complete under stall or trap, result replay at unstall/return
>>> preserves retire order and timing (no re-execute)
>>>
>>> Perhaps "strange"? :-)
>>>
>>> Ivan
>>
>>
>> Not that strange.
>>

>> Wen-mei Hwu was my advisor, Prof. VLIW compiler. Your machine sounds

> For practical considerations, like dealing with interrupts, either way
> of doing precision is quite satisfactory. I just object to calling the
> Mill "imprecise", because that word has a bad rep from old machines in
> which interrupts left the machine in undefined state, which the Mill
> doesn't do at all.
>
> Ivan

The sort of things Wen-mei's IMPACT group looked at included

0) conventional "precise" exceptions - everything before the PC had
executed, nothing after

1) compiler driven "semiprecise" - everything before some commit PC had
completely executed, but stuff afterwards more or may not have executed.
The compiler would generate recovery code to clean it up to precise
state, just as the compiler has to generate similar recovery code when
you take a branch out of a scheduled instruction trace. The compiler
might know that certain registers may have been overwritten by the
semi-finished code, so it might have to recompute those from other
registers or memory.

2) hardware driven semi-precise. Typically, a program counter plus a
bitmask indicating which subsequent instructions had or had not executed.

Sometimes this bitmask was sufficient for recovery of PC precise state.
A trap handler would just look at the instructions, and emulate.

Sometimes compiler help as necessary, a combination or 1 and 2, because
some of the executed instructions may have overwritten state needed for
the recovery.

Larrabee's use of vector mask to indicate which parts of a gather load
have completed is a very simple example of this PC+bitmask precise opr
semiprecise recovery.

99) I suppose an even more aggressive version of semiprecise is ti
record the instuctions in flight in a window. Instruction bytes
(because even the instructions may have been overwritten), etc.

nm...@cam.ac.uk

unread,

Apr 7, 2013, 4:27:41 PM4/7/13

to

In article <5161D2C...@SPAM.comp-arch.net>,

Andy (Super) Glew <an...@SPAM.comp-arch.net> wrote:
>

>The dirty little secret of OOO is that we are often not very much OOO at
>all - and the places where we are are often artifacts like
>compare-branch. The big benefit of OOO is memory and cache misses, but
>a secondary benefit is not having to put to many things in so many
>different registers - you can reuse the same logical name many times,
>and OOO basically just uses it to connect producer to consumers in a
>dataflow sense. It would be nice if, once having done that, we did not
>ti even try to write the results to a register file.

Yes. That is one of the reasons that I think a transaction-style
ISA would be a good idea, provided that the transactions were used
to free up the hardware rather than constraint it.

Regards,
Nick Maclaren.

Ivan Godard

unread,

Apr 7, 2013, 5:05:31 PM4/7/13

to

On 4/7/2013 1:26 PM, Andy (Super) Glew wrote:

<snip>

>
> The sort of things Wen-mei's IMPACT group looked at included
>
> 0) conventional "precise" exceptions - everything before the PC had
> executed, nothing after

Let's call those "issue-precise".

>
> 1) compiler driven "semiprecise" - everything before some commit PC had
> completely executed, but stuff afterwards more or may not have executed.
> The compiler would generate recovery code to clean it up to precise
> state, just as the compiler has to generate similar recovery code when
> you take a branch out of a scheduled instruction trace. The compiler
> might know that certain registers may have been overwritten by the
> semi-finished code, so it might have to recompute those from other
> registers or memory.

Call those "compiler assisted issue-precise"

> 2) hardware driven semi-precise. Typically, a program counter plus a
> bitmask indicating which subsequent instructions had or had not executed.
>
> Sometimes this bitmask was sufficient for recovery of PC precise state.
> A trap handler would just look at the instructions, and emulate.
>
> Sometimes compiler help as necessary, a combination or 1 and 2, because
> some of the executed instructions may have overwritten state needed for
> the recovery.
>
> Larrabee's use of vector mask to indicate which parts of a gather load
> have completed is a very simple example of this PC+bitmask precise opr
> semiprecise recovery.

Call those "compiler assisted issue-precise with hardware hints"

> 99) I suppose an even more aggressive version of semiprecise is ti
> record the instuctions in flight in a window. Instruction bytes
> (because even the instructions may have been overwritten), etc.

100) "result-precise" :-)

Ivan Godard

unread,

Apr 7, 2013, 6:12:53 PM4/7/13

to

Exactly. Then why have registers?

> Time and again, in a loop, the same instructions have the same schedule.
>
> It is criminal, however - ok, it seems suboptimal - that the OOO
> hardware should be recalculating the dataflow graph and the schedule
> that repeats itself so many times. Surely we could calculate once and
> for all, in the compiler (static scheduling). Or, calculate it in the
> hardware, and reuse it (semi-static scheduling).
>
> Current OOO hardware does the most general and expensive scheduling for
> every instruction, eagerly. This is simple, because it is one mechanism
> that works for everything. But, if this calculation is expensive, power
> hungry...

Amen. Nancy Reagan was right: just say no. :-)

> Perhaps what we should be doing is building machines that
>
> a) take a static or semi-static schedule
>
> b) perform a simple, cheap, calculation to determine
>
> b.1) if the static or semi-static schedule is correct - e.g. if it does
> not try to start a dependent instruction before its inputs are ready

I suppose that a no-op counts as a simple, cheap calculation.

> b.2) but also if it is excessively conservative - if there is excessive
> slack - if the instruction could have started earlier. But also,
> perhaps an indication that "Yes, the compiler knew this instruction
> could have started earlier, but it was deliberately delayed for some
> reason". Although such a reason is usually something like a funky
> pipeline resource dependency - or, rather, a resource dependency that is
> not uniformly pipelined, a resource that appears at different places in
> different pipelines, like using the FP-adder at the last stage of an FP
> divide,.

Yep. But if the compiler knows the pipeline structure then it can avoid
all those hazards. Or that's what we rely on - there are a couple
hazards like that in a Mill (FMA is one, and FDIV would be another if we
did DIV in hardware).

And it's true that OOO can catch early-out cases (multiply by one for
example), but we've looked hard for early-outs that pay enough to be
measurable, and never found any.

> c) for that matter, verify dependencies between instructions. Perhaps
> not interesting for dependencies through registers, but more interesting
> if the dependency graph is computed, e.g. dependencies carried through
> memory, or through conditional operators phi operators - where we are
> trying to do better than depend on all inputs. (This amounts to
> predicting CMOVs or predicates.)

Tracking dependencies carried through memory leads you to a true
dataflow machine, Manchester MU-5 etc. I have long felt there was more
to be had with dataflow, but it need CPU-in-memory to be viable.

Phi ops (our "pick". We used to actually call it "phi" but changed when
presentations got everybody confused) don't seem to have enough
difference between the path lengths to be worth tracking, at least in
the codes we see. Rarely do you see an L2 miss on one leg and a constant
on the other, so we feel that it's better going wide and speculating
both legs and making each leg wait on on both. YMMV.

> d) if the verification of schedule and dependency passes, no need to do
> anything else - go straight to execution. (Or, you may have executed
> before or while verifying in which case no need to undo execution.)
>
> d') but if the verification of schedule and dependency fails, use the
> more powerful OOO dataflow scheduling mechanisms to make things better.
> And optionally remember the improved schedule (semi-static).

But then aren't you paying the power and area for that OOO even when it
isn't used?

> Traditional code scheduling tries to arrange instructions so that they
> don't stall. Independent operations are interleaved. The compiler tries
> to deliver instructions "just in time".
>
> But this is not necessarily the best thing to do. Perhaps it is better
> to have dependent instructions arranged in chains, back to back. Saves
> registers, or at least register names. I have spent a lot of time
> designing schedulers for "strings" or "chains" of dependent
> instructions. Perhaps the compiler should give us these chains,
> and instead of trying to arrange independent instructions next to each
> other, should tell us where the next independent chains begin.

This is just an encoding issue. The input to compiler scheduling is a
dependency forest with shared inner nodes, with nodes annotated with
latency. Somebody has to figure out which nodes can be done together. If
your encoding can give that tree representation to the hardware then the
hardware can do it, or the compiler can do it and give a more
conventional encoding to the hardware.

The code won't be a forest of disjoint trees, such that you can assure
independence by assigning an op to one tree or another and any given
tree can be executed in parallel with any other tree. Instead as you
move leafward in a tree you hit nodes that are shared by more than one
tree. Now you can cut the trees sharing a node at the shared node, and
let each consider that shared node a leaf, and add another tree that
computes the shared node, and that will give you a forest of isolated
trees. But now your tree height is too short to get much value from the
linear execution it promises.

Bill Wulf (old buddy of mine) did a machine that took advantage of the
shortness of the resulting isolates. Each instruction had two opcodes
that were implicitly connected dataflow style, with three inputs.
Essentially, an FMA except for permitting any pair of opcodes. Two was
all she wrote though, but his numbers showed that two was all there was
available in the code anyway, so it was no limitation.

Now you can get longer chains with speculation; welcome to trace caches.
But then you have to be prepared to undo it all. I may be more doubtful
about that than is warranted - I never worked on one - but I really
don't see any reason for one when you have a perfectly good compiler
just sitting there.

> I believe Mark Smotherman calls this a dependence architecture, rather
> than the independence architecture that classic VLIW and RISC
> superscalar scheduling for in-order architectures try to be.
>
> The problem with such chains is cross-chain dependencies. Also, I gave
> up stopped pursuing this at UWisc when it became evident that the
> dependency chains would have to run through memory, requiring memory
> renaming or STLF store to load forwarding prediction. on x86 and Alpha.

I should read ahead before writing a response :-(

> Sometimes I think that what we need is the smallest possible code size,
> with an auxiliary datastructure indicating hints as to what the static
> schedule should be. What instructions should execute when, which loads
> may be dependent on which stores and which they are guaranteed
> independent of. ("Static store/load coloring").

Loads are the only problem; everybody else the compiler can compact and
schedule as well as the hardware ever could.

> Some ISA proposals specify which instructions depend on an instruction,
> rather than having such dependencies specified indirectly through
> registers.
> http://semipublic.comp-arch.net/wiki/Instruction_Destination_Instructions_rather_than_Registers

Dataflow again. If you figure out how to do the matchbox then I'm with
you. A CAM with 100k or so entries isn't easy. Debugging is also an issue.

> Trouble is, while all of these things are neat, they just don't seem to
> be efficient in terms of instruction encoding.
>
> This might be an argument for caching such semi-static results, i.e.
> storing them in a decoded and scheduled I$ rather than as the true ISA.
> But even here the storage needs to be more efficient than recomputing
> the dependency graph and schedule.

It's an argument many have made and not a few have implemented. The
device used to compactly store the decoded and scheduled instructions
is called a "load module" :-)

Robert Wessel

unread,

Apr 8, 2013, 1:05:38 PM4/8/13

to

On Thu, 04 Apr 2013 13:35:41 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"> wrote:

>Ivan Godard wrote:
>> On 4/3/2013 9:16 PM, MitchAlsup wrote:
>>> On Wednesday, April 3, 2013 2:44:42 PM UTC-5, Terje Mathisen wrote:
>>>> I have of course used branchless code for things like max() and min(),
>>>
>>> Max, min, and saturate SHOULD be instructions these days.
>>>
>>
>> Mill does saturate, but it uses our zero-latency equivalent of
>> conditional move for max and min; the timing is the same as the
>> relational op. People have occasionally argued that max and min should
>> be native to avoid the relational result adding to operand clutter, but
>> have never yet convinced the rest that it was worth the ISA entropy.
>>
>>> But I think it would be highly instructive to compare the performance of
>>> branch predictors while running excursions through the OS and ignore the
>>> performance of the same predictors while running applications. The best
>>> performance on the excursions through the OS wins.
>>>
>>
>> I recall an estimate that 25% of OS cycles were memcpy and memset. Those
>> should predict pretty well :-)
>
>Way, way back when (1982?) I was told by the lecturer that later went on
>to start what became FAST, the search company, that IBM mainframes
>tended to spend up to 30% of their CPU inside the (COBOL) sorting routine.

Sort in general, not usually Cobol specific. And generally (not
always), Cobol internal sorts on that platform invoke the system sort
package anyway, so there's no real difference.

>He used this as the lead-in to presenting IBM's patended (but never
>produced) "zero-cycle" sorting chip:
>
>It was basically a regular array of storage and comparators that you
>could stream data into, then immediately afterwards stream sorted data
>out. The problem was of course that as a co-processor it would be
>limited by bus transfer rates, and the maximum amount you could handle
>was whatever would fit inside the chip.

Which doesn't quite seem worthy of a patent (although that doesn't
seem to be in the list of the PTOs requirements). Simply maintaining
a priority queue out to be sufficiently parallelizable to accomplish
that (pipeline the rotations while adding to, or removing from, the
heap as the operation moves down the heap). Although as you mention
this is pointless for any real datasets due to size constraints.
Especially in 1982.

You could repeatedly sort a dataset with something like this, by
filling the device, and then alternately removing the lowest item and
adding a new one, but you'd end up with something like a bubble sort.
Several passes with different step sizes might turn this into
something akin to a Shell sort.

FWIW, Z has had some ISA sorting assists for decades now.

nm...@cam.ac.uk

unread,

Apr 8, 2013, 1:19:40 PM4/8/13

to

In article <0it5m8lkrunu0hfmm...@4ax.com>,

Robert Wessel <robert...@yahoo.com> wrote:
>>
>>Way, way back when (1982?) I was told by the lecturer that later went on
>>to start what became FAST, the search company, that IBM mainframes
>>tended to spend up to 30% of their CPU inside the (COBOL) sorting routine.
>
>Sort in general, not usually Cobol specific. And generally (not
>always), Cobol internal sorts on that platform invoke the system sort
>package anyway, so there's no real difference.

And what most people miss (not present posters) is that such sorts
are nowadays almost always limited by memory bandwidth (sometimes
latency), and what the actual CPU core gets up to is irrelevant.
The same is true to a great extent of FFTs.

So branch prediction in such codes is relevant only insofar as it
helps with memory access - at least to a first approximation.

Regards,
Nick Maclaren.

Ivan Godard

unread,

Apr 8, 2013, 1:36:48 PM4/8/13

to

On 4/8/2013 10:05 AM, Robert Wessel wrote:

<snip>

>
> FWIW, Z has had some ISA sorting assists for decades now.
>

Z? Cite please?

Ivan

Ivan Godard

unread,

Apr 8, 2013, 2:05:46 PM4/8/13

to

Yes. In many applications even a mid-range Mill family member is
pin-limited in today's technology; from the beginning we were pretty
sure that would be the case.

However, when we started we decided that the trend in memory tech was
bigger, closer, and much lower latency, and that we should be prepared
for a return to memory access time to close the CPU cycle time, just
like the old days. A pretty daring idea then, but it does seem like we
nailed it. We expect on-chip memory in large quantities by the time we
get to do a high-end firehose.

The consequences to cache design, and memory hierarchy design in
general, are profound. Think of having all today's on-card memory as
on-chip L3. There will still be long latencies to deal with, but only
between chips, which means only in server and HPC. Treating that huge
memory as cache (so you can do shared memory among chips) looks very
expensive when it could be simple local memory instead, with message
passing at the board/box level.

We're not wed to this vision: we have caches and can do shared memory
with coherence. But we are also designing for a messaging world.

Ivan

nm...@cam.ac.uk

unread,

Apr 8, 2013, 2:13:17 PM4/8/13

to

In article <kjuv2l$evc$1...@dont-email.me>,

IBM System/Z, previously known as System/390. There are a few
fairly bonkers instructions to help DFSORT. Whether any other
program (including other sort packages) has ever found them
useful, I can't say.

Regards,
Nick Maclaren.

Paul A. Clayton

unread,

Apr 8, 2013, 2:36:39 PM4/8/13

to

I suspect Robert is referring to IBM's zSeries (a.k.a.
S/360 descendants), which can have zSeries Application
Assist Processors (zAAPs) and zSeries Integrated
Information Processors (zIIPs).

http://www-03.ibm.com/systems/z/hardware/features/ziip/gettingstarted/prereqs.html
states the following:
IBM DB2 for z/OS Version 8 or DB2 9 DB2 utilities
enables portions of sorting of fixed-length records
using IBM's memory object sorting technique to be
eligible for the zIIP.

and:
In addition, a portion of DB2 for z/OS V8 and DB2 9
DB2 utilities for sorting fixed-length records using
IBM's memory object sorting technique can be eligible
for zIIP.

Robert Wessel, being a regular provider of information
about IBM zSeries here (and at the Real World Technologies
forum), will probably provide more detailed information
shortly. But it looks like I can give a not-slow,
not-fast answer before he provides a good one.

Terje Mathisen

unread,

Apr 8, 2013, 2:37:51 PM4/8/13

to

OTOH, with a sufficiently wide parallel machine it makes sense to use
less efficient (measured as power or at least as executed
instructions/compares) algorithms which works on N items, doing N/2
parallel compares.

Robert Wessel

unread,

Apr 8, 2013, 6:04:45 PM4/8/13

to

On Mon, 08 Apr 2013 10:36:48 -0700, Ivan Godard <iv...@ootbcomp.com>
wrote:

The Update Tree (UPT) and Compare and Form Codeword (CFC) instructions
date back to the late eighties.

Robert Wessel

unread,

Apr 8, 2013, 6:26:30 PM4/8/13

to

The zIIP/zAAP thing is different. With a zIIP or zAAP, some eligible
code can run on one of those specialty processors*, which tend to be
significantly cheaper than the otherwise identical** CPs, which can be
used to run "normal" workloads. Various zOS workloads can run some of
their code on the zIIP or zAAP (various databases, sorts, the JVM, XML
processing, etc.) where cycles are cheaper (and often faster). From
an exploiting application's perspective, and somewhat oversimplified,
you just ask zOS to call an eligible routine for you, and the OS then
schedules that thread on an available specialty processor.

Other than that, there's nothing special about zIIPs or zAAPS in terms
of sort support, but as I mentioned in another post, there are some
instructions in the ISA (CFC and UPT) intended to help sort.

As you might expect, customer written code is not generally eligible
to run on zIIPs or zAAPs, but ISVs can, with IBM's blessing, make use
of those. A few years ago there was an ISV that tried selling a
package to allow customer written code to dispatch on specialty
processors, but that got stopped pretty quickly.

*There are other specialty processors, notably IFLs (for running
Linux) and ICFs (for running coupling facilities), but those exist
outside of zOS.

**There is an artificial restriction on running the traditional OSs on
specialty processors (basically there's a secret instruction that's
disabled on the specialty processors, and zIIPs and zAAPs refuse to
perform the IPL (boot) function). In the past zAAPs had restrictions
on doing I/O (they were intended to run JVM code), but zAAPs are being
folded into zIIPs, with more recent systems supporting the "zAAP on
zIIP" feature.

Ivan Godard

unread,

Apr 8, 2013, 10:10:22 PM4/8/13

to

"The Theory Behind the z/Architecture Sort Assist Instructions"

http://www.kcats.org/share/sort/slides/SortAst.pdf

Terje Mathisen

unread,

Apr 9, 2013, 3:35:54 AM4/9/13

to

Ivan Godard wrote:
>
> "The Theory Behind the z/Architecture Sort Assist Instructions"
>
> http://www.kcats.org/share/sort/slides/SortAst.pdf
>

Thanks Ivan!

That was an interesting piece of information, particularly what it shows
about what common IBM workloads must look like:

Lots and lots of fixed-length/fixed-record (text) sort keys, often more
or less sorted to begin with, very important to avoid worst-case
(O(n^2)) scenarios.

Using 2N nodes to sort N items does allow a lot more algorithms than
when you're forced to do all updates in place.

I do note that the binary three-based sorting algorithm they seem to
have settled upon is power-optimized, doing as few new comparisons as
possible after each insertion/extraction.

OTOH it really doesn't seem to even try to be cache-friendly, unless
they wrap this code in a chunking layer that works in L2/L3-sized blocks?

Terje

PS. A long time ago when I needed to sort semi-random record layouts,
with ascii/binary/fp keys in various locations in each record, I wrote a
reversible transformation stage that would start by extracting each key
field (in sort order) and modify it so that it could be sorted in byte
order, then appending any/all the remaining bytes from the input record.

This was an O(N) operation on both input and output, but it allowed the
O(n lg(n)) quick sort operations to use inline byte block compares, i.e.
something like memcmp().

The transformations I used were something like this afair:

Ascii -> Do-nothing
unsigned 16-bit/32-bit -> byte swapped
signed 16-bit/32-bit -> bias to make unsigned (add MININT) and byte swap
float/double -> flip sign bit and byte swap

nm...@cam.ac.uk

unread,

Apr 9, 2013, 5:53:36 AM4/9/13

to

In article <rvsc3a-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>

>Using 2N nodes to sort N items does allow a lot more algorithms than
>when you're forced to do all updates in place.

Even when you are, my experience is that a triple pass algorithm
Second pass: convert to a lookup. Third pass: rearrange into order.
It's a lot more cache friendly.

>PS. A long time ago when I needed to sort semi-random record layouts,
>with ascii/binary/fp keys in various locations in each record, I wrote a
>reversible transformation stage that would start by extracting each key
>field (in sort order) and modify it so that it could be sorted in byte
>order, then appending any/all the remaining bytes from the input record.

Actually, with many formats, you could do a lot of compression on
the keys, which would further improve cache friendliness.

Even at an algorithmic level, the naive compsci analyses aren't
realistic.

Regards,
Nick Maclaren.

Terje Mathisen

unread,

Apr 9, 2013, 7:39:53 AM4/9/13

to

nm...@cam.ac.uk wrote:
> In article <rvsc3a-...@ntp-sure.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>
>> Using 2N nodes to sort N items does allow a lot more algorithms than
>> when you're forced to do all updates in place.
>
> Even when you are, my experience is that a triple pass algorithm
> Second pass: convert to a lookup. Third pass: rearrange into order.
> It's a lot more cache friendly.

Did you (mistakenly) delete a line or two there?

What's the first pass?

>
>> PS. A long time ago when I needed to sort semi-random record layouts,
>> with ascii/binary/fp keys in various locations in each record, I wrote a
>> reversible transformation stage that would start by extracting each key
>> field (in sort order) and modify it so that it could be sorted in byte
>> order, then appending any/all the remaining bytes from the input record.
>
> Actually, with many formats, you could do a lot of compression on
> the keys, which would further improve cache friendliness.

As in sorting ascii fields by extracting a short dictionary of common
substrings, then replace the original strings with pointers into the
dictionary?

>
> Even at an algorithmic level, the naive compsci analyses aren't
> realistic.

Working with blocks of some size (i.e. at least a cache line worth of
data) seems to be a good idea.

Many of the algorithms that make sense today had parallels in the
classic database optimization problems where you tried to minimize the
number of disk blocks needed to locate & access any given record.

nm...@cam.ac.uk

unread,

Apr 9, 2013, 8:00:30 AM4/9/13

to

In article <a9bd3a-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>>
>>> Using 2N nodes to sort N items does allow a lot more algorithms than
>>> when you're forced to do all updates in place.
>>
>> Even when you are, my experience is that a triple pass algorithm
>> Second pass: convert to a lookup. Third pass: rearrange into order.
>> It's a lot more cache friendly.
>
>Did you (mistakenly) delete a line or two there?

Oops, yes. That should have read:

First pass: sort a (contiguous) vector of indices or pointers.

Second pass: convert to a lookup. Third pass: rearrange into order.

>> Actually, with many formats, you could do a lot of compression on
>> the keys, which would further improve cache friendliness.
>
>As in sorting ascii fields by extracting a short dictionary of common
>substrings, then replace the original strings with pointers into the
>dictionary?

I was thinking more of when you know that subfield A is integers in
the range 0-100, or whatever. But your approach works, too, even
for numerics - though usually in the form that you use a compressed
form and have a dictionary of exceptions. Anyway, whatever works.

>> Even at an algorithmic level, the naive compsci analyses aren't
>> realistic.
>
>Working with blocks of some size (i.e. at least a cache line worth of
>data) seems to be a good idea.
>
>Many of the algorithms that make sense today had parallels in the
>classic database optimization problems where you tried to minimize the
>number of disk blocks needed to locate & access any given record.

Yes. Hence my dislike of red-black trees. Blocked B-trees are
a far better idea.

Regards,
Nick Maclaren.

Paul A. Clayton

unread,

Apr 9, 2013, 10:29:58 AM4/9/13

to

On Apr 8, 6:26 pm, Robert Wessel <robertwess...@yahoo.com> wrote:
> On Mon, 8 Apr 2013 11:36:39 -0700 (PDT), "Paul A. Clayton"

> <paaronclay...@gmail.com> wrote:
[snip]

> >Robert Wessel, being a regular provider of information
> >about IBM zSeries here (and at the Real World Technologies
> >forum), will probably provide more detailed information
> >shortly. But it looks like I can give a not-slow,
> >not-fast answer before he provides a good one.

[snip explanation of zIIP/zAAP]

Thanks for the correction and substantial explanation.
(I did write that my "answer" was half-fast. :-) Thank
you also for 'redeeming my word' that you would provide
a good answer (even if Nick Maclaren responded correctly
even before I responded [mostly--I did get the right
Architecture!] incorrectly).

It is interesting how software licensing constraints
can influence the design and use of computer systems.

christian.bau

unread,

Apr 10, 2013, 12:51:36 PM4/10/13

to

On Apr 4, 9:15 pm, Ivan Godard <i...@ootbcomp.com> wrote:

> We've been back and forth on the subject of predicting loops. It's clear
> that a decrement-and-branch-if-zero or similar doesn't work for us; it
> doesn't encode well, and offers no gain over the equivalent primitives.

There was one variant of the PowerPC where decrement-and-branch-if-
zero wasn't predicted. "Move to loop count register" had a rather high
latency, but after that the dcbz was executed either like an
unconditional branch, or like no branch on the last iteration. I think
it was a bit later than PPC 7400.

I suppose you could do that with other instructions. If you detect a
conditional branch backwards based on a comparison between a register
that has just been incremented / decremented and another register or
constant, you go into "loop detection" mode. In that mode you check
that there are no other changes to the two registers, you also
calculate in parallel how many iterations there would be if it was
indeed a simple loop. If you get back to the same conditional branch
and you have detected a simple loop, you switch to "loop" mode where
the branch isn't predicted but is actually unconditional or not taken.

Robert Wessel

unread,

Apr 11, 2013, 12:20:40 AM4/11/13

to

On Mon, 08 Apr 2013 19:10:22 -0700, Ivan Godard <iv...@ootbcomp.com>
wrote:

Interesting. Can't say I fully understand all the nuances now, but
the puzzle of UPT and CFC has been significantly reduced.

Still I wonder if those make still sense today - memory performance is
(obviously) quite different now than in 1988.

Terje Mathisen

unread,

Apr 11, 2013, 1:31:31 AM4/11/13

to

Robert Wessel wrote:
> On Mon, 08 Apr 2013 19:10:22 -0700, Ivan Godard <iv...@ootbcomp.com>

>> "The Theory Behind the z/Architecture Sort Assist Instructions"
>>
>> http://www.kcats.org/share/sort/slides/SortAst.pdf
>
>
> Interesting. Can't say I fully understand all the nuances now, but
> the puzzle of UPT and CFC has been significantly reduced.
>
> Still I wonder if those make still sense today - memory performance is
> (obviously) quite different now than in 1988.

That's one of the things we've discussed in another thread: Using a pure
binary tree structure doesn't make quite as much sense in a multi-level
cache environment, where the minimum possible bus transfer is a cache line.

nm...@cam.ac.uk

unread,

Apr 11, 2013, 3:26:52 AM4/11/13

to

In article <jeuh3a-...@ntp-sure.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Robert Wessel wrote:
>> On Mon, 08 Apr 2013 19:10:22 -0700, Ivan Godard <iv...@ootbcomp.com>
>>> "The Theory Behind the z/Architecture Sort Assist Instructions"
>>>
>>> http://www.kcats.org/share/sort/slides/SortAst.pdf
>>
>> Interesting. Can't say I fully understand all the nuances now, but
>> the puzzle of UPT and CFC has been significantly reduced.
>>
>> Still I wonder if those make still sense today - memory performance is
>> (obviously) quite different now than in 1988.
>
>That's one of the things we've discussed in another thread: Using a pure
>binary tree structure doesn't make quite as much sense in a multi-level
>cache environment, where the minimum possible bus transfer is a cache line.

They were marginal then - it was one of the sorties in the DFSORT
versus Syncsort wars - I got the conflicting claims from both
camps of salesdroids, and binned both without reading them.

Regards,
Nick Maclaren.