
Alpha performance, why?


Brian Catlin

Feb 4, 2003, 11:36:35 PM
I've been trying to figure out why the Alpha is so fast. From what I can tell:
1. The instruction set architecture isn't demonstrably better than most other
RISC processors
2. The out-of-order execution isn't unique
3. It doesn't contain a huge number of gates
4. The pipeline isn't very deep
5. The process technology isn't very remarkable

So, where does the performance come from? Why haven't Alpha's competitors been
able to reproduce Digital's astounding success? I assume that the Alpha
architects and engineers have been writing papers (any links to references would
be appreciated) about what they did.


Also, what sort of mistakes did they make? I know about:
1. Not supporting byte-operations initially
2. I heard about some sort of memory synchronization issue

In other words, how could it be improved?

-Brian

Bill Todd

Feb 5, 2003, 1:07:20 AM

"Brian Catlin" <bri...@sannas.org> wrote in message
news:b1q48r$ta2$1...@slb9.atl.mindspring.net...

...

> In other words, how could it be improved?

By far the most critical improvement that could have been made would have
been for it to have had an owner interested in capitalizing on its
technology rather than pursuing the Windows (and later Wintel) rose-colored
vision of the future. As for the technical reasons for Alpha's superiority,
Paul DeMone has suggested that it was a thorough, end-to-end (ISA to
silicon) design approach that paid off.

- bill

Terje Mathisen

Feb 5, 2003, 1:58:42 AM
Brian Catlin wrote:
> I've been trying to figure out why the Alpha is so fast. From what I can tell:
> 1. The instruction set architecture isn't demonstrably better than most other
> RISC processors

It _is_ somewhat better though, in that it has fewer hard-to-make-fast
features.

> 2. The out-of-order execution isn't unique

It's pretty good, but still a late addition to the architecture.

> 3. It doesn't contain a huge number of gates
> 4. The pipeline isn't very deep

This is _good_, not bad!

> 5. The process technology isn't very remarkable
>
> So, where does the performance come from? Why haven't Alpha's competitors been

IMHO, synergy between multiple things done (at least nearly) right.

I.e. on current cpus, the effective speed is much less dependent upon
the maximum rate/throughput of any given unit than on the number of
speedbumps.

"My cpu can wait faster than your cpu!"

is often heard (at least implicitly). :-(

It seems to me that Alpha also suffered, at least to some extent, from
this syndrome, in that it could have a somewhat larger gap between peak
performance and what customers actually got on production codes.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl

Feb 5, 2003, 3:38:08 AM
"Brian Catlin" <bri...@sannas.org> writes:
>1. The instruction set architecture isn't demonstrably better than most other
>RISC processors

No, it is. It does not have branch delay slots (MIPS, SPARC, HPPA),
dedicated registers (hilo on MIPS, condition codes on SPARC, HPPA, ctr
and lr on PPC), or autoincrement addressing (HPPA, PPC).

This simplifies the implementation. The effect on performance is
indirect. Less design effort to get the implementation working leaves
more time for performance enhancements. Also, many performance
enhancements are easier to implement and verify, so more of them can
be done with the same effort.

>Also, what sort of mistakes did they make? I know about:
>1. Not supporting byte-operations initially

Was this really a mistake, performancewise? Supposedly the reason for
adding them was WNT device drivers (although, given the weaker
memory ordering semantics of Alpha programs, I wonder if this really
increased the number of device drivers working on Alphas by a lot).
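
(For anyone who never fought with it: before the BWX extension, a byte
load was an LDQ_U/EXTBL pair. A minimal sketch in C of what that pair
computes on a little-endian Alpha - the function is illustrative, the
mnemonics are real:

    #include <stdint.h>

    uint8_t load_byte(uintptr_t a) {
        /* LDQ_U: fetch the aligned quadword containing address a */
        uint64_t q = *(const uint64_t *)(a & ~(uintptr_t)7);
        /* EXTBL: shift the wanted byte down to bit 0 */
        return (uint8_t)(q >> ((a & 7) * 8));
    }

A byte store was worse - a four-instruction read-modify-write of the
containing quadword - which is also why concurrent byte-granularity
stores were awkward.)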

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Dan Pop

Feb 5, 2003, 4:34:38 AM
In <b1q48r$ta2$1...@slb9.atl.mindspring.net> "Brian Catlin" <bri...@sannas.org> writes:

>I've been trying to figure out why the Alpha is so fast.

Most likely, because it was the last major RISC architecture designed,
so its designers could learn from the experience of the other RISC
architectures preceding it. The same fact also explains its commercial
failure: when it finally arrived, the market was already taken by the
other architectures and few people want to switch to a new architecture
simply because it is *somewhat* faster.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan...@ifh.de

Ketil Malde

Feb 5, 2003, 5:04:35 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

>> Also, what sort of mistakes did they make? I know about:
>> 1. Not supporting byte-operations initially

> Was this really a mistake, performancewise?

For some applications, like accessing an 8-bit frame buffer?

I seem to remember Doom on Alpha Linux, which normally would benefit
from using DGA (Direct Graphics Access, IIRC), not getting any
advantage from it on Alpha for this reason.

(It's been a while, though)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

TOUATI Sid

Feb 5, 2003, 9:25:31 AM
A very important thing that you forget is the efficiency of the
compiler. The Alpha compiler team generally does a good job. I tested their
compiler, and it seems to be cleverer than others.

SAAT

Nick Maclaren

Feb 5, 2003, 9:29:53 AM

Hmm. The latter was certainly not true in any of the areas I am
close to, but that is partially because they tend to rely on
portable applications. The causes were different.

I don't think that the former was true, either; the performance was
probably because it was designed with speed as the primary objective,
rather than with compactness or "RISC purity" as objectives. Also, as I
have posted before, good performance comes from good design and not
the use of any particular gimmick.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Nick Maclaren

Feb 5, 2003, 9:31:18 AM

In article <egznpb9...@sefirot.ii.uib.no>,

"Ketil Malde" <ket...@ii.uib.no> writes:
|> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
|>
|> >> Also, what sort of mistakes did they make? I know about:
|> >> 1. Not supporting byte-operations initially
|>
|> > Was this really a mistake, performancewise?
|>
|> For some applications, like accessing an 8-bit frame buffer?

More importantly, it is needed to support the POSIX and OpenMP
models. Performance is not the only requirement for such
operations.

Peter Boyle

Feb 5, 2003, 10:20:32 AM

On Wed, 5 Feb 2003, Terje Mathisen wrote:

> Brian Catlin wrote:
> > I've been trying to figure out why the Alpha is so fast. From what I can tell:
> > 1. The instruction set architecture isn't demonstrably better than most other
> > RISC processors
>
> It _is_ somewhat better though, in that it has fewer hard-to-make-fast
> features.
>

[snip]

> IMHO, synergy between multiple things done (at least nearly) right.

Yup.
A number of other things tend to get neglected, such as the memory
system, which has typically been a number of years ahead of everyone
else's. As evidenced by:

Integrated L2 on 21164.

Bigger cache lines (64 byte) and more outstanding off-chip references (8)
than everything else of the same timeframe, for the 21264.
Also the point-to-point system bus, subsequently adopted by AMD.
[IMHO, the 21264's gains in
external memory system performance may have added more to the specFP than
the OoO execution - OoO makes more of a difference when the compiler
is bad and doesn't software pipeline, and the 21164 from L3 (off-chip)
cache near as dammit matches the 21264 from L2 (off-chip) cache on my
codes]

If 21364 had been on time, it would have been the first with onchip
memory controllers. As it is, 21364 will still be extraordinarily
aggressive.

> I.e. on current cpus, the effective speed is much less dependent upon
> the maximum rate/throughput of any given unit than on the number of
> speedbumps.

Exactly. As evidenced by the 21264's net decrease in peak FP (clocks got
slower) vs the 21164, accompanied by a net 2x increase in sustained
performance.

> It seems to me that Alpha also suffered, at least to some extent, from
> this syndrome, in that it could have a somewhat larger gap between peak
> performance and what customers actually got on production codes.

On 21064 and 21164 yes. Nowadays, I doubt it. Compare "peak
to bandwidth" and "peak to specFP" ratios of either 21264/21364 with
Itanium2 and power4+.

Peter



Peter Boyle pbo...@physics.gla.ac.uk

Bill Todd

Feb 5, 2003, 6:43:24 PM

"Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
news:Pine.GSO.4.33.030205...@holyrood.ed.ac.uk...

>
> On Wed, 5 Feb 2003, Terje Mathisen wrote:

...

> > It seems to me that Alpha also suffered, at least to some extent, from
> > this syndrome, in that it could have a somewhat larger gap between peak
> > performance and what customers actually got on production codes.
>
> On 21064 and 21164 yes. Nowadays, I doubt it. Compare "peak
> to bandwidth" and "peak to specFP" ratios of either 21264/21364 with
> Itanium2 and power4+.

Well, the 21264 recently got a lot peakier on SPECfp than it had previously
been, due to more aggressive compiler (feedback) optimization techniques,
and the 21364 seems somewhat similar in that regard. Certainly peakier than
POWER4, IIRC (can't compare it with Itanic, because they don't publish peak
values - I suspect because they already apply about all the possible
optimization to the base score, by dint of SPEC-sensitive compiler switch
design).

- bill

Peter da Silva

Feb 5, 2003, 7:53:55 PM
In article <b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> > 4. The pipeline isn't very deep

> This is _good_, not bad!

Heh.

When we got the Itanium Presentation from DEC, they went on about the
cool long pipeline. They couldn't explain quite why it was a good thing,
but they sure thought it was cool.

Long pipelines seem to be an Intel trademark mistake. Look at XScale. :)

--
I've seen things you people can't imagine. Chimneysweeps on fire over the roofs
of London. I've watched kite-strings glitter in the sun at Hyde Park Gate. All
these things will be lost in time, like chalk-paintings in the rain. `-_-'
Time for your nap. | Peter da Silva | Har du kramat din varg, idag? 'U`

Jan C. Vorbrüggen

Feb 6, 2003, 5:32:55 AM
> If 21364 had been on time, it would have been the first with onchip
> memory controllers.

Ah yes - only about fifteen years or so after the T414 became available.

Of course, those memory controllers do cache coherence, which the T414 et al.
didn't. But then, they didn't need to. And they even already had the serial
line comms the 21364 is so proud of integrated on the chip...

Jan

Rupert Pigott

Feb 6, 2003, 8:01:58 AM
"Jan C. Vorbrüggen" <jvorbr...@mediasec.de> wrote in message
news:3E4239D7...@mediasec.de...

I always wanted a high-end transputer. :)

Cheers,
Rupert


Peter Boyle

Feb 6, 2003, 9:47:29 AM

True, I should have said the first of the current rash of "mainstream"
processors with such integration, i.e. Hammer, Power4, MIPS, and others
in the pipe.

Peter
> Jan
>

Peter Boyle

Feb 6, 2003, 10:06:33 AM

On Wed, 5 Feb 2003, Bill Todd wrote:

> "Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
> news:Pine.GSO.4.33.030205...@holyrood.ed.ac.uk...
> >
> > On Wed, 5 Feb 2003, Terje Mathisen wrote:
>
> ...
>
> > > It seems to me that Alpha also suffered, at least to some extent, from
> > > this syndrome, in that it could have a somewhat larger gap between peak
> > > performance and what customers actually got on production codes.
> >
> > On 21064 and 21164 yes. Nowadays, I doubt it. Compare "peak
> > to bandwidth" and "peak to specFP" ratios of either 21264/21364 with
> > Itanium2 and power4+.
>
> Well, the 21264 recently got a lot peakier on SPECfp than it had previously
> been, due to more aggressive compiler (feedback) optimization techniques,

I don't mean "peak" as in SPECfp peak vs. base.

I mean "peak" as in the hard theoretical bound on number of flops per
second if you use every functional unit it's got on every single cycle.
A 1GHz 21364 has a 2 Gflop peak, while an identically clocked Power4 has
a 4 Gflop peak, and an Itanium has an 8 Gflop peak (AFAIR in single?).
The 21364 is fastest on SPECfp peak as far as I recall. So the
delivered/peak ratio obtained by the 21364 is significantly better for
Alpha than for the competition.
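
(Back-of-the-envelope, assuming one FP add pipe and one FP multiply pipe
on the 21364, and two fused multiply-add units on Power4:

    21364:  1 GHz * (1 add + 1 mul)     = 2 Gflop/s peak
    Power4: 1 GHz * 2 FMA * 2 flops/FMA = 4 Gflop/s peak

The Itanium figure of 8 would presumably have to count packed
single-precision operations the same way.)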

BTW, my conclusion is that the 21364 floating point is somewhat anemic for
their wonderful memory system - which is crying out for EV8 (RIP).

I have a real application that sustains >50% on EV6 let alone EV7 in
single precision, and key assembler FP kernels that run at ~80% _from
out of cache_. For this (very tuned) app EV7 adds relatively little over
EV6.

Peter

> and the 21364 seems somewhat similar in that regard. Certainly peakier
> than POWER4, IIRC (can't compare it with Itanic, because they don't
> publish peak values - I suspect because they already apply about all the possible
> optimization to the base score, by dint of SPEC-sensitive compiler switch
> design).
>
> - bill

Peter Boyle pbo...@physics.gla.ac.uk

Peter L. Montgomery

Feb 6, 2003, 2:03:21 PM
In article <b1q48r$ta2$1...@slb9.atl.mindspring.net>
"Brian Catlin" <bri...@sannas.org> writes:

>Also, what sort of mistakes did they make? I know about:
>1. Not supporting byte-operations initially
>2. I heard about some sort of memory synchronization issue
>
>In other words, how could it be improved?
>
> -Brian


The Alpha had CMPLE and CMPLT instructions.
Either CMPLE should have been CMPGE, or CMPLT should have been CMPGT.

CMPLE(x, y) returned 1 if x <= y and 0 if x > y.
CMPLT(x, y) returned 1 if x < y and 0 if x >= y.
The first operand to CMPLE or CMPLT must be an integer register.
The second operand can be an integer register or an immediate constant.

If r1 and r2 are registers, we can test whether r1 >= r2 by rewriting
this as r2 <= r1 and using CMPLE(r2, r1).
There is no such way to rewrite r1 >= 5, whereas we
can test r1 <= 5 as CMPLE(r1, 5) or CMPLT(r1, 6).

[BTW, I would prefer the compares to return -1 or 0
rather than 1 or 0, but that is orthogonal to this point.]
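
(To make the asymmetry concrete, a sketch in C of what a compiler has to
emit; the Alpha mnemonics in the comments are real, the register choice
is illustrative:

    long le_reg(long r1, long r2) { return r1 <= r2; } /* cmple r1, r2, v0 */
    long ge_reg(long r1, long r2) { return r1 >= r2; } /* cmple r2, r1, v0 */
    long le_imm(long r1)          { return r1 <= 5;  } /* cmple r1, 5, v0  */
    long ge_imm(long r1)          { return r1 >= 5;  } /* cmplt r1, 5, v0  */
                                                       /* xor   v0, 1, v0  */

Only the last case needs a second instruction, because the operand-swap
trick cannot move an immediate into the first operand slot.)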

The scaled add and subtract allowed one to compute
4*x +- y and 8*x +- y. I would prefer the subtracts
give -4*x + y and -8*x + y. This would allow subtraction
from a pointer (or negative indexing)
when sizeof(data type) = 4 or 8.
More generally, the instructions would then give
+-(4*x or 8*x) + (y or immediate).

Overall, the Alpha designers did very well.
Few architectures let us compute both the upper
half and lower half of an integer product,
in non-dedicated integer registers.

--
A local drug store selling wine boasts a drug and alcohol free workplace.
A local grocery store advertises Hot Buys even on frozen foods.
Peter-Lawren...@cwi.nl Home: San Rafael, California
Microsoft Research and CWI

Peter Boyle

Feb 6, 2003, 2:19:52 PM

I've also received a correction from Rick Jones at HP.
There was a PA 7100LC ("low cost") in the early-mid 90's.
No doubt there are others, too!
Peter

> Peter
> > Jan

Peter Boyle pbo...@physics.gla.ac.uk

Eric Smith

Feb 6, 2003, 8:18:30 PM
Peter Boyle <pbo...@holyrood.ed.ac.uk> writes:
> If 21364 had been on time, it would have been the first with onchip
> memory controllers. As it is, 21364 will still be extraordinarily
> aggressive.

The first *what* with onchip memory controllers? It wouldn't have even
been the first Alpha microprocessor with onchip memory controllers, since
that was the 21066/21068.

Will R

Feb 8, 2003, 9:24:55 PM
> Overall, the Alpha designers did very well.
>Few architectures let us compute both the upper
>half and lower half of an integer product,
>in non-dedicated integer registers.

I'm not smart enough to know exactly why that's really important, but IMHO, the
reason Alpha is fast isn't because of any one thing done right, but rather the
fact that they didn't do anything horribly wrong.
For example, when folks talk about IA64, they have tons of complaints. When
they talk about MIPS, they say, "nice architecture, but it doesn't clock high
enough." P4's are, "pretty fast, but such a long pipeline." G4's are, "They
cripple an adequate CPU with a 100 MHz bus." With Athlon's, it's, "decent
chip, a shame it's x86."

Alpha may not have done anything stunningly well. The OoO isn't unique, or
stunningly better than similar architectures, etc. The DEC guys just had the
stunning genius to not screw up anything very badly.

Really, the only thing people say about Alpha is, "Damned sexy. Shame they
didn't market it." I don't know why a dead, undermarketed CPU has such a
fanatical following, but it does... I want one.
------------------
Woooogy
I have to go back in time to pretend to be myself when I tell myself to tell
myself, because I don't remember having been told by myself to tell myself. I
love temporal mechanics.

Mike Nguyen

Feb 8, 2003, 11:13:51 PM
In article <20030208212455...@mb-cr.aol.com>, Will R wrote:
> Really, the only thing people say about Alpha is, "Damned sexy. Shame they
> didn't market it." I don't know why a dead, undermarketed CPU has such a
> fanatical following, but it does... I want one.

The first law of comp.arch: eventually, all threads talk about the Alpha or
DEC.

Mike.

Peter da Silva

Feb 9, 2003, 12:43:07 AM
In article <3Ak1a.4028552$6N5.5...@post-03.news.easynews.com>,

Well, particularly if the thread started out being about Alpha and DEC.

Niels Jørgen Kruse

Feb 9, 2003, 7:21:23 AM
In the article <H9wI9...@cwi.nl>, "Peter L. Montgomery"
<Peter-Lawren...@cwi.nl> wrote:

> The Alpha had CMPLE and CMPLT instructions.
> Either CMPLE should have been CMPGE, or CMPLT should have been CMPGT.
>
> CMPLE(x, y) returned 1 if x <= y and 0 if x > y.
> CMPLT(x, y) returned 1 if x < y and 0 if x >= y.
> The first operand to CMPLE or CMPLT must be an integer register.
> The second operand can be an integer register or an immedate constant.
>
> If r1 and r2 are registers, we can test whether r1 >= r2 by rewriting
> this as r2 <= r1 and using CMPLE(r2, r1).
> There is no such way to rewrite r1 >= 5, whereas we
> can test r1 <= 5 as CMPLE(r1, 5) or CMPLT(r1, 6).

You could test r1 < 5 and reverse the meaning of 0, 1 downstream. The
inversion could be folded away in many cases.
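
(Concretely, when the compare feeds a branch the inversion is free:

    if (r1 >= 5) goto L    becomes    cmplt r1, 5, t
                                      beq   t, L

i.e. branching on t == 0 with BEQ instead of BNE absorbs the XOR.)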

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Jouni Osmala

Feb 9, 2003, 10:02:40 AM
> Really, the only thing people say about Alpha is, "Damned sexy. Shame they
> didn't market it." I don't know why a dead, undermarketed CPU has such a
> fanatical following, but it does... I want one.
> ------------------

Oh well. I found a retail site with Google that sells EV56 systems for
$599 (crippled...), and an XP1000 EV6 for $2599 (500 MHz + 1 GB RAM +
32 MB gfx + 9 GB SCSI; upgrade to 600 MHz for $200).

Also from some other sites with reasonably priced systems...

I think that may be reachable for most. Unfortunately, those systems
are probably surpassed by today's PCs with 4 times the clock speed.

And if you want a real performance Alpha, you either need to spend a
LOT, or you could just get a faster dual/quad Xeon system for less...

Or if you wanted just a CPU for a showcase, you can find one on eBay...

Jouni Osmala
Helsinki University of Technology

Mike Nguyen

Feb 9, 2003, 12:40:59 PM
In article <b24ppb$h...@web.eng.baileynm.com>, Peter da Silva wrote:
> Well, particularly if the thread started out being about Alpha and DEC.

True, but I've been lurking for years, and _ALL_ threads seem to end up
talking about Alpha and DEC sooner or later :-)

Mike.

Will R

Feb 9, 2003, 8:32:43 PM
>
>The first law of comp.arch: eventually, all threads talk about the Alpha or
>DEC.
>
>Mike.
>

Well, when the thread has "Alpha" in the subject line, I think it's to be
expected! :)

Peter da Silva

Feb 9, 2003, 10:21:44 PM
In article <Low1a.4059956$6N5.5...@post-03.news.easynews.com>,

You haven't been lurking long enough if you don't remember them all turning
into discussions about the Amiga.

gokrix

Feb 10, 2003, 12:27:09 PM
pe...@abbnm.com (Peter da Silva) wrote in message news:<b275s8$1...@web.eng.baileynm.com>...

> In article <Low1a.4059956$6N5.5...@post-03.news.easynews.com>,
> Mike Nguyen <nguye...@mac.com> wrote:
> > In article <b24ppb$h...@web.eng.baileynm.com>, Peter da Silva wrote:
> > > Well, particularly if the thread started out being about Alpha and DEC.
>
> > True, but I've been lurking for years, and _ALL_ threads seem to end up
> > talking about Alpha and DEC sooner or later :-)
>
> You haven't been lurking long enough if you don't remember them all turning
> into discussions about the Amiga.

These days there is another candidate: Gigabit ethernet and its
merits/demerits as a cluster interconnect.

--GS

M. Ranjit Mathews

Feb 11, 2003, 1:13:06 AM

How about Gigabit interconnected in a Myrinet fashion?

George Coulouris

Feb 11, 2003, 9:07:58 AM

I think Emulex (nee Giganet) was trying that (roughly) with VI over
IP, but I believe that product has been EOLd.

I heard some talk at SC2002 about using some particular gigabit card
with IP offload that got latency down to 10us, but I don't know if
they support fat tree topologies, and the name of the particular
vendor escapes me.

--
George Coulouris
lastname at nih dot gov
not speaking for ncbi

Greg Lindahl

Feb 11, 2003, 1:31:13 PM
In article <3E4889CD...@yahoo.com>,

M. Ranjit Mathews <ranjit_...@yahoo.com> wrote:

>> These days there is another candidate: Gigabit ethernet and its
>> merits/demerits as a cluster interconnect.
>
>How about Gigabit interconnected in a Myrinet fashion?

1. Unless a vendor has extended things, gigabit only plugs together in
very particular ways, nowhere near as nice as arbitrary routing. I do know
of at least one gigE switch chip vendor which allows more interesting
routing, but it remains to be seen if end-users can buy a huge gigE
switch for reasonable $$.

2. All gigabit switches are store-and-forward, which means you have 15
usec of latency per switch for a 1500 byte packet (see the quick
arithmetic after this list). If you have 10 gig hops in the middle of
your network they are only 1.5 usec each, but still...

3. Gigabit cards and chipsets mutate rather rapidly, which means you'll
be constantly rewriting your low-latency driver.
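
(Quick arithmetic on point 2, assuming plain gigabit line rate and
ignoring lookup overhead: a store-and-forward switch must clock the whole
packet in before it can start sending it on, so

    1500 bytes * 8 bits/byte = 12000 bits
    12000 bits / 1 Gbit/s    = 12 usec

per hop just for serialization; the quoted 15 usec is that plus a few
usec of lookup/queueing. At 10 Gbit/s the same packet serializes in
1.2 usec, which is where the ~1.5 usec per 10 gig hop comes from.)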

-- greg

M. Ranjit Mathews

Feb 11, 2003, 8:08:43 PM
lin...@pbm.com (Greg Lindahl) wrote:

Imagine, if you will, a rack with 8 5U units, with each unit
accommodating 8 blade servers. Now, the rack has 64 blade servers. Put
a 16 port switch into each of the 8 units. With 8 ports connected to 8
blade servers and 8 ports going out from the switch to the back of the
unit, we have the first level of a Myrinet-like tree. Now, put a 2U
switch with 8x8=64 ports in the middle of the rack, with the switch
being internally wired in a Myrinet fashion, which would require a
number of Gigabit switches inside it. Run cables from the 64 ports at
the back of the 8 5U units to these 64 ports.

Latency, as you have pointed out, would be on the high side, but there
might be many workloads that are not affected adversely by this. To
make the system more reliable, duplicate the ports and switches so
that each blade has two Gigabit ports going to two switches internal
to the unit holding the blade, with 16 ports at the back of each unit,
giving a total of 128 ports. These would be switched by two
Myrinet-like switches, each being 2U high and 9 1/2" wide (half of
19"), and located in the middle of the rack.

> -- greg

Jan C. Vorbrüggen

Feb 12, 2003, 9:02:24 AM
> 2. All gigabit switches are store-and-forward, which means you have 15
> usec of latency per switch for a 1500 byte packet.

Spit! Why that? Wormhole routing has been in products for how many decades?

Jan

Greg Lindahl

Feb 12, 2003, 10:46:55 AM
In article <3E4A53F0...@mediasec.de>,

I suspect the main reason is that the average customer doesn't give a
crap about very low latency. Cut-through switches were available for
ethernet and fast ethernet, but in those cases the store-and-forward
latency (150us for fast ethernet) was high enough that customers
cared.

-- greg

Michael S

Feb 12, 2003, 4:36:07 PM
lin...@pbm.com (Greg Lindahl) wrote in message news:<3e4a6beb$1...@news.meer.net>...

Cut-through switches _were_ available for ethernet and Fast Ethernet.
We talked with a Broadcom representative yesterday. He claims that all
newer Fast Ethernet switches are store-and-forward.

Terje Mathisen

Feb 12, 2003, 5:01:20 PM
Michael S wrote:
> Cut-through switches _were_ available for ethernet and Fast Ethernet.
> We talked with a Broadcom representative yesterday. He claims that all
> newer Fast Ethernet switches are store-and-forward.

If so, then it might be because they all seem to support dual speed and
half/full duplex as well.

In particular, any speed shift requires s&f.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Michael S

Feb 13, 2003, 9:07:21 AM
Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>...
> Brian Catlin wrote:
> > I've been trying to figure out why the Alpha is so fast. From what I can tell:
>
> > 3. It doesn't contain a huge number of gates

> > 4. The pipeline isn't very deep
>
> This is _good_, not bad!
>

Why do you think that a not-very-deep pipeline is a good idea?
Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
hands-down winner.

I would like to compare all high-end CPUs on SPEC/sq.mm, SPEC/MGates,
SPEC/$ and SPEC/Wt. Unfortunately I don't know where to find the
required data.
Doesn't the P-IV score the best or near-best marks in all categories,
both for SPECint and for SPECfp? That despite not being tuned for
SPECfp relative to the Alpha, Itanium or Power4 (of course it is
more tuned for FP than the Athlon). Isn't its deeeeeep pipeline the main
reason for the good scores?

Attention. Now I am going to say something I have no clue about.
One. Two. Three...
The decrease in the ratio of dynamic power to leakage power
consumption makes higher clock rates and deeper pipelines more and
more competitive. Even for the SPEC/Wt marks.
Ugh...

Bernd Paysan

Feb 13, 2003, 9:38:40 AM
Michael S wrote:
>> > 3. It doesn't contain a huge number of gates
>> > 4. The pipeline isn't very deep
>>
>> This is _good_, not bad!
>>
> Why do you think that a not-very-deep pipeline is a good idea?
> Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
> hands-down winner.

No, it's the high clock rate that lets it win. If you get a good clock rate
from a much shorter pipeline, this is good, not bad. EV7 now is 0.18u, and
it definitely beats 0.18u Pentium 4s despite having a smaller core and a
shorter pipeline (the peripheral parts - the larger L2, memory controllers,
ccNUMA links, cache coherence directories and all that stuff - make it a
larger chip, though).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Terje Mathisen

Feb 13, 2003, 10:23:37 AM
Michael S wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>...
>
>>Brian Catlin wrote:
>>
>>>I've been trying to figure out why the Alpha is so fast. From what I can tell:
>>
>>>3. It doesn't contain a huge number of gates
>>>4. The pipeline isn't very deep
>>
>>This is _good_, not bad!
>>
> Why do you think that a not-very-deep pipeline is a good idea?
> Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
> hands-down winner.

The shorter the pipeline, the lower your branch miss penalty.

I'm willing to bet that the Intel architects agonized over every
additional pipeline stage they had to add, only doing so when simulation
showed that across their benchmark applications, it would be a net win.

However, if you can keep the same clock freq and use fewer pipe stages,
you'll be significantly faster on twisty/unpredictable codes.

Peter da Silva

Feb 13, 2003, 11:15:10 AM
In article <f881b862.03021...@posting.google.com>,

Michael S <already...@yahoo.com> wrote:
> Why do you think that a not-very-deep pipeline is a good idea?

A deeper pipeline generally allows you to run with a higher clock speed
with a given process. That's the whole point: you simplify the stages,
they have less work to do, they run faster.

The problem is that with a 10-stage pipeline instead of a 6-stage pipeline,
if you have a branch miss you're out of action for 10 cycles instead of 6.

The bottom line is, a longer pipeline lets you increase the clock speed,
but decreases the overall performance per clock. And higher clock speed
has its own problems.

Bernd Paysan

Feb 13, 2003, 12:09:39 PM
Terje Mathisen wrote:
> I'm willing to bet that the Intel architects agonized over every
> additional pipeline stage they had to add, only doing so when simulation
> showed that across their benchmark applications, it would be a net win.

Maybe. Or maybe the marketing manager said "Look, we lost the GHz race to
AMD. Make sure that the Pentium 4 will win the GHz race. There's an easy
trick: Remember that the technical people at AMD won't extend the pipeline
further down when it shows that across their benchmark applications it will
be a loss. So go that far, and we'll always be ahead of AMD in terms of
GHz. GHz sells, performance is not that important."

Sander Vesik

Feb 13, 2003, 1:07:15 PM
Michael S <already...@yahoo.com> wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>...
>> Brian Catlin wrote:
>> > I've been trying to figure out why the Alpha is so fast. From what I can tell:
>>
>> > 3. It doesn't contain a huge number of gates
>> > 4. The pipeline isn't very deep
>>
>> This is _good_, not bad!
>>
>
> Why do you think that a not-very-deep pipeline is a good idea?
> Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
> hands-down winner.

If you reduced the pipeline length on P4 while leaving it otherwise the
same, its performance would probably go up. A long pipeline is a means to
an end (high clock rate).

To see why a deep pipeline might not be good, compare P4 and P3 at the same
or similar frequency, or take a look at the frequency where P4 passes P3
in performance.

--
Sander

+++ Out of cheese error +++

Mike Haertel

Feb 13, 2003, 1:22:17 PM
In article <b2ggae$q...@web.eng.baileynm.com>, Peter da Silva wrote:
> The problem is that with a 10-stage pipeline instead of a 6-stage pipeline,
> if you have a branch miss you're out of action for 10 cycles instead of 6.

Curiously enough, on the Pentium 4 the length of the pipeline is *not*
always the dominant cost of a branch mispredict. In fact it's often
enough not the dominant factor that adding more pipestages to the
Pentium 4 pipeline would hurt performance less than you'd expect.
(That said, it's still bad to add pipe stages.)

In the Pentium 4, the dominant cost of a branch mispredict often
comes from out-of-order execution: the processor core is executing
even the branches out-of-order, and when a branch mispredict occurs,
the processor has to wait for older branches to get out of the way
before it can commit to the mispredicting branch. This often takes
quite a long time if the older branches are depending on cache misses.
--
Mike Haertel <mi...@ducky.net>

Nick Maclaren

Feb 13, 2003, 1:31:55 PM

In article <slrnb4noo...@ducky.net>,

Mike Haertel <mi...@ducky.net> writes:
|>
|> In the Pentium 4, the dominant cost of a branch mispredict often
|> comes from out-of-order execution: the processor core is executing
|> even the branches out-of-order, and when a branch mispredict occurs,
|> the processor has to wait for older branches to get out of the way
|> before it can commit to the mispredicting branch. This often takes
|> quite a long time if the older branches are depending on cache misses.

Now, that's taking out-of-order execution fairly seriously!

Most interesting.

It is obviously not impossible, in theory, to recommit without
waiting - but that starts to get quite hairy quite fast ....


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Terje Mathisen

Feb 13, 2003, 2:11:37 PM
Nick Maclaren wrote:
> In article <slrnb4noo...@ducky.net>,
> Mike Haertel <mi...@ducky.net> writes:
> |>
> |> In the Pentium 4, the dominant cost of a branch mispredict often
> |> comes from out-of-order execution: the processor core is executing
> |> even the branches out-of-order, and when a branch mispredict occurs,
> |> the processor has to wait for older branches to get out of the way
> |> before it can commit to the mispredicting branch. This often takes
> |> quite a long time if the older branches are depending on cache misses.
>
> Now, that's taking out-of-order execution fairly seriously!
>
> Most interesting.
>
> It is obviously not impossible, in theory, to recommit without
> waiting - but that starts to get quite hairy quite fast ....

Can you spell 'combinatorial explosion'? :-)

(I'm not sure I can, but I know that there are some ugly exponentials in
the amount of hw resources needed.)

Nick Maclaren

Feb 13, 2003, 3:12:19 PM
In article <b2gqla$p8m$1...@vkhdsu24.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>>
>> It is obviously not impossible, in theory, to recommit without
>> waiting - but that starts to get quite hairy quite fast ....
>
>Can you spell 'combinatorial explosion'? :-)

As in "quite hairy quite fast"? This is one of the reasons that I
don't believe that branch prediction can deliver what many people
claim that it will do. I don't know of any way of handling multiple
branches that doesn't have some such problem.

Michael S

Feb 13, 2003, 5:42:12 PM
Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b2gd9p$3v8$1...@vkhdsu24.hda.hydro.com>...

> Michael S wrote:
> > Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>...
> >
> >>Brian Catlin wrote:
> >>
> >>>I've been trying to figure out why the Alpha is so fast. From what I can tell:
>
> >>>3. It doesn't contain a huge number of gates
> >>>4. The pipeline isn't very deep
> >>
> >>This is _good_, not bad!
> >>
> > Why do you think that a not-very-deep pipeline is a good idea?
> > Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
> > hands-down winner.
>
> The shorter the pipeline, the lower your branch miss penalty.
>

Only partially true. The pipeline stages beyond the point where the
branch condition is evaluated do not necessarily increase the
branch miss penalty. That is very often the case for numeric codes.

> I'm willing to bet that the Intel architects agonized over every
> additional pipeline stage they had to add, only doing so when simulation
> showed that across their benchmark applications, it would be a net win.
>
> However, if you can keep the same clock freq and use fewer pipe stages,
> you'll be significantly faster on twisty/unpredictable codes.
>

That's exactly the point. A deep pipeline is the best means of increasing
the clock rate!

> Terje

Michael S

Feb 13, 2003, 6:39:33 PM
Sander Vesik <san...@haldjas.folklore.ee> wrote in message news:<1045159...@haldjas.folklore.ee>...

The right comparison is on similar semiconductor technology, not
at a similar frequency. A comparison like this shows that the P4 is
significantly faster than the P3.

What other means do you have for increasing the clock rate?

Let's assume a 10 stage pipeline. With an average of 2 instructions/clock,
1 branch in 6 instructions and a 90% branch prediction success rate, we
have one branch miss every 60 instructions. With a branch miss
penalty of 9 clocks, these 60 instructions take 39 clocks to execute.
Now let's add one pipeline stage. Now the miss penalty is 10 clocks and
the total execution time is 40 clocks. We have a net loss of 2.5%.
It's more than likely that with the additional pipeline stage we could
gain significantly more than 2.5% in clock rate.
Of course I simplified things down. However, it shows the point - if
the penalty of mispredicted branches is the only problem of the long
pipeline, then the increase in clock frequency more than compensates
for this penalty.
IMO, increased instruction latency is a more severe problem, at least
for reasonably iterative codes.
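
(His arithmetic generalizes to a one-line model: clocks per instruction =
1/IPC + branch_freq * miss_rate * penalty. A throwaway sketch in C with
the parameters as stated above:

    #include <stdio.h>

    int main(void) {
        const double ipc  = 2.0;        /* sustained instructions/clock */
        const double fbr  = 1.0 / 6.0;  /* branches per instruction     */
        const double miss = 0.10;       /* mispredict rate              */
        for (int stages = 10; stages <= 11; stages++) {
            double penalty = stages - 1.0;  /* clocks lost per miss     */
            double cpi = 1.0 / ipc + fbr * miss * penalty;
            printf("%d stages: %.4f clocks/insn\n", stages, cpi);
        }
        return 0;  /* prints 0.6500 vs 0.6667 - his 2.5% loss */
    }

On these numbers each extra stage costs about 2.5%, so it pays whenever
it buys more than 2.5% in clock rate - which is exactly his point.)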

Paul DeMone

Feb 14, 2003, 1:19:30 AM

Michael S wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<b1qcn4$ioh$1...@vkhdsu24.hda.hydro.com>...
> > Brian Catlin wrote:
> > > I've been trying to figure out why the Alpha is so fast. From what I can tell:
> >
> > > 3. It doesn't contain a huge number of gates
> > > 4. The pipeline isn't very deep
> >
> > This is _good_, not bad!
> >
>
> Why do you think that a not-very-deep pipeline is a good idea?
> Looking at the latest SPEC numbers, the very deep pipeline (P-IV) is the
> hands-down winner.

No one has disclosed the SPEC scores of an aggressive short pipe
design built in a 0.13 um semi technology yet. Instead, compare
McKinley and EV68 SPEC numbers to those of the 0.18 um P4, the
Willamette.


--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.

Paul DeMone

Feb 14, 2003, 1:34:22 AM

Sander Vesik wrote:
[..]


> If you reduced the pipeline length on P4 while leaving it otherwise the

By "leaving it otherwise" do you mean running at the same maximum
frequency in a given technology? That's quite a trick.


> same, its performance would probably go up. A long pipeline is a means to
> an end (high clock rate).

Yes, a long pipeline is a means to attain high clock rates. But that
isn't the goal, high performance is. Compare the SPEC scores for the
fastest clocked PIII and P4 in the same process technology. The P4
outperforms PIII in 0.18 um for example by arguably a larger factor
than the P6 core outperformed the Pentium core.


>
> To see why a deep pipeline might not be good, compare P4 and P3 at the same
> or similar frequency

This is a meaningless comparison. It is like comparing wing
performance of a C130 and F15 at 200 knots instead of at the
two aircraft's respective typical cruising speed.


> or take a look at the frequency where P4 passes P3
> in performance.

Which is a lot lower than the P4 is capable of in a given
technology which is why the P4 is a higher performance
design.

Roberto Maria Avanzi Mocenigo

Feb 14, 2003, 3:40:51 AM
Sander Vesik wrote:
>
> To see why a deep pipeline might not be good, compare P4 and P3 at the same
> or similar frequency, or take a look at the frequency where P4 passes P3
> in performance.

I have some code that on a 1.5Ghz P4 performs exactly on par with a
600Mhz P3. And this code is extremely well optimized. Maybe this is
an extreme case, but still...

Roberto
--
/_/ Roberto Maria Avanzi (a.k.a. Mocenigo) <><
_/ Institut fuer Experimentelle Mathematik / Uni Essen
/ "The thing's hollow, and, oh my God! It's full of stars!"

Jan C. Vorbrüggen

Feb 14, 2003, 3:52:38 AM
> In particular, any speed shift requires s&f.

In general, yes - for the various Ethernets, no. Or at least, when you're
sending to a same or lower speed channel, you can wormhole, and only fall
back on s&f when sending to a faster channel. In any case, you need buffers
on the input side (access to the switch matrix won't be deterministic) -
but you don't _have_ to wait until all the data is there.

Jan

Kai Harrekilde-Petersen

Feb 14, 2003, 4:33:02 AM

If you can accept sending out packets with FCS errors, go ahead, make
it cut-through (or wormhole, as you call it).

But you cannot update your MAC address table until you know whether
the packet was OK - which means waiting to find out if the FCS was
correct or not. Essentially the same problem as with committing OoO
results into the register file on a normal CPU.

Note that cut-through isn't going to reduce the latency very much for
the non-highest speeds; the internal transmission time is fixed,
regardless of the interface speed.

For example, suppose you have a 16 port tri-speed switch, and say that
the internal bandwidth is 32Gbps, then the internal transfer time will
be on the order of 380nsec(*) for a 1518 byte packet, regardless of
the interface speed.

(*) Depending on the internal architecture you have to add a bit extra
for the address lookup, queueing, priority stuff etc. Say 600-800nsec
total.
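
(The 380 nsec presumably comes straight from serializing the packet
through the fabric:

    1518 bytes * 8 = 12144 bits
    12144 bits / 32 Gbit/s ~= 380 nsec

which matches the figure above, before the lookup/queueing extras.)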


Kai
--
Kai Harrekilde-Petersen <k...@vitesse.com> Opinions are mine...

Jan C. Vorbrüggen

Feb 14, 2003, 5:53:57 AM
> But you cannot update your MAC address table until you know whether
> the packet was OK - which means waiting to find out if the FCS was
> correct or not. Essentially the same problem as with committing OoO
> results into the register file on a normal CPU.

So you speculatively update the table...

> Note that cut-through isn't going to reduce the latency very much for
> the non-higest speeds; the internal transmission time is fixed,
> regardless of the interface speed.

The whole point of cut-through is to get the transmission on the outgoing
port started while the data is still streaming in on the ingoing port.
Internal delays are always considered negligible.

Jan

Terje Mathisen

Feb 14, 2003, 11:03:20 AM
Roberto Maria Avanzi Mocenigo wrote:
> Sander Vesik wrote:
>
>>
>> To see why a deep pipeline might not be good, compare P4 and P3 at the same
>> or similar frequency, or take a look at the frequency where P4 passes P3
>> in performance.
>
>
> I have some code that on a 1.5Ghz P4 performs exactly on par with a
> 600Mhz P3. And this code is extremely well optimized. Maybe this is
> an extreme case, but still...

That should be relatively easy to achieve for anything which is memory
latency limited, like a pointer chasing benchmark: With approximately
the same memory subsystem, you'd get more or less the same performance,
independent of other difference.

MemoryMorass

Feb 14, 2003, 9:08:17 PM
In article <jgjg2b...@miriam.mikron.de>,
Bernd Paysan <bernd....@gmx.de> wrote:

>Terje Mathisen wrote:
>> I'm willing to bet that the Intel architects agonized over every
>> additional pipeline stage they had to add, only doing so when simulation
>> showed that across their benchmark applications, it would be a net win.
>
>Maybe. Or maybe the marketing manager said "Look, we lost the GHz race to
>AMD. Make sure that the Pentium 4 will win the GHz race. There's an easy
>trick: Remember that the technical people at AMD won't extend the pipeline
>further down when it shows that across their benchmark applications it will
>be a loss. So go that far, and we'll always be ahead of AMD in terms of
>GHz. GHz sells, performance is not that important."

The following WAGs might offer little but a few giggles to
Mr. O'Connor's x86 compatriots, but I cannot resist:

First, consider Intel's tag line for the P4: "Performance where you need
it most." For typical desktop applications, wasting even 60% of peak
performance on branch mispredicts would probably have little effect
on perceived responsiveness. OTOH, most compute-intensive
applications (games, multimedia, etc.) are less prone to branch
misprediction. Few desktop/workstation users are primarily concerned
with applications that are CPU-limited (i.e., not memory [or I/O]
latency/bandwidth limited) and have difficult to predict branches. For
servers (which can be CPU-limited with difficult to predict branches),
TLP is available and the P4's limited SMT offers the potential to
effectively halve the pipeline length.

Some of the 'branchy' code (that is not friendly to cmov, branch
merging, et al.) is probably also limited by memory latency.
While P3's smaller core allows for more L2 cache, doubling L2
size would probably not halve the L2 miss rate for pointer-rich,
branchy code. Perhaps the tag line should have been "Performance
where we can actually provide it."

(Even with SMT, the P4 might be relatively weak on heavy database
use because of the slow building of traces. However, x86 server
vendors are necessarily conservative, so Intel can probably afford
to offer a less competitive product for this market.)

Second, consider that Intel can afford more than four times the
fixed costs relative to AMD (having 4x the sales volume, higher
average volume [in part from brand strength, in part from having
higher-end products], and seemingly more long-lasting
microarchitectures). An aggressive design that quadruples
development cost while only providing a 5% performance
boost can give Intel a net advantage over AMD because that
5% performance boost can mean a 10% boost in average
selling price.

Assuming equal production cost (and here Intel's design
investment [e.g., making testing/speed binning easier,
reducing the number of bad parts, speeding the learning
curve] and fab expertise might well allow a larger die to
be produced at equal incremental cost):

Intel profit = (1.1 * AMD_asp - incremental_cost) * 4 * AMD_vol
- 4 * AMD_dev_cost - other_fixed_costs
AMD profit = (AMD_asp - incremental_cost) * AMD_vol
- AMD_dev_cost - other_fixed_costs

If AMD breaks even, Intel profit > .4 * AMD_asp * AMD_vol

Ouch! And this is ignoring Intel's brand strength.

The P4 design is extremely aggressive (expensive) and
while it probably does not provide performance benefits
proportionate to the investment, it does provide some
benefits on average.

In article <b2ia5s$1ciu27$1...@ID-7242.news.dfncis.de>,
Roberto Maria Avanzi Mocenigo <moce...@exp-math.uni-essen.de>
wrote:


>I have some code that on a 1.5Ghz P4 performs exactly on par with a
>600Mhz P3. And this code is extremely well optimized. Maybe this is
>an extreme case, but still...

'extremely well optimized' is microarchitecturally dependent.
On a P4, three cmov instructions along with the associated
computation instructions might be faster than a single branch
is on a P3. The P4's smaller Dcache can also have a significant
impact. Even with a brilliant compiler and microarchitecturally
ideal algorithms, some problems will fit one microarchitecture
better than another. Just developing appropriate performance
targets is probably a challenging task for the highly paid
professional architects.

Nick Maclaren

Feb 15, 2003, 3:49:26 AM
In article <20030214210817...@mb-cj.aol.com>,

MemoryMorass <memory...@aol.com> wrote:
>In article <jgjg2b...@miriam.mikron.de>,
>Bernd Paysan <bernd....@gmx.de> wrote:
>
>First, consider Intel's tag line for the P4: "Performance where you need
>it most." For typical desktop applications, wasting even 60% of peak
>performance on branch mispredicts would probably have little effect
>on perceived responsiveness. OTOH, most compute-intensive
>applications (games, multimedia, etc.) are less prone to branch
>misprediction. Few desktop/workstation users are primarily concerned
>with applications that are CPU-limited (i.e., not memory [or I/O]
>latency/bandwidth limited) and have difficult to predict branches. ...

I am 90% certain that you are mistaken there.

A huge number of desktop/workstation users spend a lot of time waiting
for a compute intensive, unpredictable code to finish - the machine
may not spend more than 1% of its time in them, but the user spends
all of that time waiting for it. GUIs are the prime culprit, but
any excessively interactive application is likely to be similar, given
the way they are written nowadays.

If you have a typical domineering GUI or full-screen interface, that
forces you to wait before it prompts you for input and serialises 20+
interactions, a delay of seconds for each interaction is REALLY
irritating.

Jouni Osmala

Feb 15, 2003, 9:56:03 AM
> >>
> >> To see why a deep pipeline might not be good, compare P4 and P3 at the same
> >> or similar frequency, or take a look at the frequency where P4 passes P3
> >> in performance.
> >
> >
> > I have some code that on a 1.5Ghz P4 performs exactly on par with a
> > 600Mhz P3. And this code is extremely well optimized. Maybe this is
> > an extreme case, but still...
>
> That should be relatively easy to achieve for anything which is memory
> latency limited, like a pointer chasing benchmark: With approximately
> the same memory subsystem, you'd get more or less the same performance,
> independent of other difference.
>
> Terje

What about a way to make a 1.5 GHz P4 perform SLOWER than a 600 MHz P3?
Both have 256 KB of cache; the P4 has cache lines twice the size, while
the P3 has twice as many lines. So you walk a list whose elements are
smaller than the P3's cache line, placing the elements in such a way
that you waste most of your cache on unnecessary data because of the
P4's too-long cache lines, and you end up chasing pointers in main
memory - while on the P3 you chase those pointers in the L2 cache.
There is plenty of useful code that CAN get into these situations, in
certain circumstances. Now, newer P4s don't have this disadvantage ;)
(unless you enable SMT ;)
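
(A hedged sketch of the pathological layout, assuming 64-byte L2 lines
on the P4 and 32-byte lines on the P3, both caches 256 KB:

    /* each list node padded out to exactly one P4 cache line */
    struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)];
    };

Walking a list of 4096 such nodes touches 4096 * 64 B = 256 KB on the
P4 - the entire L2, half of it padding - but only 4096 * 32 B = 128 KB
worth of lines on the P3, since the traversal reads nothing but the
next pointer. The line sizes are assumptions for illustration.)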

Niels Jørgen Kruse

Feb 15, 2003, 10:24:32 AM
In the article <slrnb4noo...@ducky.net>, Mike Haertel <mi...@ducky.net>
wrote:

I don't quite understand the use of "commit" here. Is it a problem of being
able to update branch prediction state, or ...?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Mike Haertel

Feb 16, 2003, 11:46:44 AM
In article <JYs3a.63$bU5...@news.get2net.dk>, Niels Jørgen Kruse wrote:
>> In the Pentium 4, the dominant cost of a branch mispredict often
>> comes from out-of-order execution: the processor core is executing
>> even the branches out-of-order, and when a branch mispredict occurs,
>> the processor has to wait for older branches to get out of the way
>> before it can commit to the mispredicting branch. This often takes
>> quite a long time if the older branches are depending on cache misses.
>
> I don't quite understand the use of "commit" here. Is it a problem of being
> able to update branch prediction state, or ...?

In this context "commit to a mispredicting branch" means "begin executing
instructions fetched from the correct PC that the branch should have
gone to."

Nick Maclaren

Feb 16, 2003, 12:39:06 PM
In article <slrnb4vga0...@ducky.net>,

Which leads to a question which has puzzled me for a long time.
An alternative approach to branch prediction was to follow all
paths (or all likely ones), which obviously leads to a combinatoric
explosion. But there is a relatively cheap approach, which looks
as if it would match very well to current systems, and used to be
used. It is out of fashion, and I don't know why.

That is, all execution is in order, except for instruction decoding
and operand preloading (but to cache or 'hidden' registers). When
an uncertain branch is encountered, both paths are taken with some
sort of 'priority'. And, of course, actually needed operands get
top priority. Untaken paths get cancelled, and so on.

Now there are obviously problems, such as an intermediate update of
a register needed for addressing, and running out of cache bandwidth
and all that. But it would be a relatively simple way of improving
lookahead to multiple paths without causing TOO much complexity.
And, as I say, some systems did things along those lines.

Andy Glew

Feb 17, 2003, 6:34:16 AM
"Michael S" <already...@yahoo.com> wrote in message
news:f881b862.03021...@posting.google.com...

> Attention. Now I am going to say something I have no clue about.
> One. Two. Three...
> The decrease in the ratio of dynamic power to leakage power
> consumption makes higher clock rates and deeper pipelines more and
> more competitive. Even for the SPEC/Wt marks.
> Ugh...

Is this the case? Or, maybe it is the case, but is it desirable?

Deeper pipelines => more pipeline latches.
More pipeline latches => more leakage,
=> more power "wasted" in overhead, not actual computation.

The above effect is approximately linear in pipeline depth.

Assuming that the activity factor stays constant as the
pipeline gets deeper, the dynamic power would also
seem to scale linearly with pipeline depth.
So, with dynamic and static power scaling linearly with
pipeline depth, the dynamic/static ratio depends
mostly on the base microarchitecture.

But: does activity factor change linearly with pipeline
depth?

Some of the deep pipelines really are not all that
deep - or, rather, they are pipelined loops embedded in
other pipelined loops. E.g. the trace cache fill pipeline
is normally inactive. Really, the innermost pipeline
is from trace cache through execution to writeback.
Even waiting until retirement is a decoupled pipeline
with different characteristics.

So, the activity factor of the trace cache fill and
commit/retirement pipelines is considerably lower
than that of the innermost execution pipeline.
I'm tempted to say that only the innermost
linear pipelines have an activity factor
that scales as the pipeline depth, but even here
I am not 100% certain.

---

We need better terminology: not "pipeline", but "pipelined
network of decoupled pipelines".

Many years ago, an interviewer for Sun told me that
Bill Joy was a great pipeline designer. My immediate
reaction was that
that might explain why Sun has lagged other companies
so much in CPU architecture, if they were still thinking
at the level of pipelines and not networks of decoupled
pipelines.


Andy Glew

Feb 17, 2003, 7:58:53 AM
"Mike Haertel" <mi...@ducky.net> wrote in message
news:slrnb4vga0...@ducky.net...

I don't know how much of the Pentium 4 mechanism is public.
The P6 mechanism is pretty well known.
And we can talk abstractly.

There are several components of starting to execute the
correct path:

(1) Starting to fetch the correct path:
can be done almost immediately (as in P6)

(2) Repairing the dependency tracking state
for both registers (renamer) and memory:
* If using checkpointing, can be done almost immediately
* Ditto something like a history CAM
But, keeping an arbitrarily large number of full state
checkpoints doesn't scale. History CAMs scale, but
are pretty costly to begin with ("The first thing we'll
do is eliminate all of the CAMs.") So a more incremental
strategy is probably necessary.
Rebuilding the renamer state may therefore be a slightly
incremental process, taking time proportional to the distance
in number of "steps" between a checkpoint that can be restored
almost immediately, and the mispredict. Naively done, this
might be proportional to the number of instructions between
the checkpoint and the mispredict. Done better, it is proportional
to the number of values that have changed between the
checkpoint and the renamer.
However, it need not wait for long latency instructions
like cache misses to complete.

Note that there is usually only
a fixed number of registers, although there can be a large
number of memory locations.

Also note that you do not *need* to repair the renamer state
completely before starting to execute the correct path.
* You can speculate that some instructions are independent
of the wrong path - i.e. do not depend on values changed on
the wrong path - and start going.
NOTE: a branch misprediction recovery cache can
record such independence info
* You can certainly decode and rename new correct path
instructions, so long as you can complete the renaming out-of-order
- i.e. instructions whose dependencies might have been
modified on the wrong path can be inserted in the scheduler,
as long as they are marked dependent on renaming that has not
yet been completed. Instructions that are not dependent on these
can be executed immediately. Instructions dependent on these
can be fully renamed, but will be blocked until the partially
renamed instructions are fully renamed and can complete.
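
As a minimal sketch of the incremental repair idea - all structures
and sizes here are hypothetical, not any shipping design - a history
buffer walked backwards repairs only the mappings the wrong path
touched:

#include <stdio.h>

#define NREGS 8

typedef struct { int map[NREGS]; } Rat;          /* arch reg -> phys reg   */
typedef struct { int areg, old_phys; } HistEnt;  /* one rename undo record */

/* Walk the history buffer backwards from the youngest rename to the
 * mispredicted branch; cost is proportional to the number of wrong-path
 * renames, not to pipeline depth, and no cache miss is waited on. */
static void repair_rat(Rat *rat, const HistEnt *hist, int youngest, int branch)
{
    for (int i = youngest; i > branch; i--)
        rat->map[hist[i].areg] = hist[i].old_phys;
}

int main(void)
{
    Rat rat = { { 10, 11, 21, 13, 14, 25, 16, 17 } };  /* r2, r5 renamed   */
    HistEnt hist[] = { {0, 10}, {1, 11}, {2, 12}, {5, 15} }; /* undo log   */
    repair_rat(&rat, hist, 3, 1);  /* mispredicted branch = history idx 1  */
    printf("r2 -> p%d, r5 -> p%d\n", rat.map[2], rat.map[5]); /* 12 and 15 */
    return 0;
}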

Sorry, my daughter insists that I spend no more vacation time
on comp.arch right now. Briefly wrapping up:

* aggressive out-of-order renaming can start execution
before all rename repair is done.
Independence can be predicted; independence
can be detected via simple tagging schemes

(3) Finally, recovering resources associated with wrong path instructions
need not be done before correct path instructions start executing.
It can be done "in the background".


Therefore, there is no intrinsic reason why starting execution of correct
path instructions has to wait for all older branches to get out of the way,
and certainly not wait for cache miss latency.

To have done so is a design decision that may be appropriate, given
your simulation results.

But it's not a requirement.


Michael S

unread,
Feb 17, 2003, 12:39:35 PM2/17/03
to
"Andy Glew" <andy_gle...@yahoo.com> wrote in message news:<YM34a.767$jd7.18...@newssvr13.news.prodigy.com>...

> "Michael S" <already...@yahoo.com> wrote in message
> news:f881b862.03021...@posting.google.com...
>
> > Attention. Now I am going to say something I have no clue about.
> > One. Two. Three...
> > The decrease in the ratio of dynamic power to leakage power
> > consumption makes higher clock rates and deeper pipelines more and
> > more competitive. Even for the SPEC/Wt marks.
> > Ugh...
>
> Is this the case? Or, maybe it is the case, but is it desirable?
>
> Deeper pipelines => more pipeline latches.
> More pipeline latches => more leakage,
> => more power "wasted" in overhead, not actual computation.
>
> The above effect is approximately linear in pipeline depth.
>
> Assuming that the activity factor stays constant as the
> pipeline gets deeper, the dynamic power would also
> seem to scale linearly with pipeline depth.
> So, with dynamic and static power both scaling linearly with
> pipeline depth, the dynamic/static ratio depends
> mostly on the base microarchitecture.
>
>

Since I started to talk about something I have no clue about, it's only
logical to continue... Maybe I will end up with no clue but at least
a bit of Glew.

A pipeline is built of latches and combinatorial logic between the
latches. Deeper pipelines => more pipeline latches, but the amount of
the combinatorial logic is almost unchanged. I assume that both
latches and combinatorial logic have leakage. If that is right, the
static power does scale linearly with pipeline depth (P = Pcmb +
Platch * N), but it is not _proportional_ to the pipeline depth. So
some saving is still there.
When I say saving, I compare the deep-pipeline approach with the
alternative - adding more parallel execution units. With parallel
execution units the static power is proportional to the number of
units, while the dynamic power is "less than proportional", because on
average the units are less busy.
Then we can consider deep-pipeline designs "leakage friendly" and
wide superscalar designs "switching friendly".
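
A toy calculation of that P = Pcmb + Platch * N model against the
parallel-units alternative (all constants invented):

#include <stdio.h>

/* Deep pipeline:  P_static = Pcmb + Platch * Nstages   (linear in depth,
 *                 but not proportional, because Pcmb does not grow).
 * Wide machine:   P_static = Nunits * (Pcmb + Platch * Nstages)
 *                 (proportional to the number of parallel units).       */
int main(void)
{
    const double Pcmb = 10.0, Platch = 0.5;
    for (int n = 10; n <= 40; n *= 2)
        printf("depth %2d: deep %5.1f   4-wide %5.1f\n",
               n, Pcmb + Platch * n, 4 * (Pcmb + Platch * 10));
    return 0;
}

Quadrupling the depth here only doubles the leakage, while four
parallel units quadruple it.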

Roberto Maria Avanzi Mocenigo

unread,
Feb 17, 2003, 12:39:29 PM2/17/03
to
Terje Mathisen wrote:
> Roberto Maria Avanzi Mocenigo wrote:
>
>> Sander Vesik wrote:
>>
>>>
>>> To see why deep pipeline might not be good compare P4 and P3 at the same
>>> or similar frequency, or take a look at the frequency where P4 passes P3
>>> in perfomance.
>>
>>
>>
>> I have some code that on a 1.5Ghz P4 performs exactly on par with a
>> 600Mhz P3. And this code is extremely well optimized. Maybe this is
>> an extreme case, but still...
>
>
> That should be relatively easy to achieve for anything which is memory
> latency limited, like a pointer chasing benchmark: With approximately
> the same memory subsystem, you'd get more or less the same performance,
> independent of other difference.

Uh, well, no. There are a lot of memory accesses, but the P4 machine has
more cache, and a faster FSB, DDR... I do not remember the specs, but the
machine seemed impressive. And I "scaled" down the P3. The P3 is a
1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
a 133 Mhz bus, that's all for the P3.

ciao

Terje Mathisen

unread,
Feb 17, 2003, 4:34:55 PM2/17/03
to
Roberto Maria Avanzi Mocenigo wrote:
> Terje Mathisen wrote:
>> That should be relatively easy to achieve for anything which is memory
>> latency limited, like a pointer chasing benchmark: With approximately
>> the same memory subsystem, you'd get more or less the same
>> performance, independent of other difference.
>
>
> Uh, well, no. There are a lot of memory accesses, but the P4 machine has
> more cache, and a faster FSB, DDR... I do not remember the specs, but the
> machine seemed impressive. And I "scaled" down the P3. The P3 is a
> 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
> a 133 Mhz bus, that's all for the P3.

OK, so what is the algorithm???

Michael S

unread,
Feb 17, 2003, 6:54:03 PM2/17/03
to
Roberto Maria Avanzi Mocenigo <moce...@exp-math.uni-essen.de> wrote in message news:<b2r6s5$1el095$1...@ID-7242.news.dfncis.de>...

> Terje Mathisen wrote:
> > Roberto Maria Avanzi Mocenigo wrote:
> >
> >> Sander Vesik wrote:
> >>
> >>>
> >>> To see why deep pipeline might not be good compare P4 and P3 at the same
> >>> or similar frequency, or take a look at the frequency where P4 passes P3
> >>> in perfomance.
> >>
> >>
> >>
> >> I have some code that on a 1.5Ghz P4 performs exactly on par with a
> >> 600Mhz P3. And this code is extremely well optimized. Maybe this is
> >> an extreme case, but still...
> >
> >
> > That should be relatively easy to achieve for anything which is memory
> > latency limited, like a pointer chasing benchmark: With approximately
> > the same memory subsystem, you'd get more or less the same performance,
> > independent of other difference.
>
> Uh, well, no. There are a lot of memory accesses, but the P4 machine has
> more cache, and a faster FSB, DDR... I do not remember the specs, but the
> machine seemed impressive. And I "scaled" down the P3. The P3 is a
> 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
> a 133 Mhz bus, that's all for the P3.
>
> ciao
> Roberto

I can think of two possible reasons:
1) Data set in range [8K..16K]
2) A large load after a series of small stores or a series of small
loads after a large store.

Reason 1) has nothing to do with a deep pipeline.

The effect of 2) is worsened by a deep pipeline. However, the problem
can be resolved completely with more complex forwarding HW. So
it's more a case of a P-IV-specific problem than a common deep
pipeline problem. BTW, in the case of the P-IV it is often possible to
replace these sequences with SSE2 Shift/Extract/Shuffle operations, as
recommended in Intel's optimization guide.
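
To illustrate pattern 2), a hypothetical fragment - two narrow stores
immediately followed by a wide load of the same bytes:

#include <stdint.h>

/* Two small stores followed by one large load from the same bytes.
 * The load needs data from both in-flight stores, so it cannot be
 * store-forwarded and stalls until the stores reach the cache.      */
uint32_t combine(uint16_t lo, uint16_t hi)
{
    union { uint16_t h[2]; uint32_t w; } u;
    u.h[0] = lo;   /* small store */
    u.h[1] = hi;   /* small store */
    return u.w;    /* large load: forwarding fails, long stall on P-IV */
}

On a shorter pipeline the same forwarding failure costs far fewer cycles.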

del cecchi

unread,
Feb 17, 2003, 9:52:38 PM2/17/03
to

"Michael S" <already...@yahoo.com> wrote in message
news:f881b862.03021...@posting.google.com...
> "Andy Glew" <andy_gle...@yahoo.com> wrote in message
news:<YM34a.767$jd7.18...@newssvr13.news.prodigy.com>...
> > "Michael S" <already...@yahoo.com> wrote in message
> > news:f881b862.03021...@posting.google.com...
> >
>
> Since I started to talk about something I have no clue about, it's only
> logical to continue... Maybe I will end up with no clue but at least
> a bit of Glew.
>
> A pipeline is built of latches and combinatorial logic between the
> latches. Deeper pipelines => more pipeline latches, but the amount of
> the combinatorial logic is almost unchanged. I assume that both
> latches and combinatorial logic have leakage. If that is right, the
> static power does scale linearly with pipeline depth (P = Pcmb +
> Platch * N), but it is not _proportional_ to the pipeline depth. So
> some saving is still there.

Varies linearly with == proportional to

> When I say saving, I compare the deep-pipeline approach with the
> alternative - adding more parallel execution units. With parallel
> execution units the static power is proportional to the number of
> units, while the dynamic power is "less than proportional", because on
> average the units are less busy.
> Then we can consider deep-pipeline designs "leakage friendly" and
> wide superscalar designs "switching friendly".

To a first approximation, more transistors means more leakage. Or at
least, more logic gates means more leakage. Many dynamic gates use
power even when not "switching"; so do latches and clock networks.

del cecchi


Peter Dickerson

unread,
Feb 18, 2003, 3:56:51 AM2/18/03
to
del cecchi wrote:
> "Michael S" <already...@yahoo.com> wrote in message
> news:f881b862.03021...@posting.google.com...
>> "Andy Glew" <andy_gle...@yahoo.com> wrote in message
> news:<YM34a.767$jd7.18...@newssvr13.news.prodigy.com>...
>>> "Michael S" <already...@yahoo.com> wrote in message
>>> news:f881b862.03021...@posting.google.com...
>>>
>>
>> Since I started to talk about something I have no clue about, it's
>> only logical to continue... Maybe I will end up with no clue but at
>> least a bit of Glew.
>>
>> A pipeline is built of latches and combinatorial logic between the
>> latches. Deeper pipelines => more pipeline latches, but the amount of
>> the combinatorial logic is almost unchanged. I assume that both
>> latches and combinatorial logic have leakage. If that is right, the
>> static power does scale linearly with pipeline depth (P = Pcmb +
>> Platch * N), but it is not _proportional_ to the pipeline depth. So
>> some saving is still there.
>
> Varies linearly with == proportional to

Sorry, Del, you are most definitely wrong on this one. y = m*x+c means y
varies linearly with x if m and c are constants, but y is only proportional
to x if c is 0.
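
A two-line numeric check of the distinction, with arbitrary m and c:

#include <stdio.h>

/* y = m*x + c with c != 0 is linear but not proportional:
 * doubling x does not double y, and y/x is not constant. */
int main(void)
{
    const double m = 2, c = 10;
    for (double x = 1; x <= 4; x *= 2)
        printf("x = %g  y = %g  y/x = %g\n", x, m * x + c, (m * x + c) / x);
    return 0;
}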

>> When I say saving, I compare the deep-pipeline approach with the
>> alternative - adding more parallel execution units. With
>> parallel execution units the static power is proportional to the
>> number of units, while the dynamic power is "less than
>> proportional", because on average the units are less busy.
>> Then we can consider deep-pipeline designs "leakage friendly" and
>> wide superscalar designs "switching friendly".
>
> To a first approximation, more transistors means more leakage. Or at
> least, more logic gates means more leakage. Many dynamic gates use
> power even when not "switching", also latches do and clock networks
> do.
>
> del cecchi

--
Peter
Peter.Dickerson (at) ukonline (dot) co (dot) uk


Martin Høyer Kristiansen

unread,
Feb 18, 2003, 4:25:19 AM2/18/03
to
Roberto Maria Avanzi Mocenigo wrote:
> Terje Mathisen wrote
<snip>

>>
>> That should be relatively easy to achieve for anything which is memory
>> latency limited, like a pointer chasing benchmark: With approximately
>> the same memory subsystem, you'd get more or less the same
>> performance, independent of other difference.
>
>
> Uh, well, no. There are a lot of memory accesses, but the P4 machine has
> more cache, and a faster FSB, DDR... I do not remember the specs, but the
> machine seemed impressive. And I "scaled" down the P3. The P3 is a
> 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
> a 133 Mhz bus, that's all for the P3.

How big was your inner loop? Would it be big enough to thrash the trace
cache of the P4, but small enough to fit in the 32KB Icache of the P3?

This would essentially make the P4 single-issue (or worse).

Cheers
Martin

Kai Harrekilde-Petersen

unread,
Feb 18, 2003, 5:10:33 AM2/18/03
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> Roberto Maria Avanzi Mocenigo wrote:
> > Terje Mathisen wrote:
> >> That should be relatively easy to achieve for anything which is
> >> memory latency limited, like a pointer chasing benchmark: With
> >> approximately the same memory subsystem, you'd get more or less the
> >> same performance, independent of other difference.
> > Uh, well, no. There are a lot of memory accesses, but the P4
> > machine has
> > more cache, and a faster FSB, DDR... I do not remember the specs, but the
> > machine seemed impressive. And I "scaled" down the P3. The P3 is a
> > 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
> > a 133 Mhz bus, that's all for the P3.
>
> OK, so what is the algorithm???

I have encountered a program that made a P4 2.4GHz look stupid
compared to a 700MHz Athlon (Slot-A).

P4: ~14 hours
Athlon: 6 hours.

The P4 has 1 * 512MB DDR2700 module, and the Athlon has 2 * 128MB of
PC133-SDRAM (I think).

The program is DAEQuality, which creates a 74-minute audio image used for
testing the error correction capabilities of a CDROM drive:
http://www.exactaudiocopy.de/eac13.html (Pages are in English. You
need to burn the image and make some specified scratches and changes to
the burned CD).

I don't have the source code, so the binary could easily have been
compiled for "optimal" Athlon performance and "pessimal" P4
performance.

Andy Glew

unread,
Feb 19, 2003, 10:43:16 AM2/19/03
to
> When I say saving, I compare the deep-pipeline approach with the
> alternative - adding more parallel execution units. With parallel
> execution units the static power is proportional to the number of
> units, while the dynamic power is "less than proportional", because on
> average the units are less busy.
> Then we can consider deep-pipeline designs "leakage friendly" and
> wide superscalar designs "switching friendly".


Some random thoughts:

Some modern machines spend fractions approaching 50%
of their power, both static and dynamic, in their pipeline latches.

The classic "parallel execution unit" argument for low power
is that you can have more slower execution units compared
to fewer faster execution units. And, the slower execution units
may consume drastically less power, both dynamic (because
slower may mean that you can reduce V, as in CV^2f)
and static (because slower means you can use slower
transistors, and leakage is exponential in transistor speed).

(Of course, if what you need is lower EU latency rather than
EU bandwidth, this doesn't help. But I think only cache misses
are really latency sensitive nowadays.)
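
A back-of-envelope version of that argument (the voltage and frequency
scaling factors are invented, just to show the shape of the trade):

#include <stdio.h>

/* Dynamic power scales as C*V^2*f.  If halving f lets you also drop V
 * by ~30%, one slow unit burns ~0.25x, so two slow units matching the
 * original throughput burn ~0.5x.  Illustrative numbers only.        */
int main(void)
{
    const double C = 1.0, V = 1.0, f = 1.0;
    double fast = C * V * V * f;                          /* one fast unit */
    double slow = C * (0.7 * V) * (0.7 * V) * (0.5 * f);  /* one slow unit */
    printf("one fast unit:                    %.3f\n", fast);
    printf("two slow units, same throughput:  %.3f\n", 2 * slow);
    return 0;
}

Whether a real design can trade V for f that aggressively is, of
course, exactly the question.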

If these statements are true, then it is hard to see deep pipelines
as leakage friendly.

Possibly, in a given technology, at a given transistor speed and Vcc,
deep pipelines *might* be leakage friendly, up until the point where
pipeline latches and clock-related stuff dominate the actual
combinational circuitry.


Roberto Maria Avanzi Mocenigo

unread,
Feb 19, 2003, 12:30:55 PM2/19/03
to
Terje Mathisen wrote:
[p4 slow]

>> Uh, well, no. There are a lot of memory accesses, but the P4 machine has
>> more cache, and a faster FSB, DDR... I do not remember the specs, but the
>> machine seemed impressive. And I "scaled" down the P3. The P3 is a
>> 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
>> a 133 Mhz bus, that's all for the P3.
>
>
> OK, so what is the algorithm???

MANY algorithms, actually.
The program computes scalar multiplications
on elliptic curves defined over a prime fields.
It uses either GMP 4.1 or anther library written
by me, which is extremely aggressive for the input
sizes which is handles, and is usually 3 times
faster than gmp for that kind of operations.

At least 10 different algorithms have been
tested and a horrible number of different
formulae for computing additions and doubling
on the curves themselves.


On top of it, we do the same with binary field
curves.

The Piv 1.5 Ghz takes invariably 30% more time
than the Piii 1Ghz to do each test.

For us, those processors are crap...

Roberto Maria Avanzi Mocenigo

unread,
Feb 19, 2003, 12:37:37 PM2/19/03
to
Martin Høyer Kristiansen wrote:

>
>
> How big was your inner loop ? Would it be big enough to thrash the trace
> cache of the P4, but small enough to fit in the 32KB Icache of the P3 ?

In most cases the data processed was just a very few Kbytes, not more
than 4. The code is pretty simple.

Roberto Maria Avanzi Mocenigo

unread,
Feb 19, 2003, 12:40:14 PM2/19/03
to
Roberto Maria Avanzi Mocenigo wrote:
> Terje Mathisen wrote:
> [p4 slow]
>
>>> Uh, well, no. There are a lot of memory accesses, but the P4 machine
>>> has
>>> more cache, and a faster FSB, DDR... I do not remember the specs, but
>>> the
>>> machine seemed impressive. And I "scaled" down the P3. The P3 is a
>>> 1Ghz one, but performed 40% better than the 1.5Ghz P4. 256 Kb cache and
>>> a 133 Mhz bus, that's all for the P3.
>>
>> OK, so what is the algorithm???
>
>
> MANY algorithms, actually.
> The program computes scalar multiplications
> on elliptic curves defined over a prime fields.
^^^
over prime fields.

> It uses either GMP 4.1 or anther library written

^^^^^
another

> by me, which is extremely aggressive for the input

aggressively optimised

> sizes which is handles, and is usually 3 times

^^^

which IT handles

> faster than gmp for that kind of operations.

(here I mean modular adds, subs, muls and inversions)

>
> At least 10 different algorithms have been
> tested and a horrible number of different

really horrible

> formulae for computing additions and doubling
> on the curves themselves.
>
>
> On top of it, we do the same with binary field
> curves.
>
> The Piv 1.5 Ghz takes invariably 30% more time
> than the Piii 1Ghz to do each test.
>

I have to post when I am less tired

Andy Glew

unread,
Feb 19, 2003, 10:46:08 AM2/19/03
to
> Varies linearly with == proportional to

Although that was what I was saying, I think
the guy who corrected me was right.

Linearly: y = ax+b
Proportional: y = ax
actually y1/y2 = x1/x2
which is not true for linear.

He made the same error in his discussion
of parallel execution units.

I stand corrected.
But I'll probably make the same mistake again.

Terje Mathisen

unread,
Feb 19, 2003, 1:02:54 PM2/19/03
to
Roberto Maria Avanzi Mocenigo wrote:
> MANY algorithms, actually.
> The program computes scalar multiplications
> on elliptic curves defined over a prime fields.
> It uses either GMP 4.1 or anther library written
> by me, which is extremely aggressive for the input
> sizes which is handles, and is usually 3 times
> faster than gmp for that kind of operations.
[snip]

> The Piv 1.5 Ghz takes invariably 30% more time
> than the Piii 1Ghz to do each test.
>
> For us, those processors are crap...

OK, that makes some sense:

One of the big Achilles' heels of the P4 is that both shifts and integer
MULs got trashed simultaneously.

The x86 gmp lib uses integer MULs a lot, since that was by far the best
method on all P6-class cpus.

I don't know if it would be possible to do better using MMX/SSE/SSE2
operations instead.

Michael S

unread,
Feb 20, 2003, 4:09:40 AM2/20/03
to
"Andy Glew" <andy_gle...@yahoo.com> wrote in message news:<4FN4a.2084$mT3.17...@newssvr21.news.prodigy.com>...

> > Varies linearly with == proportional to
>
> Although that was what I was saying, I think
> the guy who corrected me was right.
>
> Linearly: y = ax+b
> Proportional: y = ax
> actually y1/y2 = x1/x2
> which is not true for linear.
>
> He made the same error in his discussion
> of parallel execution units.

Ok. In my discussion of parallel execution units I claimed that the
total static (leakage) power for the parallel execution units is
proportional to the number of units. Do you say I forgot about some
important constant component? Very likely. Can you tell me which
one?

Andy Nelson

unread,
Feb 20, 2003, 6:05:25 AM2/20/03
to
Roberto Maria Avanzi Mocenigo <moce...@exp-math.uni-essen.de> wrote:
> <a long spelling flame of his own post>

Wow! I've never seen anyone produce such an in-depth
spelling flame of their own post before!


> I have to post when I am less tired

I'd say don't worry about it but get some sleep anyway.
You'll feel better for it. :-)

Cheers,

Andy

--
Andy Nelson School of Mathematics
an...@maths.ed.ac.uk University of Edinburgh
http://maths.ed.ac.uk/~andy Edinburgh Scotland EH9 3JZ U. K.

Anton Ertl

unread,
Feb 20, 2003, 12:22:51 PM2/20/03
to
Bernd Paysan <bernd....@gmx.de> writes:
>Terje Mathisen wrote:
>> I'm willing to bet that the Intel architects agonized over every
>> additional pipeline stage they had to add, only doing so when simulation
>> showed that across their benchmark applications, it would be a net win.
>
>Maybe. Or maybe the marketing manager said "Look, we lost the GHz race to
>AMD. Make sure that the Pentium 4 will win the GHz race. There's an easy
>trick: Remember that the technical people at AMD won't extend the pipeline
>further down when it shows that across their benchmark application it will
>be a loss. So go that far, and we'll always be ahead of AMD in terms of
>GHz. GHz sells, performance is not that important."

The latest ISCA proceedings contain some articles on pipeline depth,
and all indicate that the optimum (for performance) pipeline depth of
the CPUs they analysed would be significantly larger than it actually
is (52 stages for the Pentium 4, with a very flat curve in the
52-70-stage area). Note that the optimal pipeline depth depends on
the rest of the microarchitecture, in particular the branch predictor,
so with a better branch predictor the optimum depth might be even
larger. If you have time for only one of these papers, I would
recommend [sprangle&carmean02].
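
For intuition, here is a toy version of such a model - my own
simplification, NOT the formula from any of these papers: cycle time
is t_logic/n + t_latch, and a mispredict refills the n-stage pipe with
probability p per instruction:

#include <math.h>
#include <stdio.h>

/* tpi(n) = (t_logic/n + t_latch) * (1 + p*n), minimized at
 * n = sqrt(t_logic / (p * t_latch)).  Parameter values are invented. */
int main(void)
{
    const double t_logic = 80.0;  /* total logic depth, in FO4 delays */
    const double t_latch = 2.0;   /* latch overhead per stage, FO4    */
    const double p = 0.01;        /* pipeline flushes per instruction */
    for (int n = 4; n <= 64; n *= 2)
        printf("depth %2d: %5.2f FO4/instr\n",
               n, (t_logic / n + t_latch) * (1 + p * n));
    printf("analytic optimum: n = %.1f\n", sqrt(t_logic / (p * t_latch)));
    return 0;
}

With these (invented) parameters the curve is very flat past ~30
stages, which at least rhymes with the flat 52-70 stage region
reported above.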

@InProceedings{hartstein&puzak02,
author = {A. Hartstein and Thomas R. Puzak},
title = {The Optimum Pipeline Depth for a Microprocessor},
crossref = {isca02},
pages = {7--13},
annote = {Presents a formula for the performance of a
microprocessor when varying the pipeline length; the
optimum pipeline length can be derived from
this. Unfortunately there are two parameters in the
formulae that depend on the microarchitecture and
the workload, and these parameters cannot be
determined analytically, only empirically. The paper
also presents data from runs of a simulator with
different pipeline lengths, and different (but
hardly specified) workloads. The results match with
curves from the formula (after matching for the
missing parameters). One interesting result was that
the SPEC workloads had a shorter optimum pipeline
length than the other workloads used in the paper.}
}

@InProceedings{hrishikesh+02,
author = {M. S. Hrishikesh and Norman P. Jouppi and Keith
I. Farkas and Doug Burger and Stephen W. Keckler and
Premkishore Shivakumar},
title = {The Optimal Logic Depth per Pipeline Stage is 6 to 8
FO4 Inverter Delays},
crossref = {isca02},
pages = {14--24},
annote = {This paper takes a low-level simulator of the 21264,
varies the number of pipeline stages, uses this to
run a number of workloads (actually only traces from
them), and reports performance results for
them. With a latch overhead of about 2 FO4
inverters, the optimal pipeline stage length is
about 8 FO4 inverters (with work-load-dependent
variations). Discusses various issues involved in
quite some depth. In particular, this paper
discusses how to pipeline the instruction window
design (which has been identified as a bottleneck in
earlier papers).}
}

@InProceedings{sprangle&carmean02,
author = {Eric Sprangle and Doug Carmean},
title = {Increasing Processor Performance by Implementing
Deeper Pipelines},
crossref = {isca02},
pages = {25--34},
annote = {This paper starts with the Willamette (Pentium~4)
pipeline and discusses and evaluates changes to the
pipeline length. In particular, it gives numbers on
how lengthening various latencies would affect IPC;
on a per-cycle basis the ALU latency is most
important, then L1 cache, then L2 cache, then branch
misprediction; however, the total effect of
lengthening the pipeline to double the clock rate
gives the reverse order (because branch
misprediction gains more cycles than the other
latencies). The paper reports 52 pipeline stages
with 1.96 times the original clock rate as optimal
for the Pentium~4 microarchitecture, resulting in a
reduction of 1.45 of core time and an overall
speedup of about 1.29 (including waiting for
memory). Various other topics are discussed, such as
nonlinear effects when introducing bypasses, and
varying cache sizes. Recommended reading.}
}

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Terje Mathisen

unread,
Feb 20, 2003, 4:46:20 PM2/20/03
to
Anton Ertl wrote:
> Bernd Paysan <bernd....@gmx.de> writes:
>
>>Terje Mathisen wrote:
>>
>>>I'm willing to bet that the Intel architects agonized over every
>>>additional pipeline stage they had to add, only doing so when simulation
>>>showed that across their benchmark applications, it would be a net win.
>>
>>Maybe. Or maybe the marketing manager said "Look, we lost the GHz race to
>>AMD. Make sure that the Pentium 4 will win the GHz race. There's an easy
>>trick: Remember that the technical people at AMD won't extend the pipeline
>>further down when it shows that across their benchmark application it will
>>be a loss. So go that far, and we'll always be ahead of AMD in terms of
>>GHz. GHz sells, performance is not that important."
>
> The latest ISCA proceedings contain some articles on pipeline depth,
> and all indicate that the optimum (for performance) pipeline depth of
> the CPUs they analysed would be significantly larger than it actually
> is (52 stages for the Pentium 4, with a very flat curve in the
> 52-70-stage area). Note that the optimal pipeline depth depends on
> the rest of the microarchitecture, in particular the branch predictor,
> so with a better branch predictor the optimum depth might be even
> larger. If you have time for only one of these papers, I would
> recommend [sprangle&carmean02].
>
[snip]

> @InProceedings{sprangle&carmean02,
> author = {Eric Sprangle and Doug Carmean},
> title = {Increasing Processor Performance by Implementing
> Deeper Pipelines},
> crossref = {isca02},
> pages = {25--34},
> annote = {This paper starts with the Willamette (Pentium~4)
> pipeline and discusses and evaluates changes to the
> pipeline length. In particular, it gives numbers on
> how lengthening various latencies would affect IPC;
> on a per-cycle basis the ALU latency is most
> important, then L1 cache, then L2 cache, then branch
> misprediction; however, the total effect of
> lengthening the pipeline to double the clock rate
> gives the reverse order (because branch
> misprediction gains more cycles than the other
> latencies). The paper reports 52 pipeline stages
> with 1.96 times the original clock rate as optimal
> for the Pentium~4 microarchitecture, resulting in a
> reduction of 1.45 of core time and an overall
> speedup of about 1.29 (including waiting for
> memory). Various other topics are discussed, such as
> nonlinear effects when introducing bypasses, and
> varying cache sizes. Recommended reading.}
> }

Thanks!

I did note that one of those cited papers had found that Spec gave lower
optimums than their other benchmarks.

This probably means that the set of applications tested had very few
problems with unpredictable branches.

Mike Haertel

unread,
Feb 20, 2003, 10:06:39 PM2/20/03
to
In article <jgjg2b...@miriam.mikron.de>, Bernd Paysan wrote:
> Maybe. Or maybe the marketing manager said "Look, we lost the GHz race to
> AMD. Make sure that the Pentium 4 will win the GHz race. There's an easy
> trick: Remember that the technical people at AMD won't extend the pipeline
> further down when it shows that across their benchmark application it will
> be a loss. So go that far, and we'll always be ahead of AMD in terms of
> GHz. GHz sells, performance is not that important."

I assure you, that is *not* what happened at Intel with the Pentium 4.

There was absolutely no marketing-driven conspiracy to increase
clock rate purely for marketing reasons. We (the architecture team)
simply wanted to steal a page from Alpha's playbook, and have a
really high clock rate, and there was a lot of really difficult
circuit design was driven by that desire.

At the same time we were also trying for good performance/cycle,
and actually ended up just getting less than we'd hoped for in that
department (thanks to some architecture screwups).

Eventually it became clear that we had to push the frequency extra
hard to make up for the performance/cycle not being so good. So
the reason the P4's have obscenely high frequency is simply that
Intel didn't have a choice if it wanted to be able to sell them.
You can thank Intel's circuit designers and process developers (not
to mention the competition from Athlon) for that.

Now, normally you might not believe it when an Intel person says
there was no conspiracy, but in this case perhaps you should:
I no longer work for Intel, and in fact now work for AMD. So, if
in fact my old employer really did have an evil conspiracy against
my current employer, I wouldn't exactly be rushing to defend them.

Obviously this is my personal opinion and I am not speaking for
either my former or my current employer!
--
Mike Haertel <mi...@ducky.net>

Terje Mathisen

unread,
Feb 21, 2003, 2:16:11 AM2/21/03
to
Mike Haertel wrote:
> At the same time we were also trying for good performance/cycle,
> and actually ended up just getting less than we'd hoped for in that
> department (thanks to some architecture screwups).

SHL/SHR and MUL among them, or was that an intentional disaster?
:-)

I.e. if integer ops are calculated in a pipelined manner, so that the
most significant bits become available for subsequent operations later
than the lower bits, then this will obviously cause a stutter on any
rotate or shift right operation. SHL should still have been OK though.

Why it should cause problems for MUL is less obvious (at least to me).

Andy Glew

unread,
Feb 21, 2003, 2:41:25 AM2/21/03
to
> I did note that one of those cited papers had found that Spec gave lower
> optimums than their other benchmarks.
>
> This probably means that the set of applications tested had very little
> problems with unpredictable branches.
>
> Terje

Branch predictors have gotten better,
the Pentium 4's is supposed to be particularly good,
so good that it was a secret that I wasn't allowed
to know how it worked, when I returned to Intel
2000-2002.

However, probably the most important thing has
been a "move to the middle" wrt branch misprediction
rates on workloads, even on old branch predictors.

I still really like GCC, but even SpecIntYYYY is increasingly
dominated by codes that don't mispredict that much.
The increasing importance of multimedia, JPEG, MPEG, etc.
have reduced branch misprediction rates, at least in terms
of branch mispredictions per (thousand) instructions.
Branch mispredictions per (hundred) branches may not
be getting that much better as workloads evolve
(although they do as branch predictors evolve),
but when there are more instructions, the performance impact
is less. Also, cache misses dominate more and more,
diluting the effect of branch mispredictions in codes where
the mispredicted branch does not depend on the cache misses.

Conversely, there are fewer array crunching codes.
More and more scientific codes don't just loop
for thousands and thousands of iterations, but instead
are optimized, and use smart algorithms to skip parts
of the arrays, divide the arrays into small linked chunks,
etc. I.e. more and more scientific codes begin to look
like integer codes - except that the computational interludes
are still big chunks of FP.

Thus, branch misprediction rates across workloads have moved
to the middle.

Eric Young

unread,
Feb 21, 2003, 3:29:52 AM2/21/03
to
Terje Mathisen wrote:
> Mike Haertel wrote:
>
>> At the same time we were also trying for good performance/cycle,
>> and actually ended up just getting less than we'd hoped for in that
>> department (thanks to some architecture screwups).
>
>
> SHL/SHR and MUL among them, or was that an intentional disaster?
> :-)
>
> I.e. if integer ops are calculated in a pipelined manner, so that the
> most significant bits become available for subsequent operations later
> then the lower bits, then this will obviously cause a stutter on any
> rotate or shift right operation. SHL should still have been OK though.
>
> Why it should cause problems for MUL is less obvious (at least to me).

Improved imul latency is listed as one of the improvements
in the Prescott rev of the P4 micro-architecture, assuming
www.theinquirer.net got the details right.
Obviously Intel feels it is important enough to fix, in much
the same way the UltraSPARC III improved integer multiplication
over the UltraSPARC II.

eric

Mike Haertel

unread,
Feb 21, 2003, 6:03:17 AM2/21/03
to
In article <b34jnr$gf0$1...@vkhdsu24.hda.hydro.com>, Terje Mathisen wrote:
> Mike Haertel wrote:
>> At the same time we were also trying for good performance/cycle,
>> and actually ended up just getting less than we'd hoped for in that
>> department (thanks to some architecture screwups).
>
> SHL/SHR and MUL among them, or was that an intentional disaster?
>:-)

I dunno about "intentional disasters", but it's fair to say that we
(or at least some of us) were grinding our teeth when those particular
decisions were made.

Shifts are slow because the circuit designers couldn't fit them
into the high speed integer ALU's and still hit the target frequency,
so they were implemented in the "slow integer ops" unit, some
distance from the high speed logic, and at the slower medium clock
frequency. About 3/4 of the latency of shifts is concerned with
shipping the data across the die and changing clock domains twice.

In retrospect it may have been fortunate that the circuit designers
insisted on keeping the high speed ALU's lean, despite the cost
to shifts. When it became necessary to *really* push frequency
later on, less minimal ALU's could have made life very unpleasant.
(I actually left Intel in October 2000, shortly before the first P4
was introduced for sale, so I didn't get to see any of the subsequent
push to reach 2 GHz in .18u. I'm still impressed they managed that.)

> I.e. if integer ops are calculated in a pipelined manner, so that the
> most significant bits become available for subsequent operations later
> than the lower bits, then this will obviously cause a stutter on any
> rotate or shift right operation. SHL should still have been OK though.

Waiting for all of the bits to arrive from the staggered datapath accounts
for about 1 cycle of the shift latency. You are correct that we could
have handled left shifts faster, however it helped the design of the
scheduler to have the same latency for all slow integer ops.

> Why it should cause problems for MUL is less obvious (at least to me).

IMUL is slow for a more prosaic reason: the dedicated integer multiplier
was removed in a die diet. (You may have noticed that P4 is a
pretty big die. This is *after* the die diets!) Most of the cost
of IMUL is in shipping the bits across the die to the FP multiplier.
This hurts both in latency as well as extra uops used in shipping
the bits around. The extra uops mean IMUL is effectively depipelined,
even though the multiplier itself is pipelined.

Terje Mathisen

unread,
Feb 21, 2003, 8:22:09 AM2/21/03
to
Mike Haertel wrote:
> In article <b34jnr$gf0$1...@vkhdsu24.hda.hydro.com>, Terje Mathisen wrote:
>
>>Mike Haertel wrote:
>>
>>>At the same time we were also trying for good performance/cycle,
>>>and actually ended up just getting less than we'd hoped for in that
>>>department (thanks to some architecture screwups).
>>
>>SHL/SHR and MUL among them, or was that an intentional disaster?
>>:-)
>
>
> I dunno about "intentional disasters", but it's fair to say that we
> (or at least some of us) were grinding our teeth when those particular
> decisions were made.

:-)


>>I.e. if integer ops are calculated in a pipelined manner, so that the
>>most significant bits become available for subsequent operations later
>>than the lower bits, then this will obviously cause a stutter on any
>>rotate or shift right operation. SHL should still have been OK though.
>
> Waiting for all of bits to arrive from the staggered datapath accounts
> for about 1 cycle of the shift latency. You are correct that we could
> have handled left shifts faster, however it helped the design of the
> scheduler to have the same latency for all slow integer ops.

OK.

In retrospect, having a fast path for short (low count) SHL would have
been very nice, since it would have allowed the Pentium-class
workarounds for the int-fp-int multiplier to be used.

Even if this was only applicable for fixed-count versions, i.e. not
those using SHx reg,CL to specify a variable count, it would seem like
you could have done decoder tricks here:

Left shifts by 1 bit could use ADD instead; shifts by 2 or 3 bits could
be converted to LEA's as long as the next instructions didn't use any
resulting flags (Yes, I know that this would have caused some
'interesting' situations when single-stepping through the code in a
debugger).
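
For illustration, in C, with the substituted x86 forms in the comments
- a sketch of the idea, not a claim about what any shipping decoder
actually does:

/* The decoder tricks suggested above.  Note the LEA forms produce no
 * flags, which is the single-stepping weirdness mentioned. */
unsigned shl_tricks(unsigned x)
{
    unsigned a = x << 1;  /* add eax, eax     -- fast-ALU friendly */
    unsigned b = x << 2;  /* lea ebx, [eax*4] -- no flags produced */
    unsigned c = x << 3;  /* lea ecx, [eax*8] -- no flags produced */
    return a + b + c;
}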

>>Why it should cause problems for MUL is less obvious (at least to me).
>
>
> IMUL is slow for a more prosaic reason: the dedicated integer multiplier
> was removed in a die diet. (You may have noticed that P4 is a
> pretty big die. This is *after* the die diets!) Most of the cost
> of IMUL is in shipping the bits across the die to the FP multiplier.
> This hurts both in latency as well as extra uops used in shipping
> the bits around. The extra uops mean IMUL is effectively depipelined,
> even though the multiplier itself is pipelined.

Double ouch! This is Back to the Future (i.e. Pentium) with a vengeance:
Since all other operations have become much faster, the relative cost of
going back & forth between int and fp ALUs is even worse.

OTOH, it does leave Intel with a few obvious areas of improvement when
the core has gone through a shrink or two! :-)

Peter da Silva

unread,
Feb 21, 2003, 9:58:53 AM2/21/03
to
Andy Glew wrote:
> The increasing importance of multimedia, JPEG, MPEG, etc.
> have reduced branch misprediction rates, at least in terms
> of branch mispredictions per (thousand) instructions.

That surprises me. I would have thought that the branches taken in
a codec would depend heavily on the input, and for a good
compressor they would be more-or-less random, and fairly frequent.
--
`-_-' Har du kramat din varg, idag? (Have you hugged your wolf today?)
'U` Peter da Silva

Terje Mathisen

unread,
Feb 21, 2003, 12:18:07 PM2/21/03
to
Peter da Silva wrote:
> Andy Glew wrote:
>
>> The increasing importance of multimedia, JPEG, MPEG, etc.
>> have reduced branch misprediction rates, at least in terms
>> of branch mispredictions per (thousand) instructions.
>
>
> That surprises me. I would have thought that the branches taken in a
> codec would depend heavily on the input, and for a good compressor they
> would be more-or-less random, and fairly frequent.

For naive code: Yes!

However, since branches have been quite costly for a long time, sw
optimizers have figured out how to do many of these things in a
(semi-)branchless manner.

I.e. most of the tokens in a JPEG stream are small Huffman-encoded
numbers, which means that you can use the algorithms we discussed here a
month or two ago to extract variable-length tokens totally without branches.

In fact, using the approach I came up with, you can average more than
one token per extract operation, still without any branching.
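
A minimal sketch of that kind of branch-free extraction (the 8-bit
window and table layout are invented for the example; real JPEG tables
need an escape path for longer codes):

#include <stdint.h>

/* Peek a fixed window of bits, look up both the decoded symbol and the
 * true token length, then advance the bit pointer by that length.
 * No data-dependent branch anywhere in the loop.                      */
typedef struct { uint8_t symbol, length; } VlcEntry;

static uint32_t decode_run(const uint8_t *buf, const VlcEntry tab[256],
                           uint8_t *out, uint32_t ntokens)
{
    uint64_t bitpos = 0;
    for (uint32_t i = 0; i < ntokens; i++) {
        uint64_t byte = bitpos >> 3, shift = bitpos & 7;
        /* peek 8 bits starting at bitpos (buf needs 1 byte of padding) */
        uint32_t window = ((buf[byte] << 8) | buf[byte + 1]) >> (8 - shift);
        VlcEntry e = tab[window & 0xff];
        out[i]  = e.symbol;
        bitpos += e.length;   /* a data-dependent add, not a branch */
    }
    return (uint32_t)bitpos;
}

The token length feeds an add rather than a branch, so there is
nothing for the predictor to miss.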

Jason Watkins

unread,
Feb 21, 2003, 2:35:56 PM2/21/03
to
Peter da Silva <pe...@abbnm.com> wrote in message news:<b35euo$2njn$1...@jeeves.eng.abbnm.com>...

> Andy Glew wrote:
> > The increasing importance of multimedia, JPEG, MPEG, etc.
> > have reduced branch misprediction rates, at least in terms
> > of branch mispredictions per (thousand) instructions.
>
> That surprises me. I would have thought that the branches taken in
> a codec would depend heavily on the input, and for a good
> compressor they would be more-or-less random, and fairly frequent.

For an entropy coder, maybe. But even there the coder assigns symbols
according to frequency, so there is likely some coherence for the
predictor to find. With media formats like JPEG, most of the work is
in the transformation loops, specifically doing FFTs, DCTs or the
like. Highly regular code.

Andy Glew

unread,
Feb 21, 2003, 7:27:51 PM2/21/03
to

> > IMUL is slow for a more prosaic reason: the dedicated integer multiplier
> > was removed in a die diet. ... Most of the cost

> > of IMUL is in shipping the bits across the die to the FP multiplier.
> > This hurts both in latency as well as extra uops used in shipping
> > the bits around. The extra uops mean IMUL is effectively depipelined,
> > even though the multiplier itself is pipelined.
>
> Double ouch! This is Back to the Future (i.e. Pentium) with a vengeance:
> Since all other operations have become much faster, the relative cost of
> going back & forth between int and fp ALUs is even worse.

Yep. It's a perennial question: should your integer and floating point
datapaths be separate or unified? To whatever degree - you may not
do FP adds in something like Wmt's fast clocked integer ALUs, but
you might have them sitting right next door, on the same wire pitch,
able to use similar bypasses, etc.

Split means that you can design the FP and integer parts separately.

Unified means that some of the stuff you do to help integer automatically
helps FP, and vice versa.

Split doesn't help people who don't use FP. (Although, nowadays it isn't
really integer/FP - it's more like integer vs. FP/multimedia - and some integer
codes can be rewritten to use FP or multimedia type instructions. But,
some important customers don't use FP at all.)

Unified allows the multiplier and divider hardware to be shared.
Split allows multipliers and dividers to be hand crafted for exactly
what they need to do.

Split penalizes shipping data between FP and integer. E.g. FP based
branches.
Unified may end up penalizing everything a bit, because both are slowed
down by having to cross over the other.

There seem to be three attractors towards which designs gravitate:

(I)(F): Split, with macho integer and macho FP.
(IF): Unified, with macho integer and FP
although usually slightly less macho than (I)(F)
(i)(F): Split, with integer penalized and macho FP

One project management issue is that (I)(F) both macho split
can easily become (i)(F) integer penalized split, or even
(i)(f) both integer and FP penalized split, when die diets happen.


Andy Glew

unread,
Feb 21, 2003, 7:34:12 PM2/21/03
to
Peter da Silva wrote:
> Andy Glew wrote:
> > The increasing importance of multimedia, JPEG, MPEG, etc.
> > have reduced branch misprediction rates, at least in terms
> > of branch mispredictions per (thousand) instructions.
>
> That surprises me. I would have thought that the branches taken in
> a codec would depend heavily on the input, and for a good
> compressor they would be more-or-less random, and fairly frequent.
>
> 'U` Peter da Silva

Variable length decode can be written in a "word filter" manner,
so the unpredictable branches can be converted into updating address
pointers by a variable amount, with table lookups.

The rest of *PEG decode is pretty predictable, data dominated.

Variable length encode and quantization can be branchy,
but even here lookup techniques can be used to avoid mispredictions.
Terje: wanna chime in?
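
For example, one classic lookup trick of that kind, sketched in C (the
10-bit input range is an assumption for the example):

#include <stdint.h>

/* Saturate a small signed quantizer/IDCT result to [0,255] with one
 * table lookup instead of two poorly predicted compare-and-branches. */
static uint8_t clamp_tab[1024];           /* indexed by (value + 512) */

static void init_clamp(void)              /* run once at startup      */
{
    for (int v = -512; v < 512; v++)
        clamp_tab[v + 512] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

static inline uint8_t clamp(int v)        /* branch-free at run time  */
{
    return clamp_tab[v + 512];
}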

The biggest part of MPEG encode is motion estimation.
Computation dominates.


Terje Mathisen

unread,
Feb 22, 2003, 7:10:16 AM2/22/03
to
Andy Glew wrote:
> Peter da Silva wrote:
>> Andy Glew wrote:
>>> The increasing importance of multimedia, JPEG, MPEG, etc.
>>> have reduced branch misprediction rates, at least in terms
>>> of branch mispredictions per (thousand) instructions.
>>
>> That surprises me. I would have thought that the branches taken in
>> a codec would depend heavily on the input, and for a good
>> compressor they would be more-or-less random, and fairly frequent.
>>
>> 'U` Peter da Silva
>
> Variable length decode can be written in a "word filter" manner,
> so the unpredictable branches can be converted into updating address
> pointers by a variable amount, with table lookups.

This is exactly what I suggested some time ago.


>
> The rest of *PEG decode is pretty predictable, data dominated.
>
> Variable length encode and quantization can be branchy,
> but even here lookup techniques can be used to avoid mispredictions.
> Terje: wanna chime in?

Already did, about 6-8 hours before you asked me to. I guess NNTP still
isn't instantaneous across all servers worldwide, not even for the two
or three important groups like c.a. :-)

>
> The biggest part of MPEG encode is motion estimation.
> Computation dominates.

MPEG encode is actually a really interesting optimization problem: Done
naively (full brute force), it would take effectively forever, but would
always result in optimal encoding.

However, since the source material is really very much time-correlated
(after all, that's why we get such great compression rates), there are
all sorts of interesting approaches.

Off the top of my head:

- MIP-mapped images (i.e. lower resolution) to get approx. vectors
(sketched in C after this list).

- Memory:
+ Start close to the key image and search around it.
+ Remember all the vectors from the previous frame, use this
to calculate a probable location on the current. Can be
extended to higher-order functions as well.

- Scene changes: Detect when a particular clip ends, use this to
locate optimal placement for key frames.
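
Here is a skeleton of the MIP-map approach in C - sad(), the window
sizes, and the frame layout are all invented for the sketch, and
bounds checks are omitted (the frames are assumed padded):

#include <stdint.h>
#include <limits.h>

/* Sum of absolute differences over a w x w block. */
static int sad(const uint8_t *a, const uint8_t *b, int stride, int w)
{
    int s = 0;
    for (int y = 0; y < w; y++)
        for (int x = 0; x < w; x++) {
            int d = a[y * stride + x] - b[y * stride + x];
            s += d < 0 ? -d : d;
        }
    return s;
}

/* Exhaustive +/-r search around (*mx,*my); keeps the best vector. */
static void search(const uint8_t *cur, const uint8_t *ref, int stride,
                   int w, int r, int *mx, int *my)
{
    int best = INT_MAX, bx = *mx, by = *my;
    for (int dy = -r; dy <= r; dy++)
        for (int dx = -r; dx <= r; dx++) {
            int s = sad(cur, ref + (*my + dy) * stride + (*mx + dx),
                        stride, w);
            if (s < best) { best = s; bx = *mx + dx; by = *my + dy; }
        }
    *mx = bx; *my = by;
}

/* Coarse +/-16 search on 4x-downsampled frames, then a cheap +/-3
 * refinement at full resolution around the scaled-up vector.        */
static void mipmap_estimate(const uint8_t *cur4, const uint8_t *ref4, int s4,
                            const uint8_t *cur, const uint8_t *ref, int s,
                            int *mx, int *my)
{
    *mx = *my = 0;
    search(cur4, ref4, s4, 4, 16, mx, my);   /* coarse level    */
    *mx *= 4; *my *= 4;                      /* scale vector up */
    search(cur, ref, s, 16, 3, mx, my);      /* fine refinement */
}

The coarse pass covers an effective +/-64-pel range for the price of a
+/-16 search.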

Frode Vatvedt Fjeld

unread,
Feb 22, 2003, 7:33:44 AM2/22/03
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> MPEG encode is actually a really interesting optimization problem:
> Done naively (fully brute force) would take effectively forever, but
> always result in optimal encoding.

I actually implemented optimal MV encoding once, only it was for H.261
rather than MPEG. H.261 has a much more restricted motion-vector space
than MPEG, but still it took far from forever to compute the
vectors. That was, IIRC, using 32 CPUs of one of the earliest PPro
generations. There were serious optimizations that could be done
without sacrificing optimality.

--
Frode Vatvedt Fjeld

Terje Mathisen

unread,
Feb 22, 2003, 3:35:23 PM2/22/03
to
Frode Vatvedt Fjeld wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> writes:
>
>
>>MPEG encode is actually a really interesting optimization problem:
>>Done naively (fully brute force) would take effectively forever, but
>>always result in optimal encoding.
>
>
> I actually implemented optimal MV encoding once, only it was for H.261
> rather than MPEG. H.261 has a much more restricted motion-vector space
> than MPEG, but still it took far from forever to compute the

'still'?

Having an order of magnitude (or more?) less search space pretty much
had to make the problem easier, not harder, right?
:-)

> vectors. That was, IIRC, using 32 CPUs of one of the earliest PPro
> generations. There were serious optimizations that could be done
> withouth sacrificing optimalness.

You could probably do all sorts of early pruning of the search tree,
as soon as a given area can be proved to not contain an improvement on
the currently best choice, right?

Terje
PS. It is nice to notice more norwegians in c.arch. :-)

Frode Vatvedt Fjeld

unread,
Feb 22, 2003, 5:28:36 PM2/22/03
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> Having an order of magnitude (or more?) less search space pretty
> much had to make the problem easier, not harder, right?

Of course. I was in a hurry to catch the bus when I wrote that :-)

> You could probably do all sorts of early pruning of the search tree,
> as soon as a given area can be proved to not contain an improvement
> on the currently best choice, right?

Precisely. I'm not sure how easy it is to prove something about an
area regarding how many bits are required after DCT and entropy
encoding, although it might well be possible. But for each particular
vector examined, most of them will disqualify themselves rather soon
in the "encoding" process.
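
In code form, that disqualification can be a running comparison
against the best cost so far - a sketch, with a 16x16 block size
assumed:

#include <stdint.h>
#include <stdlib.h>

/* Abandon a candidate vector as soon as its partial SAD exceeds the
 * best full SAD found so far; most candidates die after a few rows. */
static int sad16_early(const uint8_t *a, const uint8_t *b, int stride,
                       int best_so_far)
{
    int s = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            s += abs(a[x] - b[x]);
        if (s >= best_so_far)   /* already worse: prune this vector */
            return s;
        a += stride;
        b += stride;
    }
    return s;
}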

--
Frode Vatvedt Fjeld

Peter da Silva

unread,
Feb 22, 2003, 5:47:05 PM2/22/03
to
Andy Glew wrote:
> Vaiable length decode can be written in a "word filter" manner,
> so the unpredictable branches can be converted into updating address
> pointers by a variable amount, with table lookups.

Thanks, and thanks Terje and Jason.

Followup question: I had speculated that one reason the XScale
seemed to do so poorly on video was that the long pipeline was
screwing up the codecs. I know the cache and the relatively slow
bus were also issues, but other memory-intensive operations
didn't seem to take as big a hit.

Is this just a matter of the codecs being written for the
StrongARM, so that less effort was spent on avoiding branches
because it had a shorter pipeline, or is something else going on?
--
`-_-' Ar rug tú barróg ar do mhactíre inniu? (Have you hugged your wolf today?)
'U` Peter da Silva

McCalpin

unread,
Feb 21, 2003, 8:01:24 AM2/21/03
to
In article <FKk5a.2547$4R1.45...@newssvr13.news.prodigy.com>,

Andy Glew <andy_gle...@yahoo.com> wrote:
>Branch mispredictions per (hundred) branches may not
>be getting that much better as workloads evolve
>(although they do as branch predictors evolve),
>but when there are more instructions, the performance impact
>is less. Also, cache misses dominate more and more,
>diluting the effect of branch mispredictions in codes where
>the mispredicted branch does not depend on the cache misses.

Yale Patt and I had a bit of a debate about this at some
conference (WWC-3, maybe?). He claimed that branch mispredictions
would be the dominant CPI component in the future, while I claimed
that limited sustained memory bandwidth would be the dominant CPI
component. (Note that memory latency is implicitly included in
sustained memory bandwidth.)

I still think I am right, though the efforts of many vendors
to increase sustainable memory bandwidth have made designs
more complex, and performance bottlenecks more subtle.

>Conversely, there are fewer array crunching codes.

A bold statement, though a potentially misleading one....

(Though the more detailed description below is certainly correct
for a significant part of the scientific/engineering application
space.)

>More and more scientific codes don't just loop
>for thousands and thousands of iterations, but instead
>are optimized, and use smart algorithms to skip parts
>of the arrays, divide the arrays into small linked chunks,
>etc. I.e. more and more scientific codes begin to look
>like integer codes - except that the computational interludes
>are still big chunks of FP.

This is certainly one trend, and these codes are much more
difficult to analyze from a performance point of view. It is
difficult to determine cache hierarchy interactions and more
difficult to analyze IPC possibilities. It is much more
difficult to understand these codes from a traditional "count the
FP operations" point of view. Of course, the data dependence of
the control flow makes it difficult to draw any conclusions in a
static/data-independent sense, and this makes it easy to decide
to simply give up on the performance prediction game.


However, another trend which is equally valid is the shift from
multidimensional arrays to unstructured meshes (represented as a
one-dimensional array with a set of supporting index arrays to
point to the neighbors).

Unstructured mesh codes are in many ways more "vector-like" than
the structured mesh codes that they replace. For example, there is
no obvious concept of "cache tiling" for a 1-D indirect reference
stream.


One consequence of having both trends happen at the same time
is that it is even harder to figure out the "correct" design
point(s) for future hardware systems. Of course, that has the
desirable side effect of making jobs for people like me, so I
guess I should not complain too loudly....
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Terje Mathisen

unread,
Feb 23, 2003, 1:35:47 AM2/23/03
to

Probably just an optimization problem: With all the 'free' shifts in
ARM, it would be very tempting to handle all bit-level operations with
inline code instead of lookup tables, particularly since RAM is usually
more limited on these systems.

Michael S

unread,
Feb 23, 2003, 4:52:55 AM2/23/03
to
Peter da Silva <pe...@abbnm.com> wrote in message news:<b38uok$qm3$1...@jeeves.eng.abbnm.com>...

AFAIR, variable-length decoding accounts for less than 10% of the
total CPU time for both JPEG and MPEG1. And that is in the presence of HW
floating point. For a fixed-point machine with relatively slow
multiply, like the XScale/StrongARM, the decoding should be dominated by
IDCT even more.
