Banias - the last hurrah of high IPC in mobile x86?

Michael S

Mar 24, 2003, 11:52:22 AM

Is Banias the last hurrah of the high-IPC approach in mobile
computing? I think so.

Here are the reasons:
Banias looks like a very competent design. The design team used most
known tricks to get the most out of each watt and invented several new
tricks as well. There are rumors that some design ideas are so novel
and useful that Intel is even afraid to patent them. Despite all this,
the chip is hot. The thing dissipates 25 W at 1.6 GHz! AFAIR that is about
the same as the 21064 when it was originally introduced. And the 21064 was
considered way too hot for a desktop back then.

Let's recall the challenges chip architects will face with even smaller
geometries and lower voltages.

1. Increase in leakage-to-dynamic power ratio.
A translation - non-switching logic gates are almost as bad as
switching ones. A solution - reduce the number of gates.

2. Interconnect dominates the clock budget.
A translation - taking things from here to there costs you. A solution
- use results close to the point where they are produced. A pipeline bypass
is better than a register read port.

3. Clock skew contributes to clock budget more than ever.
Possible solutions:
A. Clock slower.
B. Use source synchronization.

Points 1, 2 and 3B all lead to fast cycle times and long pipelines. A fast
cycle time increases the penalty of waiting for a load. A long pipeline
increases the penalty of a mispredicted branch. Both penalties are bad
because precious power is wasted. Is there a better way to fight the
problem than reduced parallelism? Maybe, but I don't know of one.
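
A rough way to see the tradeoff described above is a back-of-the-envelope CPI model. The sketch below is only an illustration: the function names and every number in it (branch frequency, misprediction rate, miss latency and so on) are made-up assumptions, not Banias or P6 data. It assumes a mispredicted branch costs about one pipeline refill, and that a load miss costs a fixed time in nanoseconds, so its cost in cycles grows with the clock.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, pipeline_depth,
                  load_freq, miss_rate, miss_latency_ns, clock_ghz):
    """Base CPI plus stall cycles from mispredicted branches and load misses."""
    # Assumption: a mispredicted branch costs about one pipeline refill.
    branch_stall = branch_freq * mispredict_rate * pipeline_depth
    # Assumption: miss latency is fixed in ns, so it costs more cycles at a faster clock.
    load_stall = load_freq * miss_rate * miss_latency_ns * clock_ghz
    return base_cpi + branch_stall + load_stall


def wasted_fraction(base_cpi, total_cpi):
    """Share of all cycles spent stalled rather than doing useful work."""
    return (total_cpi - base_cpi) / total_cpi


# Hypothetical workload: 20% branches, 5% mispredicted; 30% loads, 2% miss, 100 ns away.
short_slow = effective_cpi(1.0, 0.20, 0.05, 10, 0.30, 0.02, 100.0, 1.0)
long_fast = effective_cpi(1.0, 0.20, 0.05, 20, 0.30, 0.02, 100.0, 2.0)

for name, cpi, ghz in (("10 stages @ 1 GHz", short_slow, 1.0),
                       ("20 stages @ 2 GHz", long_fast, 2.0)):
    print(f"{name}: CPI {cpi:.2f}, {cpi / ghz:.2f} ns/instr, "
          f"{wasted_fraction(1.0, cpi):.0%} of cycles stalled")
```

With these invented numbers, doubling both the clock and the pipeline depth buys roughly 1.4x rather than 2x, and a larger share of the cycles (and of the power spent clocking the machine) goes to stalls.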

So the next mobile CPU, according to my prediction, will have:
1. Longer pipelines
2. Faster cycle time (it's actually too obvious to even mention)
3. A narrower design. In particular, it will avoid redundant execution
units.

I don't know much about Banias internals. However, we can guess that
several basic characteristics are inherited from the P6. Those are:
3-way decoding front end.
OoO execution engine capable of issuing 2 OPs + load + store every cycle.
3-way retirement.

I would expect something like this in the next mobile CPU:
2-way decoding front end or a 3-uOP-wide trace cache. I don't know if a
trace cache is the right way to go in a low-power design.
OoO execution engine capable of issuing 2 OPs + load _or_ store every
cycle. However, of the 2 OPs, one has to go to the fast integer ALU. Only
one FP/SIMD OP per cycle is allowed.
2- or 3-way retirement.

The OoO issue would be further restricted by two read ports of the
register file.

Double-precision FP will have lower throughput than the rest of the
units - 1/2 DP-FAdd/cycle, 1/2 or 1/3 DP-FMul/cycle.

Paul DeMone

Mar 24, 2003, 1:32:08 PM

Michael S wrote:

> Let's recall the challenges chip architects will face with even smaller
> geometries and lower voltages.
>
> 1. Increase in leakage-to-dynamic power ratio.
> A translation - non-switching logic gates are almost as bad as
> switching ones. A solution - reduce the number of gates.

If being smaller by several orders of magnitude is "almost as bad"
then I guess I'd have to agree.

Also, leakage power is strongly influenced by circuit topology.
Smarter design can yield significant savings in leakage power
without reducing functionality.

> 2. Interconnect dominates the clock budget.

It would if you designed a really bad microprocessor. But the
experienced MPU design teams that don't stratify or serialize
logic, circuit, and physical design seem to be successful in
keeping the proportion of interconnect delay on their critical
paths low.

> 3. Clock skew contributes to clock budget more than ever.

Compare papers on the design of the Willamette and Prescott.
It appears that clock skew is a smaller proportion of the clock
period of the newer chip unless it eventually clocks *far* faster
than current predictions.

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.

del cecchi

Mar 24, 2003, 7:39:39 PM

"Paul DeMone" <pde...@igs.net> wrote in message
news:3E7F4F28...@igs.net...

>
>
> Michael S wrote:
>
> > Let's recall the challenges chip architects will face with even
> > smaller geometries and lower voltages.
> >
> > 1. Increase in leakage-to-dynamic power ratio.
> > A translation - non-switching logic gates are almost as bad as
> > switching ones. A solution - reduce the number of gates.
>
> If being smaller by several orders of magnitude is "almost as bad"
> then I guess I'd have to agree.
>
> Also, leakage power is strongly influenced by circuit topology.
> Smarter design can yield significant savings in leakage power
> without reducing functionality.
>

snip
Hmm, Gordon Moore highlighted leakage as a big problem at ISSCC, as did
several other authors. And "several orders of magnitude" is not the
case for the aggressive microprocessor guys.

> Compare papers on the design of the Willamette and Prescott.
> It appears that clock skew is a smaller proportion of the clock
> period of the newer chip unless it eventually clocks *far* faster
> than current predictions.

On Itanium, the numbers I recall are about 25ps for I1, 60ps for I2, and
7ps (with a grain of salt) for I3. Again from Intel folks at ISSCC

del cecchi

Paul DeMone

Mar 24, 2003, 9:21:33 PM

del cecchi wrote:

> "Paul DeMone" <pde...@igs.net> wrote in message
> news:3E7F4F28...@igs.net...
> >
> >
> > Michael S wrote:
> >
> > > Let's recall the challenges chip architects will face with even
> > > smaller geometries and lower voltages.
> > >
> > > 1. Increase in leakage-to-dynamic power ratio.
> > > A translation - non-switching logic gates are almost as bad as
> > > switching ones. A solution - reduce the number of gates.
> >
> > If being smaller by several orders of magnitude is "almost as bad"
> > then I guess I'd have to agree.
> >
> > Also, leakage power is strongly influenced by circuit topology.
> > Smarter design can yield significant savings in leakage power
> > without reducing functionality.
> >
> snip
> Hmm, Gordon Moore highlighted leakage as a big problem at ISSCC, as did
> several other authors. And "several orders of magnitude" is not the
> case for the aggressive microprocessor guys.

I am not saying the trend isn't real or isn't a challenge. But there is a
big difference between saying something is a growing problem with each new
process generation and making a specific erroneous claim in the context
of a discussion of microarchitectural merits of two different 130 nm MPUs.

As far as the several orders of magnitude figure that was with regards to an
individual logic gate either switching or not switching. That is different
from the relative fraction of leakage and dynamic power in an entire MPU,
aggressive or otherwise.

>
>
> > Compare papers on the design of the Willamette and Prescott.
> > It appears that clock skew is a smaller proportion of the clock
> > period of the newer chip unless it eventually clocks *far* faster
> > than current predictions.
>
> On Itanium, the numbers I recall are about 25ps for I1, 60ps for I2, and
> 7ps (with a grain of salt) for I3. Again from Intel folks at ISSCC

AFAIK the 7 ps figure is for Prescott. For clock skew to be growing as
a portion of timing budget, at least in the P4 line, the Prescott would have
to be clocked at roughly 18+ GHz, far higher than the 4 to 5 GHz
estimates for it.
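
The arithmetic behind that comparison is just skew divided by clock period. A minimal sketch, using the 7 ps figure and the clock rates mentioned above; the function name is made up for illustration, and the Willamette skew that the 18 GHz break-even presumably derives from is not stated in this thread, so it is not reproduced here.

```python
def skew_share(skew_ps, clock_ghz):
    """Fraction of one clock period consumed by clock skew."""
    period_ps = 1000.0 / clock_ghz  # 1 GHz corresponds to a 1000 ps period
    return skew_ps / period_ps


for ghz in (4.0, 5.0, 18.0):
    print(f"{ghz:4.1f} GHz: period {1000.0 / ghz:6.1f} ps, "
          f"7 ps skew = {skew_share(7.0, ghz):.1%} of the cycle")
```

At 4 to 5 GHz a 7 ps skew is around 3% of the cycle; it only reaches about 12.6% at 18 GHz.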

Michael S

Mar 25, 2003, 4:42:05 AM

Paul DeMone <pde...@igs.net> wrote in message news:<3E7FBD2D...@igs.net>...

> del cecchi wrote:
>
> > "Paul DeMone" <pde...@igs.net> wrote in message
> > news:3E7F4F28...@igs.net...
> > >
> > >
> > > Michael S wrote:
> > >
> > > > Let's recall the challenges chip architects will face with even
> > > > smaller geometries and lower voltages.
> > > >
> > > > 1. Increase in leakage-to-dynamic power ratio.
> > > > A translation - non-switching logic gates are almost as bad as
> > > > switching ones. A solution - reduce the number of gates.
> > >
> > > If being smaller by several orders of magnitude is "almost as bad"
> > > then I guess I'd have to agree.
> > >
> > > Also, leakage power is strongly influenced by circuit topology.
> > > Smarter design can yield significant savings in leakage power
> > > without reducing functionality.
> > >
> > snip
> > Hmm, Gordon Moore highlighted leakage as a big problem at ISSCC, as did
> > several other authors. And "several orders of magnitude" is not the
> > case for the aggressive microprocessor guys.
>
> I am not saying the trend isn't real or isn't a challenge. But there is a
> big difference between saying something is a growing problem with each new
> process generation and making a specific erroneous claim in the context
> of a discussion of microarchitectural merits of two different 130 nm MPUs.
>

Oh, sorry. I didn't make it clear.
Originally I wanted to call the post "The tradeoffs of
microarchitecture for mobile computers of 2006." Then I went for a more
controversial header and the 2006 was dropped. Of course, I didn't mean
130nm or even 90nm. The speculations were about the next uArch for mobile
x86 (and other laptop-class CPUs).
I believe that the IPC chosen for Banias is near optimal at the targeted
feature sizes (i.e. .13 and .09). The next uArch (if somebody has
the resources to do it) will have to deal with the challenge of even
smaller geometry and even lower voltage.

>
> As far as the several orders of magnitude figure that was with regards to an
> individual logic gate either switching or not switching. That is different
> from the relative fraction of leakage and dynamic power in an entire MPU,
> aggressive or otherwise.
>

You are right. Thank you for the correction.
The correct statement is about logic blocks, not individual gates.
Since even in an active logic block most transistors are not switching,
the ratio between the power consumption of an active logic block and an
idle (but not shut down) logic block does shrink significantly with
smaller features and lower voltage.
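
To illustrate the point about blocks rather than gates, here is a toy model. Everything in it is an assumption for illustration: the function name, the activity factor (the fraction of gates switching per cycle), the switching energy and the per-gate leakage are invented values, not measurements of any real process.

```python
def active_to_idle_ratio(activity, switch_energy_fj, clock_ghz, leak_nw_per_gate):
    """Power of an active block divided by power of the same block idle but powered.

    Per gate: dynamic power = activity * energy per switch * frequency,
    and leakage is paid whether or not the block is doing work.
    """
    dynamic_nw = activity * switch_energy_fj * clock_ghz * 1000.0  # fJ * GHz = uW = 1000 nW
    return (dynamic_nw + leak_nw_per_gate) / leak_nw_per_gate


# Hypothetical block: 10% of gates switch each cycle, 1 fJ per switch, 2 GHz clock.
for leak_nw in (1.0, 10.0, 100.0):
    ratio = active_to_idle_ratio(0.10, 1.0, 2.0, leak_nw)
    print(f"leakage {leak_nw:5.1f} nW/gate -> active/idle = {ratio:.0f}x")
```

As per-gate leakage rises relative to switching energy, the gap between an active block and an idle one collapses from hundreds of times to a few times, which is the effect described above.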

There are counterexamples, like the low-voltage, low-leakage logic
applied by TI in the MSP430. However, the MSP430 is incredibly slow.

>
> > > Compare papers on the design of the Willamette and Prescott.
> > > It appears that clock skew is a smaller proportion of the clock
> > > period of the newer chip unless it eventually clocks *far* faster
> > > than current predictions.
> >
> > On Itanium, the numbers I recall are about 25ps for I1, 60ps for I2, and
> > 7ps (with a grain of salt) for I3. Again from Intel folks at ISSCC
>
> AFAIK the 7 ps figure is for Prescott. For clock skew to be growing as
> a portion of timing budget, at least in the P4 line, the Prescott would have
> to be clocked at roughly 18+ GHz, far higher than the 4 to 5 GHz
> estimates for it.

How do chip designers reduce clock skew? I can think of the following
means:
1) Improve design layout. Match the physical length of the branches with
better precision.
2) Improve load balancing by insertion of repeaters.
3) Improve load balancing by matching the passive load of the
branches.
4) Use stronger clock drivers.

(1) is applicable to any design expecting high volumes in production.
(2) and (4) don't sound attractive for a low-power chip. (3) can be
hard to apply if you want to switch clocks on and off on the basis of
relatively small modules.
Overall, it looks like it is harder to get low global clock skew in
a low-power design.

Del Cecchi

Mar 25, 2003, 1:14:23 PM

In article <3E7FBD2D...@igs.net>,
Paul DeMone <pde...@igs.net> writes:
|>
|>

snip

I'm not up on the code names, but you are correct that the paper 19.7 at ISSCC
was discussing a 90 nm IA (I presume that means IA32) part with a maximum
frequency of 7 GHz. I got that crossed with the papers about Itanium3. Sorry
for any confusion.

Quite an interesting paper, even if my processor design buddies aren't impressed.
:-)

--

Del Cecchi
cec...@us.ibm.com
Personal Opinions Only

Paul DeMone

Mar 25, 2003, 3:04:30 PM

Del Cecchi wrote:

> [...]

> I'm not up on the code names, but you are correct that the paper 19.7 at ISSCC
> was discussing a 90 nm IA (I presume that means IA32) part with a maximum
> frequency of 7 GHz. I got that crossed with the papers about Itanium3. Sorry
> for any confusion.
>
> Quite an interesting paper, even if my processor design buddies aren't impressed.
> :-)

Give them time. According to a paper in the IBM tech journal a few years
back some of your processor buddies were just starting to figure out the
deeper technical lessons of the EV6 implementation five years after it was
publically disclosed.

Del Cecchi

Mar 25, 2003, 4:17:07 PM

In article <3E80B64E...@igs.net>,

Paul DeMone <pde...@igs.net> writes:
|>
|>
|> Del Cecchi wrote:
|>
|> > [...]
|> > I'm not up on the code names, but you are correct that the paper 19.7 at ISSCC
|> > was discussing a 90 nm IA (I presume that means IA32) part with a maximum
|> > frequency of 7 GHz. I got that crossed with the papers about Itanium3. Sorry
|> > for any confusion.
|> >
|> > Quite an interesting paper, even if my processor design buddies aren't impressed.
|> > :-)
|>
|> Give them time. According to a paper in the IBM tech journal a few years
|> back some of your processor buddies were just starting to figure out the
|> deeper technical lessons of the EV6 implementation five years after it was
|> publically disclosed.
|>

I guess they have figgerd out a couple of things in the meantime. And which
journal are you referring to? The fluff book that comes out with a new machine, the J
of R&D, IBM Research Journal?

Gee, I haven't taken a shot at Alpha in a long time. Does this mean I get a
freebie?

Paul DeMone

Mar 25, 2003, 5:50:45 PM

Del Cecchi wrote:

> In article <3E80B64E...@igs.net>,
> Paul DeMone <pde...@igs.net> writes:

> [...]

> |> > Quite an interesting paper, even if my processor design buddies aren't impressed.
> |> > :-)
> |>
> |> Give them time. According to a paper in the IBM tech journal a few years
> |> back some of your processor buddies were just starting to figure out the
> |> deeper technical lessons of the EV6 implementation five years after it was
> |> publically disclosed.
> |>
> I guess they have figgerd out a couple of things in the meantime. And which
> journal are you referring to? The fluff book that comes out with a new machine, the J
> of R&D, IBM Research Journal?

Allen, D.H. et al, "Custom Circuit Design as a Driver of Microprocessor
Performance", IBM Journal of Research and Development, Vol. 44, No. 6,
Nov 2000.

"..we expect cycle times, as measured in fanout-4 inverter delays, to dip
below 20 in the near future. We have shown by prototyping an entire short
pipe processor and the re-order structures needed to implement a more
complex microarchitecture, that cycle times are feasible today if the
circuits, logic, floorplan, and microarchitecture are designed hand in hand"

In other words IBM geniuses figured out how to implement an EV6 like
design using EV6 style logic/circuit/physical design synergy 3 or 4 years
after DEC published detailed descriptions of the EV6. Not that they got
anywhere close, the actual EV6 had a cycle time on the order of 12 logic
delays.
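
For a sense of scale, cycle time in gate delays converts to clock rate once a gate delay is fixed. The sketch below assumes, for simplicity, that the "logic delays" quoted for the EV6 are comparable to FO4 delays and that both designs sit on the same process; the 50 ps delay is an arbitrary placeholder, not a figure for any real process, and the function name is made up.

```python
def clock_ghz(gate_delays_per_cycle, gate_delay_ps):
    """Clock frequency in GHz for a cycle of N gate delays of the given length."""
    cycle_ps = gate_delays_per_cycle * gate_delay_ps
    return 1000.0 / cycle_ps


FO4_PS = 50.0  # placeholder value, chosen only to make the numbers round

for depth in (20, 12):
    print(f"{depth} delays/cycle -> {clock_ghz(depth, FO4_PS):.2f} GHz")

print(f"speed ratio: {clock_ghz(12, FO4_PS) / clock_ghz(20, FO4_PS):.2f}x")
```

Whatever the actual gate delay, the ratio is 20/12, so under these assumptions the 12-delay design clocks about 1.67x faster on the same process.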

>
> Gee, I haven't taken a shot at Alpha in a long time. Does this mean I get a
> freebie?

You know the expression about beating dead horses? :-(

del cecchi

Mar 25, 2003, 9:50:30 PM

"Paul DeMone" <pde...@igs.net> wrote in message

news:3E80DD45...@igs.net...
snip

>
> Allen, D.H. et al, "Custom Circuit Design as a Driver of
> Microprocessor Performance", IBM Journal of Research and Development,
> Vol. 44, No. 6, Nov 2000.
>
> "..we expect cycle times, as measured in fanout-4 inverter delays, to
> dip below 20 in the near future. We have shown by prototyping an entire
> short pipe processor and the re-order structures needed to implement a
> more complex microarchitecture, that cycle times are feasible today if
> the circuits, logic, floorplan, and microarchitecture are designed hand
> in hand"
>
> In other words IBM geniuses figured out how to implement an EV6 like
> design using EV6 style logic/circuit/physical design synergy 3 or 4
> years after DEC published detailed descriptions of the EV6. Not that
> they got anywhere close, the actual EV6 had a cycle time on the order
> of 12 logic delays.
>
> > Gee, I haven't taken a shot at Alpha in a long time. Does this mean
> > I get a freebie?
>
> You know the expression about beating dead horses? :-(
>

He was talking about the Star family, designed by about 20 guys on the
tundra. It wasn't even that custom. Certainly wasn't one of those
"optimize every single transistor even if it takes 5 years" jobs.

I don't beat dead horses. I have however been known to attempt to kick
a dead whale down the beach.

del cecchi

Andy Glew

Mar 26, 2003, 12:56:25 AM

> In other words IBM geniuses figured out how to implement an EV6 like
> design using EV6 style logic/circuit/physical design synergy 3 or 4 years
> after DEC published detailed descriptions of the EV6. Not that they got
> anywhere close, the actual EV6 had a cycle time on the order of 12 logic
> delays.

Give it a break, Paul.

I happen to know that one of EV8's "great new ideas" was a P6-style
register renamer - the quoted words being almost exactly those of an
EV8 team member. 10 years after P6 started, 6-7 years after P6 shipped.
Heck, I probably posted about HaRRM to this newsgroup while I was
at Illinois, circa 1987-1991. I know I mentioned it to DEC Alpha people
when they visited Illinois.

Design teams have different styles. Different priorities. Different
tradeoffs.
The IBM design team(s) are still viable, shipping good chips.

David Kanter

Mar 27, 2003, 12:46:18 AM

> I happen to know that one of EV8's "great new ideas" was a P6-style
> register renamer - the quoted words being almost exactly those of an
> EV8 team member. 10 years after P6 started, 6-7 years after P6 shipped.
> Heck, I probably posted about HaRRM to this newsgroup while I was
> at Illinois, circa 1987-1991. I know I mentioned it to DEC Alpha people
> when they visited Illinois.

What is HaRRM?

> Design teams have different styles. Different priorities. Different
> tradeoffs.
> The IBM design team(s) are still viable, shipping good chips.

Very true.

David Kanter

Singh, S.R.

Mar 27, 2003, 1:59:40 AM

David Kanter wrote:

> What is HaRRM?

(a) Hardware Register Renaming Mechanism. OoO Research that
Mr. Glew worked on while doing his MSc, I believe.

--
Singh, S.R. swaranrajsingh |at| hotmail |dot| com

If someone gives me a job, I can put,
"Opinions not of my employer" here.