P4/Netburst architecture is dead


Douglas Siebert

May 7, 2004, 4:58:10 PM
Intel confirmed information The Inquirer had written several months ago:
they are cancelling the P4 (effective with the next rev that was supposed
to be out in 2005, Tejas) in favor of their P6 architecture based Pentium
M core, for both desktops and servers. I would assume Prescott will stick
around long enough for them to get a 64 bit enabled Pentium M core ready.
Since MS is now looking at Q4 for SP2/Win64, they may not miss the market
by much on this.

The talk now is of dual Pentium M cores on desktop CPUs by the end of
next year. That makes sense, they are low power enough and small enough
to work well. One wonders what will become of hyperthreading? With the
ability to do dual core CPUs in 90nm and probably quad core in 65nm, it
doesn't look like there will be much need for all the work that would be
required to retrofit Pentium M for HT, not unless they can get a larger
performance benefit than HT was for the P4. Doesn't seem to be any word
yet on what will happen to the BTX form factor they were pushing to manage
all the heat output from Prescott and especially Tejas, and the migration
to socket 775 they are attempting to jumpstart soon. I'll bet it'll be
another dead end like socket 423.

Whatever anyone may say about AMD's impact on the marketplace in terms
of market share, they surely seem to be having an effect on the market
in terms of making Intel dance so far in 2004!

IA64 aficionados may want to note that The Inquirer mentioned today that
they are now hearing rumblings that some parts of the IA64 roadmap are
getting cancelled as well. Guess we'll see if their sources for that
are as good as their sources for the P4's cancellation were proven to be.

--
Douglas Siebert dsie...@excisethis.khamsin.net

When hiring, avoid unlucky people, they are a risk to the firm. Do this by
randomly tossing out 90% of the resumes you receive without looking at them.

Robert Myers

May 7, 2004, 8:10:38 PM
Douglas Siebert wrote:

<snip>

>
> Whatever anyone may say about AMD's impact on the marketplace in terms
> of market share, they surely seem to be having an effect on the market
> in terms of making Intel dance so far in 2004!
>

AMD gets the credit, or physics?

In any case, just the time for an upbeat puff piece on Craig Barrett

http://msnbc.msn.com/id/4892329/

from which it would be hard to tell that the end of the road for Intel's
current cash cow is even a pebble in Intel's shoe. "It Isn’t Just About
the PC," the article tells us, "Making microchips for the aging PC
industry isn’t a windfall anymore." Neither is making microchips for
servers, to make a reasonable inference from the chart that accompanies
that text.

To narrow the search down a bit further (yes, I've been cheating and
using Google news search on "Intel"), I tried "Intel and leakage." More
bad news not attributable to AMD, and not to heat, either: Soft errors
are apparently back in style:

http://www.eetimes.com/semi/news/showArticle.jhtml?articleID=19400052

or, if you don't want to be nailed down as to just what it is that is
causing the problems, "signal integrity:"

http://arstechnica.com/news/posts/1083010432.html

INTC up $0.49 on the day. Guess the news wasn't as bad as the Street
expected, or it hasn't yet really figured out what's going on.

RM

Andrew Reilly

May 7, 2004, 9:47:03 PM
On Fri, 07 May 2004 20:58:10 +0000, Douglas Siebert wrote:

> Intel confirmed information The Inquirer had written several months ago,
> they are cancelling the P4 (effective with the next rev that was supposed
> to be out in 2005, Tejas) in favor of their P6 architecture based Pentium
> M core, for both desktops and servers.

Are you sure that the Inquirer isn't quoting one N. Maclaren, from this
very august journal?

> The talk now is of dual Pentium M cores on desktop CPUs by the end of
> next year.

Yep. That's our Nick...

--
Andrew

del cecchi

May 7, 2004, 10:24:55 PM

"Robert Myers" <rmyer...@comcast.net> wrote in message
news:2AVmc.1030$iF6.152879@attbi_s02...

>
> or, if you don't want to be nailed down as to just what it is that is
> causing the problems, "signal integrity:"
>
> http://arstechnica.com/news/posts/1083010432.html
>
> INTC up $0.49 on the day. Guess the news wasn't as bad as the Street
> expected, or it hasn't yet really figured out what's going on.
>
> RM
>

Or they will save bucks by using one core design instead of two: more
EPS.

The arstechnica article was pretty specific about signal integrity from
coupling and stuff being a problem, and I didn't see any mention of soft
errors. They really aren't much of a problem in logic, although latch
design needs to be careful. And arrays just have to waste a little
space to keep the Qcrit up, or put in ECC, which they should have anyway.

Look into the effect of adjacent wires on capacitance. Now figure that
moving the aggressor doubles the effect. And add that, since the wire is
very narrow, less than 200 nm, and there might be 8 levels, the
capacitance to substrate (ground) is quite low.
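
Del's point about aggressor wires can be sketched numerically. This is a hedged back-of-envelope model of the worst-case (Miller-style) coupling factor of 2, with made-up capacitance values, not real process data:

```python
# Rough model of crosstalk on a victim wire: the switched capacitance the
# driver must charge depends on what the adjacent ("aggressor") wire does.
# The capacitance values below are illustrative, not real process numbers.

def effective_cap(c_ground, c_couple, aggressor):
    """Effective switched capacitance for one victim transition.

    aggressor: 'quiet'    -> coupling cap counts once (k = 1)
               'same'     -> neighbor switches the same way (k = 0)
               'opposite' -> neighbor switches the other way (k = 2),
                             i.e. "moving the aggressor doubles the effect"
    """
    k = {"quiet": 1.0, "same": 0.0, "opposite": 2.0}[aggressor]
    return c_ground + k * c_couple

# With narrow, tall wires most of the capacitance is to the neighbors,
# and very little to the distant substrate:
c_gnd, c_cpl = 0.2, 1.0  # arbitrary units
for mode in ("same", "quiet", "opposite"):
    print(mode, effective_cap(c_gnd, c_cpl, mode))
```

With coupling dominating the ground term, the opposite-switching aggressor roughly doubles the delay relative to a quiet neighbor, which is the effect described above.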

del cecchi


Robert Myers

May 7, 2004, 11:46:48 PM
del cecchi wrote:

<snip>

>
> The arstechnica article was pretty specific about signal integrity from
> coupling and stuff being a problem, and I didn't see any mention of soft
> errors. They really aren't much of a problem in logic, although latch
> design needs to be careful. And Arrays just have to waste a little
> space to keep the Qcrit up or put in ECC, which they should have anyway.
>

You have a bundle of experience at your disposal that I don't.
Everybody seems to be experiencing unpleasant surprises moving to 90nm.
I don't have a way of evaluating whether the unpleasant surprises
people are experiencing are what should be truly regarded as surprises
or just another day at the office for people working at the smallest
scales in production.

I do have some experience with how creative people can be in making
explanations when things start to go wrong. :-).

RM

Douglas Siebert

May 8, 2004, 3:50:49 AM
Robert Myers <rmyer...@comcast.net> writes:

>Douglas Siebert wrote:

><snip>

>>
>> Whatever anyone may say about AMD's impact on the marketplace in terms
>> of market share, they surely seem to be having an effect on the market
>> in terms of making Intel dance so far in 2004!
>>

>AMD gets the credit, or physics?


Depends on how you look at it. Physics played a role, but if AMD were
not as strong a second fiddle as they are right now, and their best
effort was only, say, a 2400+ or so, then Intel wouldn't have any need to
push so hard for faster stuff. They could just add 100MHz every quarter
and keep well ahead of AMD and not worry about what happens when they
hit 4 or 5 GHz. So yes, I think AMD plays a part in this, by managing
to not have bankrupted themselves as many expected would happen in the
pre-Athlon days. Just like if AMD hadn't done a 64 bit part, Intel would
never have done a 64 bit x86 extension, and would have successfully
pushed everyone to IA64 over the next five years just by Moore's Law
expanding the minimum DRAM beyond the limit of a 32 bit OS on x86.

Look at it compared to MS. They upgrade the heck out of their software
when there are competitors. They may not produce quality, but they are
damn good at quantity in terms of features! Once the competition is
gone, it is just in maintenance mode. Compare all the upgrades to IE
back when Netscape was a threat, versus the last few years when they are
a tired brand name barely remembered by the masses.

Stephen Sprunk

May 8, 2004, 4:51:12 AM
"Robert Myers" <rmyer...@comcast.net> wrote in message
news:2AVmc.1030$iF6.152879@attbi_s02...
> INTC up $0.49 on the day. Guess the news wasn't as bad as the Street
> expected, or it hasn't yet really figured out what's going on.

I see it as good news: Intel has reduced the number of parallel development
teams, which should result in higher EPS -- assuming they can get dual-core
and 64-bit PM chips out before the P4 line comes to an end.

Innovation is usually a positive, but in the wake of the long-running
Itanium fiasco and mounting scaling problems with NetBurst, the Street has
to be pushing for Intel to start following the strategies of other companies
(e.g. AMD and IBM) that have had more success.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Yousuf Khan

May 8, 2004, 11:58:19 AM
Robert Myers wrote:
> INTC up $0.49 on the day. Guess the news wasn't as bad as the
> Street expected, or it hasn't yet really figured out what's going
> on.

Likely the latter.

Yousuf Khan

Klaus Fehrle

May 8, 2004, 12:19:31 PM
"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:c7gt91$e08$1...@narsil.avalon.net...


<snip>


> The talk now is of dual Pentium M cores on desktop CPUs by the end of
> next year. That makes sense, they are low power enough and small enough
> to work well.

Well, for the timeframe it would make sense if they started working on it
quite a while ago. If they only begin now, end of next year sounds like a
very, very ambitious target, even just for a dual-core Dothan design. If
Intel intended to implement 64-bit capabilities and a memory controller, it
would appear nothing short of impossible to have it ready by the end of
next year, let alone manufacturable in volume.

> One wonders what will become of hyperthreading? With the
> ability to do dual core CPUs in 90nm and probably quad core in 65nm, it
> doesn't look like there will be much need for all the work that would be
> required to retrofit Pentium M for HT, not unless they can get a larger
> performance benefit than HT was for the P4.

Multicore CPUs allow for better parallelization than Hyperthreading, without
its downsides.
As you said, no need for all the work.

> Doesn't seem to be any word
> yet on what will happen to the BTX form factor they were pushing to manage
> all the heat output from Prescott and especially Tejas, and the migration
> to socket 775 they are attempting to jumpstart soon. I'll bet it'll be
> another dead end like socket 423.

Well, as socket 775 and Prescott are all they have for the next two years,
that would make for just an average lifetime for an Intel platform.

> Whatever anyone may say about AMD's impact on the marketplace in terms
> of market share, they surely seem to be having an effect on the market
> in terms of making Intel dance so far in 2004!

In 2004 it will only be a slow waltz for Intel to dance.
In 2005 it will be cha-cha-cha: nothing much impressive in market-share
terms, but in terms of ASPs and earnings.
In 2006 the market-share foxtrot will be played, when Fab 30 is at capacity
on one or two mature 90nm processes. The pace of this dance will
accelerate while Fab 36 is ramping.
In 2007, Fab 36 could already be at a capacity of 5000 300mm wafer starts
per week. Intel had better have a competitive design again by then;
otherwise the tango will be the dance.

KF


Mike Haertel

May 8, 2004, 3:11:47 PM
On 2004-05-08, Stephen Sprunk <ste...@sprunk.org> wrote:
> Intel has reduced the number of parallel development
> teams, which should result in higher EPS

The cost of an extra design team here and there is just a drop
in Intel's bucket of $$$$, lost in the noise compared to their
infrastructure and production costs.

Douglas Siebert

May 8, 2004, 4:24:48 PM
"Klaus Fehrle" <nos...@t-online.de> writes:

>> The talk now is of dual Pentium M cores on desktop CPUs by the end of
>> next year. That makes sense, they are low power enough and small enough
>> to work well.

>Well, for the timeframe it would make sense if they started working on it
>quite a while ago. If they only begin now, end of next year sounds like a
>very, very ambitious target, even just for a Dual-Core Dothan design. If
>Intel intended to implement 64-bit capabilities and a memory controller,
>it would appear nothing short of impossible to have it ready by end of
>next year. Let alone manufacturable for volume.


That depends on how long ago they started on it. The Inquirer has been
reporting rumors of a 64 bit Pentium M skunkworks project since the
beginning of the year, and if true, who knows how long they have been
working on it. I
doubt Intel would announce such a major change in strategy without having
some fairly good ideas of the timelines involved and some initial work
done to prove them. There are always some delays you don't plan on, but
that's a problem hardly limited to Intel.

I haven't heard anything remotely concrete about an on-die memory
controller for Intel, other than claims about it being the real reason
for the 775 pins in the new socket. For the dual Pentium M, there's no
reason they'd have to do that, certainly not in their first iteration.
After all, Pentium Ms run pretty well with 1600 MB/s FSB in today's
laptops. Give them the 6.4 GB/s FSB in today's P4s, or by the end of
2005 8 GB/s or even 9.6 GB/s for DDR2-533 or DDR2-667, and I think those
two cores, even if they were twice as fast as today's top end Pentium
Ms, would be quite well fed memory wise. They don't need the bandwidth
a P4 does since they don't have the long pipeline and extra high clock
rates.
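
The bandwidth figures above all come from the same arithmetic: peak FSB bandwidth is transfers per second times the 8-byte (64-bit) bus width. A minimal sketch, taking the transfer rates quoted in this post as assumptions:

```python
# Peak FSB bandwidth = transfers/second x bytes per transfer.
# The P4-era FSB is 64 bits (8 bytes) wide; a quoted "800 MHz" bus means
# 800 MT/s on a quad-pumped 200 MHz clock.

def fsb_bandwidth_mb_s(mega_transfers_per_s, width_bytes=8):
    """Peak bandwidth in MB/s for a bus of the given width."""
    return mega_transfers_per_s * width_bytes

print(fsb_bandwidth_mb_s(200))   # 1600 MB/s, the Pentium M figure quoted above
print(fsb_bandwidth_mb_s(800))   # 6400 MB/s, the "6.4 GB/s" P4 bus
print(fsb_bandwidth_mb_s(1200))  # 9600 MB/s, the hoped-for "9.6 GB/s"
```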

I think on die memory controllers get more interesting for Intel if/when
they get FBDRAM going to allow for plenty of memory channels for those
quad core CPUs they will probably be making in 65nm, without needing 2000
pin packages. And of course if they push FBDRAM I think AMD would surely
follow on that, if they don't end up explicitly partnering with Intel and
others to make it happen.

Felger Carbon

May 8, 2004, 5:04:01 PM
"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:c7jfmg$ah0$1...@narsil.avalon.net...

>
> I haven't heard anything remotely concrete about an on-die memory
> controller for Intel, other than claims about it being the real reason
> for the 775 pins in the new socket. For the dual Pentium M, there's no
> reason they'd have to do that, certainly not in their first iteration.
> After all, Pentium Ms run pretty well with 1600 MB/s FSB in today's
> laptops. Give them the 6.4 GB/s FSB in today's P4s, or by the end of
> 2005 8 GB/s or even 9.6 GB/s for DDR2-533 or DDR2-667, and I think those
> two cores, even if they were twice as fast as today's top end Pentium
> Ms, would be quite well fed memory wise. They don't need the bandwidth
> a P4 does since they don't have the long pipeline and extra high clock
> rates.

I believe the main point of on-die memory controllers is reduced
latency, not improved bandwidth.


> I think on die memory controllers get more interesting for Intel if/when
> they get FBDRAM going to allow for plenty of memory channels for those
> quad core CPUs they will probably be making in 65nm, without needing 2000
> pin packages.

You seem to believe that each core on the quad-core chip will have an
independent memory controller/channel. While I have no definite
information to the contrary, it seems unlikely. Am I missing
something here?


Stephen Sprunk

May 8, 2004, 5:23:28 PM
"Felger Carbon" <fms...@jfoops.net> wrote in message
news:5Xbnc.12569$Hs1....@newsread2.news.pas.earthlink.net...

> "Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
> news:c7jfmg$ah0$1...@narsil.avalon.net...
> > I think on die memory controllers get more interesting for Intel
> > if/when they get FBDRAM going to allow for plenty of memory channels
> > for those quad core CPUs they will probably be making in 65nm, without
> > needing 2000 pin packages.
>
> You seem to believe that each core on the quad-core chip will have an
> independent memory controller/channel. While I have no definite
> information to the contrary, it seems unlikely. Am I missing
> something here?

AMD's K8 sports one memory controller with two core interfaces. Given
Intel's new strategy of copying AMD, that's probably what the PM will end up
with :-)

Having one controller per core also implies NUMA within a single chip, which
must bring "interesting" performance implications.

Google doesn't turn up much on FBDRAM, but it appears it'll have the same
pincount as DDR2, so I doubt we'll be getting past two channels (per chip)
any time soon, regardless of how many cores we can cram into a die.

Andy Glew

May 8, 2004, 5:55:23 PM

"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:c7gt91$e08$1...@narsil.avalon.net...

> Intel confirmed information The Inquirer had written several months ago,
> they are cancelling the P4 (effective with the next rev that was supposed
> to be out in 2005, Tejas) in favor of their P6 architecture based Pentium
> M core, for both desktops and servers.

Although, as one of the P6 architects, I might be happy to hear this,
I am not sure that all of these extrapolations are true.

The Tejas project has been in trouble for years - IMHO beginning with
when they decided not to make an out-of-order x86 chip that could also
run Itanium code (by converting the Itanium VLIW to uops that could run
OOO). Tejas lost people all over the place, such as McDermott, who
set up the Intel Austin facility, Sprangle (who left Texas for Intel Oregon,
I'm guessing when it became apparent that Tejas was not going to do
any new microarchitecture), Brad Burgess and Doug Beard (now at AMD).
I think even Marvin Denman is gone.

Also, it became obvious that Intel's Texas Design Center was not going to be
the next major processor group, when Intel acquired the Massachusetts
Alpha 21464 group that is now doing Tanglewood/Tukwila. Given the sorry
state of Itanic, I wonder when we will hear an announcement involving THAT
group.

Prescott's problems may just have been the last straw leading to the
cancellation of Tejas.

I'm not so sure that we should rule the Pentium 4 microarchitecture dead
yet, though. Or, at least, some of its key ideas are still valid: mainly,
eliminate unnecessary logic to make things run fast. Personally, I think
the aggressive circuit stuff for the fireball was overkill.

The Willamette/Pentium 4 microarchitecture, IMHO, had some good ideas
mixed up with a whole slew of bad ones. Its main badness was
design-by-committee. Many people I know at Intel said that Willamette was
a failure by the CPU architects, rescued by heroic circuit and process
engineers.

However, it would be a pity if the good ideas were dragged down by the
bad ones.

===

Moreover, exercising my paranoia: I'm not so sure that I want *MY* employer
to draw unjustified conclusions from the cancellation of Tejas.


Stephen Sprunk

May 8, 2004, 6:27:38 PM
"Andy Glew" <glew2pub...@sbcglobal.net> wrote in message
news:fHcnc.46320$dJ3....@newssvr29.news.prodigy.com...

> The Willamette/Pentium 4 microarchitecture, IMHO, had some good ideas
> mixed up with a whole slew of bad ones. Its main badness was
> design-by-committee. Many people I know at Intel said that Willamette was
> a failure by the CPU architects, rescued by heroic circuit and process
> engineers.
>
> However, it would be a pity if the good ideas were dragged down by the bad
> ones.

Is the trace cache something that (a) could and (b) should be retrofitted
onto the PM core, or is its decoder fast enough not to affect the critical
path? That's one of the few ideas in the P4 core I thought had a lot of
promise...

Yousuf Khan

May 8, 2004, 7:21:39 PM
Douglas Siebert wrote:

> Robert Myers <rmyer...@comcast.net> writes:
>> AMD gets the credit, or physics?
>
> Depends on how you look at it. Physics played a role, but if AMD was
> not as strong of a second fiddle as they are right now, and their best
> effort was only say a 2400+ or so, then Intel wouldn't have any need
> to push so hard for faster stuff. They could just add 100MHz every
> quarter and keep well ahead of AMD and not worry about what happens
> when they hit 4 or 5 GHz. So yes, I think AMD plays a part in this,
> by managing to not have bankrupted themselves as many expected would
> happen in the pre-Athlon days. Just like if AMD hadn't done a 64 bit
> part, Intel would never have done a 64 bit x86 extension, and would
> have successfully pushed everyone to IA64 over the next five years
> just by Moore's Law expanding the minimum DRAM beyond the limit of a
> 32 bit OS on x86.

Basically what you're saying is that physics played a role because Intel was
the one who was pursuing faster and faster speeds, therefore they hit the
wall first. :-)

Yousuf Khan


Klaus Fehrle

May 8, 2004, 7:25:42 PM
>I doubt Intel would announce such a major change in strategy without having
> some fairly good ideas of the timelines involved and some initial work
> done to prove them. There are always some delays you don't plan on, but
> that's a problem hardly limited to Intel.

Doug, just my gut feeling: Intel's course-correction announced yesterday
sounds more like an(other) emergency plan than like a strategy.

Put it that way: You can always drop the ball - even if you have loads of
money in the bank. ;-)

> I haven't heard anything remotely concrete about an on-die memory
> controller for Intel, other than claims about it being the real reason
> for the 775 pins in the new socket. For the dual Pentium M, there's no
> reason they'd have to do that, certainly not in their first iteration.

Hmm. I am not sure about that - from a performance point of view, that is.

> After all, Pentium Ms run pretty well with 1600 MB/s FSB in today's
> laptops. Give them the 6.4 GB/s FSB in today's P4s, or by the end of
> 2005 8 GB/s or even 9.6 GB/s for DDR2-533 or DDR2-667, and I think those
> two cores, even if they were twice as fast as today's top end Pentium
> Ms, would be quite well fed memory wise. They don't need the bandwidth
> a P4 does since they don't have the long pipeline and extra high clock
> rates.

I completely agree with Felger's comment on that.

> I think on die memory controllers get more interesting for Intel if/when
> they get FBDRAM going to allow for plenty of memory channels for those
> quad core CPUs they will probably be making in 65nm, without needing 2000
> pin packages. And of course if they push FBDRAM I think AMD would surely
> follow on that, if they don't end up explicitly partnering with Intel and
> others to make it happen.

As for 65nm, I won't hold my breath waiting for it. Certainly, the
industry will get there, but not anytime soon from today's viewpoint. As
for future memory specs, I admit this is a somewhat blind spot for me. So
I can only see the track record of Intel's DRAM approaches: RDRAM in the
past, DDR2 in the present, looking like a non-starter. FBDRAM? Yeah,
sure, if it's worthwhile doing it I am confident AMD will implement it.
Follow??? Rather, leading the way to it.

KF


Andi Kleen

May 8, 2004, 8:31:23 PM
"Stephen Sprunk" <ste...@sprunk.org> writes:

> Is the trace cache something that (a) could and (b) should be retrofitted
> onto the PM core, or is its decoder fast enough not to affect the critical
> path? That's one of the few ideas in the P4 core I thought had a lot of
> promise...

One thing that I always found strange about the P4 trace cache is that
it was reversing the trend towards bigger caches. Normally software
code gets more bloated and needs bigger icaches and gets them eventually.

But the trace cache is a lot smaller than a more conventional icache,
probably because it is much less die-efficient. E.g. compare the 12k-entry
P4 trace cache to the 64K L1 icache of the K7/K8. Assuming an average
length of 3 bytes/instruction, the 64K cache could in theory hold ~21k
instructions, which is nearly twice as much.
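
The capacity comparison above is simple arithmetic. A sketch with the figures quoted in the thread (the 3 bytes/instruction average is the assumption under debate further down):

```python
# Capacity comparison: the K7/K8's 64 KB L1 icache holds raw x86 bytes,
# while the P4 trace cache holds ~12k decoded uop entries.
TRACE_CACHE_UOPS = 12 * 1024   # 12k entries
ICACHE_BYTES = 64 * 1024       # 64 KB
AVG_INSN_BYTES = 3             # assumed average x86 instruction length

icache_insns = ICACHE_BYTES // AVG_INSN_BYTES
print(icache_insns)                               # ~21845 instructions
print(round(icache_insns / TRACE_CACHE_UOPS, 2))  # ~1.78x, "nearly twice as much"
```

Note the comparison ignores that a trace cache entry is a decoded uop, not an x86 instruction, so the two counts are not strictly commensurable.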

Now of course a lot of software will thrash even an 64K icache,
because they do not have a small inner loop. The only cache that has
any chance holding these codes is the big L2 or L3 cache. This means
you need an fast L2/L3 decoder anyways to perform well on these.

Given that requirement, is it really that useful to have the trace cache
compared to a big L1/L2 with decoding hints? When you spend a lot of
transistors to make the decoder fast, aren't the transistors spent on the
rather die-inefficient trace cache then wasted?

I understand that an x86 decoder is a complex beast and likely to contain
frequency-limiting speed paths, while the trace cache may have this
problem less. But I see no way around having a fast decoder to work well
on bloated software.

It will be interesting to see how big the icache or trace cache of the
P-M based Prescott successor will be.

-Andi

Stephen Sprunk

May 8, 2004, 10:07:14 PM
"Andi Kleen" <fre...@alancoxonachip.com> wrote in message
news:m3brkym...@averell.firstfloor.org...

> "Stephen Sprunk" <ste...@sprunk.org> writes:
> > Is the trace cache something that (a) could and (b) should be retrofitted
> > onto the PM core, or is its decoder fast enough not to affect the critical
> > path? That's one of the few ideas in the P4 core I thought had a lot of
> > promise...
>
> One thing that I always found strange about the P4 trace cache is that
> it was reversing the trend towards bigger caches. Normally software
> code gets more bloated and needs bigger icaches and gets them eventually.

Smaller caches provide lower latency, which was supposedly the justification
for the anemic L1D and trace caches in Willamette/Northwood. Prescott
doubles the L1D size and the associativity, but at the cost of increasing
the latency from one (two?) cycles to four cycles. Based on performance
results to date, this appears to be a wash.

> But the trace cache is a lot smaller than a more conventional icache,
> probably because it is much less die efficient. e.g. compare the 12k
> entry P4 trace cache to the 64K l1 icache of K7/K8. Assuming an
> average length of 3 bytes/instruction the 64K cache could in theory
> hold ~21k instructions, which is nearly twice as much.

Do the P4 trace cache and the K8 L1 icache take the same amount of die space
or transistors? Got to keep something constant if we're going to compare...

> Now of course a lot of software will thrash even an 64K icache,
> because they do not have a small inner loop. The only cache that has
> any chance holding these codes is the big L2 or L3 cache. This means
> you need an fast L2/L3 decoder anyways to perform well on these.

On cache-busting applications, there's no easy solution; the decoders will
always be in the critical path. However, for applications that DO fit in
the icache, taking the decoders out of the critical path seems like it could
reduce the pipeline length (and thus branch penalties, etc.) by a couple of
stages. And since you need an L1 cache of some sort to hold the results of
those fast decoders anyway, why not store the instructions as uops instead
of x86 instructions?

> It will be interesting to see how big the icache or trace cache of the
> P-M based Prescott successor will be.

Indeed.

Samuel

May 8, 2004, 11:01:38 PM
NetBUST is dead... YAY!!!

Another screw-up added to Intel's list: Rambus, IA-64, and not going with
64-bit x86.

"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:c7gt91$e08$1...@narsil.avalon.net...

> Intel confirmed information The Inquirer had written several months ago,
> they are cancelling the P4 (effective with the next rev that was supposed
> to be out in 2005, Tejas) in favor of their P6 architecture based Pentium
> M core, for both desktops and servers.

Does anyone have the performance numbers of the Pentium M? It just doesn't
make sense to me to use the same core that is meant for low-end mobile
computing in high-end server applications, unless they no longer care
about SPECint numbers and focus on TPC-C numbers, much like IBM and Sun;
that way they can connect 4 cores and above and support SMT per core (yes,
it's called SMT, not HyperThreading). If that is the case, what is going
to happen to technical applications and games that need the powerful
SPECint performance when everyone goes the other route?

This change is really a big deal. One of Intel's strengths was frequency,
and they used it very well as a marketing tool. One of the reasons the
PowerPC didn't win in the desktop market was the fact that it couldn't
keep up with the Pentiums in frequency, and thus performance. Now Intel
moving to a low-power multi-core approach just levels the field for other
processors to compete better; in fact, other processors are already ahead
in the game of designing chips for multi-threading, like AMD's dual-core
K8 and K9, IBM's POWER4, 5, and 6, and Sun's Rock and Niagara.

What I'm wondering right now is how on earth a dual-core Pentium M
processor can beat a dual-core K8 or K9.

Greg Lindahl

May 9, 2004, 12:18:45 AM
In article <m3brkym...@averell.firstfloor.org>,
Andi Kleen <fre...@alancoxonachip.com> wrote:

>But the trace cache is a lot smaller than a more conventional icache,
>probably because it is much less die efficient. e.g. compare the 12k
>entry P4 trace cache to the 64K l1 icache of K7/K8. Assuming an
>average length of 3 bytes/instruction the 64K cache could in theory
>hold ~21k instructions, which is nearly twice as much.

I don't think that's a good assumption. PathScale's compiler is the
best compiler for AMD64, and while I don't have a simulator in hand to
tell you the actual data for a benchmark like SPEC, from staring at a
lot of floating point code, I think our average is around 5
bytes/instruction -- remember that 64 bit instructions often have an
extra byte. That means that on occasion, even the aggressive K8
decoder can't issue 3 instructions in a cycle, because it can't get
enough bytes.

A trace cache never has that problem. And if it wastes some
transistors, who cares as long as it doesn't limit cycle time?
And it's decoupled from the instruction parser... so it's not
like there's any weird complexity increase...

There are also other benefits to a trace cache, such as the potential
for smaller branch bubbles. Isn't the P4 better than K8 in that area?

-- greg
(disclaimer: I work for PathScale, but don't speak for them.)

Nick Maclaren

May 9, 2004, 5:36:03 AM
In article <c7jfmg$ah0$1...@narsil.avalon.net>,

Douglas Siebert <dsie...@excisethis.khamsin.net> wrote:
>
>That depends on how long ago they started on it. The Inquirer had rumors
>of a 64 bit Pentium M skunkworks project since the beginning of the year,
>and if true, who knows how long they would have been working on it. I
>doubt Intel would announce such a major change in strategy without having
>some fairly good ideas of the timelines involved and some initial work
>done to prove them. There are always some delays you don't plan on, but
>that's a problem hardly limited to Intel.

And some of us were hypothesising such developments a year before that.
If I, as a complete outsider and not even a hardware person, can do
the relevant sums, I am absolutely sure that Intel could. Unless the
management were COMPLETELY incompetent, they would have realised that
there was a significant chance of the current problems NOT being
soluble (in time, effectively, etc.) and would have set up a backup
scheme.

If this were a sweepstake, I would bet on 1Q03 for the start.


Regards,
Nick Maclaren.

Andi Kleen

May 9, 2004, 5:35:47 AM
lin...@pbm.com (Greg Lindahl) writes:

> In article <m3brkym...@averell.firstfloor.org>,
> Andi Kleen <fre...@alancoxonachip.com> wrote:
>
>>But the trace cache is a lot smaller than a more conventional icache,
>>probably because it is much less die efficient. e.g. compare the 12k
>>entry P4 trace cache to the 64K l1 icache of K7/K8. Assuming an
>>average length of 3 bytes/instruction the 64K cache could in theory
>>hold ~21k instructions, which is nearly twice as much.
>
> I don't think that's a good assumption. PathScale's compiler is the
> best compiler for AMD64, and while I don't have a simulator in hand to
> tell you the actual data for a benchmark like SPEC, from staring at a
> lot of floating point code, I think our average is around 5
> bytes/instruction -- remember that 64 bit instructions often have an

I was talking about 32bit integer code. 64bit floating point code
is totally different because it uses SSE2, which is much bigger
than normal x86 instructions. For x87 code it is true too.

But it is an interesting theory. Did Intel add the trace
cache because it was the only way to get their SSE2 FPU
fed quickly enough?

> extra byte. That means that on occasion, even the aggressive K8
> decoder can't issue 3 instructions in a cycle, because it can't get
> enough bytes.

I just ran some quick statistics on my gcc 3.2 generated 32bit
/usr/bin, and it gives on average 3.356 bytes. So my number was not
too far off for 32bit x86. Of course that is with x87 and not
particularly FP intensive.
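As a back-of-envelope check of what that number implies for the trace
cache comparison (assumptions: one trace-cache entry holds one uop, and
roughly one uop per x86 instruction on average, both of which are only
approximations):

```python
# Back-of-envelope capacity comparison between a conventional icache
# and the P4 trace cache, under the stated assumptions.
ICACHE_BYTES = 64 * 1024       # K7/K8 L1 icache
AVG_INSN_BYTES = 3.356         # measured over a gcc 3.2 compiled 32bit /usr/bin
TRACE_CACHE_UOPS = 12 * 1024   # P4 trace cache entries

icache_insns = ICACHE_BYTES / AVG_INSN_BYTES
print(f"64K icache holds ~{icache_insns:.0f} instructions")
print(f"that is {icache_insns / TRACE_CACHE_UOPS:.2f}x the 12k-uop trace cache")
```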

x86-64 is different because of the REX prefix bytes, but it has on
average fewer instructions for a given C function because of the extra
registers and less spill code, which offsets this. Overall the code
length for a given C function is usually in the same league as
32bit with SSE2 [all this with gcc; I don't know how your compiler or the
Microsoft compiler do ..., still waiting for the GPL release of yours]

I mention SSE2 as a special case because 64bit code using SSE2 is bigger
than 32bit code using x87, simply because SSE2 encodings are a lot bigger
than FP stack code. Modern optimized 32bit code will use SSE2
anyway, but shipping production code often does not, because running
on CPUs without SSE2 support is still important. 64bit code always
uses SSE2. Simple comparisons can be misleading.
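To make that size difference concrete, here is a small sketch with
hand-assembled reg-reg encodings of an FP add in its x87 and SSE2 forms
(hedged: bytes taken from the standard opcode maps; real compiler output
mixes in memory operands and other prefixes, so treat these as best cases):

```python
# Hand-assembled reg-reg encodings from the standard x86 opcode maps.
encodings = {
    "fadd st, st(1)":   bytes.fromhex("d8c1"),       # x87: 2 bytes
    "addsd xmm0, xmm1": bytes.fromhex("f20f58c1"),   # SSE2: 4 bytes
    "addsd xmm8, xmm9": bytes.fromhex("f2450f58c1"), # SSE2 + REX: 5 bytes
}
for insn, enc in encodings.items():
    print(f"{insn:18} {enc.hex()}  ({len(enc)} bytes)")
```

So an SSE2 add is 2-2.5x the size of the x87 form even before memory
operands, which is the effect described above.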

It also depends on whether the code is optimized for K7/K8/P3/P-M or
P4. P4 optimized code can be shorter because the trace cache does not
need much extra alignment, unlike the other CPUs (as long as you do
not thrash it). It can also use shorter function prologues/epilogues,
because it can execute lots of pushes and pops in parallel instead
of requiring the code-size-wasting tricks the Opteron needs for this
to be fast.

On the other hand this all assumes that your code actually usually
hits the trace cache, which may not be the case in a lot of software.

If your code thrashes the trace cache it is possible that the P4 even
needs a different code generation strategy than what is recommended in
the Intel optimization guide, with more branch target alignment and
different function prologues. Would be interesting to benchmark this
out a bit.

> A trace cache never has that problem. And if it wastes some
> transistors, who cares as long as it doesn't limit cycle time?
> And it's decoupled from the instruction parser... so it's not
> like there's any weird complexity increase...

My point was that you need to have the fast decoder for "modern"
bloated software anyways; so why bother with the trace cache too?

> There are also other benefits to a trace cache, such as the potential
> for smaller branch bubbles. Isn't the P4 better than K8 in that area?

Hmm, let's see. The Opteron optimization guide says 1 cycle latency
for a fully predicted branch. A trace cache hit is equivalent to
"fully predicted" right? Otherwise it is not too likely for the target
to be in the trace cache, except for very small codes.

I don't have the number for a predicted branch for P4, but it is
unlikely that it is better than 1 cycle latency.

If by branch bubbles you mean mispredicted branches, then I do not
have any numbers; but I assume that a trace cache will not help
much with those anyway.

What I know is that the P4 gets *extremely* slow when it has to flush
the trace cache and replay the instruction stream, and it does this
far too often :-(

-Andi

Andi Kleen

May 9, 2004, 6:07:17 AM
Andi Kleen <fre...@alancoxonachip.com> writes:
>
> I just ran some quick statistics on my gcc 3.2 generated 32bit
> /usr/bin, and it gives on average 3.356 bytes. So my number was not
> too far off for 32bit x86. Of course that is with x87 and not
> particularly FP intensive.
>
> x86-64 is different because of the REX prefix bytes, but has on
> average less instructions for a given C function because of the more
> registers and less spill code, which offsets this. Overall the code
> length for a given C function are usually in the same league compared
> to 32bit with SSE2 [all this with gcc; i don't know how your compiler or the
> Microsoft compiler do ..., still waiting for the GPL release of yours]

Addendum: for a gcc 3.3 compiled 64bit /usr/bin it is 3.481 bytes average.
So roughly comparable.

-Andi

Nick Maclaren

May 9, 2004, 6:17:51 AM
In article <slrnc9qc7j...@ducky.net>,

Yes, indeed, if you mean the earlier stages. But, if you mean the
final stages, that is not so. Intel can't afford to put an indefinite
number of designs through chipset integration, validation and all that.
I don't know which Stephen Sprunk meant.

This was the whole lunacy about "Yamhill". The interesting question
never was whether there was a 64-bit extension project (we know there
were several), but whether any had got the "go ahead" for the later
and more expensive stages of development.


Regards,
Nick Maclaren.

ando_san

May 9, 2004, 10:03:27 AM
Dear Andy,

Where did Eric go? This is a personal question. You may reply to my
address.

Best Regards,

H.Ando


"Andy Glew" <glew2pub...@sbcglobal.net> wrote in message
news:fHcnc.46320$dJ3....@newssvr29.news.prodigy.com...
>

Bengt Larsson

May 9, 2004, 12:10:27 PM
"Samuel" <sam...@austin.rr.com> wrote:

>NetBUST is dead... YAY!!!
>
>Another screw up added to Intel's list RamBus, IA-64 and not going with 64
>bit X86.
>

>...

Just a quiet question: why all the Intel hatred?

Yousuf Khan

May 9, 2004, 1:20:35 PM
Stephen Sprunk wrote:
> On cache-busting applications, there's no easy solution; the decoders
> will always be in the critical path. However, for applications that
> DO fit in the icache, taking the decoders out of the critical path
> seems like it could reduce the pipeline length (and thus branch
> penalties, etc) by a couple stages. And, you need an L1 cache of
> some sort to hold the results of those fast decoders anyways, why not
> store the instructions as uops instead of x86 instructions?

How would you eliminate branching even in a trace cache? Isn't branching
just as much of an atomic instruction as any other micro-op?

Yousuf Khan


Yousuf Khan

May 9, 2004, 1:30:42 PM
Samuel <sam...@austin.rr.com> wrote:
> Does anyone have the performance numbers of Pentium M? It just
> doesn't make sense to me using the same core that is meant for low
> end mobile computing in high end server applications, unless they no
> longer care about Spec Int numbers and focus on TPCC numbers, much
> like IBM and Sun, that way they can connect 4 cores and above and
> support SMT per core (yes it's called SMT, not HyperThreading). If
> that is the case what is going to happen to technical applications
> and games that need the Powerfull spec int performance when everyone
> go the other route?

In general, it would seem to me that server chips in the x86 world are just
bigger-cache versions of desktop and mobile chips. In fact, in the low-power
blade server world, the Pentium M is already used.

> This change is really a big deal, one of Intel's strength was
> frequency and they used it well as a marketing tool so well. One of
> the reasons that the PowerPC didn't win in the desktop market WAS the
> fact that it could's keep up with frequency against the Pentiums,
> thus performance. Now with Intel going to more Low power multi core
> approach just levels the field for other processors to compete
> better, in fact, other processors already ahead in the game of
> designing chips for multi-threading like AMD's Dual Core K8 and K9,
> IBM's Power4, 5, 6 and Sun's Rock and Niagra.

Actually the main reason that PowerPC didn't win in the desktop market was
not performance, but that it simply didn't run x86 software, and
especially that it was cubbyholed into the small Macintosh world. It
wouldn't have mattered if PowerPC had been an order of magnitude faster than
any x86 chip; the great mass of software was concentrated in the x86 world,
and unless PowerPC could run that stuff, it wasn't going anywhere.

> What I'm wondering right now is how on earth a dual core Pentium M
> processor can beat a dual core K8 and K9?

Unless Pentium-M grows an onboard memory controller in a few months, the
only option Intel has to hope to stay on pace with AMD is to add tons of L2
cache. Of course AMD can do the same, but it can afford to add less cache
than Intel to stay on par, due to its onboard memory controller.

Yousuf Khan


Yousuf Khan

May 9, 2004, 1:40:44 PM
Andi Kleen <fre...@alancoxonachip.com> wrote:
> I was talking about 32bit integer code. 64bit floating point code
> is totally different because it uses SSE2, which is much bigger
> than normal x86 instructions. For x87 code it is true too.
>
> But it is an interesting theory. Did Intel add the trace
> cache because it was the only way to get their SSE2 FPU
> fed quickly enough?

Those 128-bit registers of SSE and those 80-bit registers of x87 are fed by
the D-cache aren't they? So why should SSE or x87 FPU instructions have a
larger footprint in an instruction or trace cache?

BTW, is there an officially accepted general term to describe either a trace
cache or an instruction cache? I would think just calling them both
instruction caches should be sufficient?

>> extra byte. That means that on occasion, even the aggressive K8
>> decoder can't issue 3 instructions in a cycle, because it can't get
>> enough bytes.
>
> I just ran some quick statistics on my gcc 3.2 generated 32bit
> /usr/bin, and it gives on average 3.356 bytes. So my number was not
> too far off for 32bit x86. Of course that is with x87 and not
> particularly FP intensive.
>
> x86-64 is different because of the REX prefix bytes, but has on
> average less instructions for a given C function because of the more
> registers and less spill code, which offsets this. Overall the code
> length for a given C function are usually in the same league compared
> to 32bit with SSE2 [all this with gcc; i don't know how your compiler
> or the Microsoft compiler do ..., still waiting for the GPL release
> of yours]

I can see the extra registers of AMD64 would likely reduce instruction
lengths, because you eliminate that bad habit in x86 code of doing
operations directly in memory rather than in a register because of a lack of
available registers.

Yousuf Khan


Terje Mathisen

May 9, 2004, 2:23:32 PM
Yousuf Khan wrote:

> Andi Kleen <fre...@alancoxonachip.com> wrote:
>>x86-64 is different because of the REX prefix bytes, but has on
>>average less instructions for a given C function because of the more
>>registers and less spill code, which offsets this. Overall the code
>>length for a given C function are usually in the same league compared
>>to 32bit with SSE2 [all this with gcc; i don't know how your compiler
>>or the Microsoft compiler do ..., still waiting for the GPL release
>>of yours]
>
> I can see the extra registers of AMD64 would likely reduce instruction
> lengths, because you eliminate that bad habit in x86 code of doing
> operations directly in memory rather than in a register because of a lack of
> available registers.

Using x86 operate_from_mem style instructions is almost always fine: it
saves registers, instructions and code space.

Read-modify-write is a totally different case. :-(
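A concrete illustration of the code-space point (hedged: hand-assembled
32bit encodings, and the register choices here are arbitrary examples):

```python
# One load-execute instruction vs. a separate load plus reg-to-reg add.
load_execute = bytes.fromhex("0303")      # add eax, [ebx]
separate     = bytes.fromhex("8b0b01c8")  # mov ecx, [ebx]; add eax, ecx
print(len(load_execute), "bytes vs", len(separate), "bytes")
```

The load-execute form also avoids tying up the scratch register (ecx
here), which is the register-saving half of the argument.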

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Daniel Gustafsson

May 9, 2004, 2:25:05 PM
"Samuel" <sam...@austin.rr.com> wrote in message news:<mahnc.71878$NR5....@fe1.texas.rr.com>...

> "Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
> news:c7gt91$e08$1...@narsil.avalon.net...
> > Intel confirmed information The Inquirer had written several months ago,
> > they are cancelling the P4 (effective with the next rev that was supposed
> > to be out in 2005, Tejas) in favor of their P6 architecture based Pentium
> > M core, for both desktops and servers.
>
> Does anyone have the performance numbers of Pentium M? It just doesn't make
> sense to me using the same core that is meant for low end mobile computing
> in high end server applications, unless they no longer care about Spec Int
> numbers and focus on TPCC numbers, much like IBM and Sun, that way they can
> connect 4 cores and above and support SMT per core (yes it's called SMT, not
> HyperThreading). If that is the case what is going to happen to technical
> applications and games that need the Powerfull spec int performance when
> everyone go the other route?

The Pentium M has similarities with the Pentium 3, and those did not have
bad SPECint numbers. Besides, the current Pentium Ms have such low
power usage that they may currently be clocked down to meet specific power
requirements. Intel may have figured out that by powering up the
current Pentium M cores a bit and giving them a year of development,
they may be very competitive.

Whether they will add SMT to these cores is I think still a secret.

( Although I get what you say, Sun does not focus on TPCC numbers ;) )

Regards
Daniel Gustafsson

Yousuf Khan

May 9, 2004, 3:01:51 PM
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Yousuf Khan wrote:
>> I can see the extra registers of AMD64 would likely reduce
>> instruction lengths, because you eliminate that bad habit in x86
>> code of doing operations directly in memory rather than in a
>> register because of a lack of available registers.
>
> Using x86 operate_from_mem style instructions is almost always fine,
> it saves both registers, instructions and code space.
>
> Read-modify-write is another case totally. :-(

You're right, they shouldn't be much of a problem either. I was actually
thinking of those x86 operand-embedded-in-instruction style instructions.

Yousuf Khan


Terje Mathisen

May 9, 2004, 4:03:28 PM
Yousuf Khan wrote:

Huh?

Which opcodes would that be?

Those with immediate data? Implicit registers.

The only thing that comes close afaik would seem to be a couple of the
MMX/SSE permute operations where the actual operation to perform can be
a runtime variable?

Bengt Larsson

May 9, 2004, 5:10:18 PM
Douglas Siebert <dsie...@excisethis.khamsin.net> wrote:

>IA64 aficionados may want to note that The Inquirer mentioned today that
>they are now hearing rumblings that some parts of the IA64 roadmap are
>getting cancelled as well. Guess we'll see if their sources for that
>are as good as their sources for the P4's cancellation were proven to be.

I guess I'm one of those aficionados, sort of. On purely technical
grounds I'd prefer IA-64 over x86-64. As a programmer, I don't look
forward to 20 more years of x86. IA-64 is a bit more modern, more
RISC-like, has more registers and so on.

It's true that Intel would have a monopoly on IA-64 but there are two
comments one can make on that:

1. Intel want to compete with IBM on processors for the high end, and
IBM show no signs of stopping, so there will be competition.

2. If Intel were to go too far and milk their monopoly too much there
is a very simple remedy: force them to license IA-64 to someone. The
Anti-trust remedy writes itself, unlike cases vs. Microsoft and IBM.

I have nothing against PowerPC, but it's domineered/dominated by IBM.
There will never realistically be competition on PowerPC-based systems
vs IBM.

The ideal would be processors from Intel and AMD, systems from other
people (like Dell, HP, IBM...), operating systems from yet other
people (like Linux) etc. All to promote competition.

Samuel

May 9, 2004, 5:23:52 PM

"Bengt Larsson" <bengt...@telia.NOSPAMcom> wrote in message
news:tqls90lb2bs8e6h1i...@text.giganews.com...

>
> Just a quiet question: why all the Intel hatred?

I guess it shows, huh? ;)

Samuel

May 9, 2004, 5:42:35 PM

"Yousuf Khan" <news.tal...@spamgourmet.com> wrote in message
news:6Vtnc.8296$pp....@news04.bloor.is.net.cable.rogers.com...

> Samuel <sam...@austin.rr.com> wrote:
> Actually the main reason that PowerPC didn't win in the desktop market was
> not because of performance, but because it simply didn't run x86 software,

Sure, there is no arguing that, but there was a time when the PowerPC was
lagging WAY behind in frequency and SPECint performance, particularly with
the G3 and the G4. At one time the Pentium 4 was 2-3X the frequency of the
G4, which got Steve Jobs very worried. Then Jobs pulled the G5 development
from Motorola and granted it to IBM.

Anyhow, with IBM clocking the PPC976 @ 3.5 GHz and Intel falling back on a 2
GHz core, I'd say Apple would hardly be worried about frequency wars for the
next couple of years at least. Adding to that, the Power6 core will be a speed
demon, which is a 180 degree shift from the Power4/Power5 approach, so Apple
should be even more comfortable with frequency in the future (assuming IBM
don't screw up with Power6 like Intel screwed up with NetBUST)


Bengt Larsson

May 9, 2004, 5:44:06 PM
"Samuel" <sam...@austin.rr.com> wrote:

Yeah it does. But why?

Samuel

May 9, 2004, 6:04:13 PM

"Daniel Gustafsson" <dan...@mimer.se> wrote in message

> The Pentium M has similarities with Pentium 3 and those did not had
> bad SPECint numbers.

Yes and I was a very good wrestler when I was in High School.

PIIIs were great at some point, probably even before the first Athlon came
out. The PIIIs were competing very well with the K6 and the four-stage-pipelined
G3 PPC. When the Athlon came out it took the lead in SPECint performance
and the PIII cores started to show their age. Today it's a different world:
AMD is doing a great job, and they are ahead with multi-core designs built
on cores that are superior to the good old PIIIs.

I don't really see how the "Back to the Future" approach would make Intel
processors competitive in performance with AMD.

> ( Although I get what you say, Sun does not focus on TPCC numbers ;) )

From reading about Rock, Niagara and the follow-ups, it seems like it's all
they care about now.

>
> Regards
> Daniel Gustafsson


Samuel

May 9, 2004, 6:06:07 PM

"Bengt Larsson" <bengt...@telia.NOSPAMcom> wrote in message
> Yeah it does. But why?

That should be a topic of its own :)

Norbert Juffa

May 9, 2004, 6:14:54 PM

"Andi Kleen" <fre...@alancoxonachip.com> wrote in message news:m37jvmx...@averell.firstfloor.org...

> lin...@pbm.com (Greg Lindahl) writes:
>
> > In article <m3brkym...@averell.firstfloor.org>,
> > Andi Kleen <fre...@alancoxonachip.com> wrote:
> >
> >>But the trace cache is a lot smaller than a more conventional icache,
> >>probably because it is much less die efficient. e.g. compare the 12k
> >>entry P4 trace cache to the 64K l1 icache of K7/K8. Assuming an
> >>average length of 3 bytes/instruction the 64K cache could in theory
> >>hold ~21k instructions, which is nearly twice as much.
> >
> > I don't think that's a good assumption. PathScale's compiler is the
> > best compiler for AMD64, and while I don't have a simulator in hand to
> > tell you the actual data for a benchmark like SPEC, from staring at a
> > lot of floating point code, I think our average is around 5
> > bytes/instruction -- remember that 64 bit instructions often have an
>
> I was talking about 32bit integer code. 64bit floating point code
> is totally different because it uses SSE2, which is much bigger
> than normal x86 instructions. For x87 code it is true too.
[...]

It's not clear to me what "for x87 code it is true too" refers to.
Could you clarify please?

When I last checked several years ago, 32-bit integer code on x86
took up approximately 3.8 bytes/instruction. IIRC, x87 intensive
code had a _shorter_ average instruction length due to its tight
encoding, where one register operand (ST0) is implicit. Also, x87
code in general did not need prefixes (e.g. 0x0f, 0x66 etc). FWIW,
16-bit x86 integer code (ca 1995) ran about 2.8 bytes/instruction.

Of course, average instruction length is somewhat a function of the
instruction mix issued by a particular compiler, e.g. the use of
load-execute instructions versus separate load and reg-to-reg
instructions.

-- Norbert


Bengt Larsson

May 9, 2004, 6:18:00 PM
"Samuel" <sam...@austin.rr.com> wrote:

>Anyhow, with IBM clocking the PPC976 @ 3.5 GHz and Intel falling back on a 2
>GHz core, I see Apple would hardly be worried about frequency wars in the
>next couple of year at least. Adding to that the Power6 core will be a speed
>demon which is a 180 degree shift from the Power4, Power5 approach, Apple
>would be even more comfortable with frequency in the future (assuming IBM
>don't screw up with Power6 like Intel screwed up with NetBUST)

I never bought the reasoning that NetBurst was high-frequency for
marketing. There were technical papers that showed a performance
advantage up to 50 stages (the P4 had 20). If Intel wanted to they
could have marketed NetBurst as capable of adding integers at 6.4 GHz,
but they never bothered. It wouldn't have worked as marketing. It
would have been true, in a sense, but obviously not for any real
applications - and people would have noticed.
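Those papers can be caricatured with a toy model: frequency improves as
the logic is split into more stages, while each mispredicted branch costs
roughly a pipeline's worth of cycles. All parameters below are made-up
illustrations, deliberately chosen so the optimum lands near the 50 stages
quoted above; they are not figures from the papers:

```python
# Toy pipeline-depth model: perf = frequency * IPC.  Frequency scales as
# 1/(logic_delay/depth + latch_overhead); IPC falls because each
# mispredict costs ~depth cycles.  Parameters are purely illustrative.
def relative_perf(depth, logic=25.0, latch=1.0, mispredicts_per_insn=0.01):
    freq = 1.0 / (logic / depth + latch)
    ipc = 1.0 / (1.0 + mispredicts_per_insn * depth)
    return freq * ipc

best = max(range(5, 80), key=relative_perf)
print("best depth under these parameters:", best)  # -> 50
```

The point of the model is only that the optimum sits well above 20 stages
for plausible parameter choices, which is consistent with a deep pipeline
being a technical decision rather than a marketing one.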

AMD bypassed the whole thing with their xxxx+ marketing anyway.

Why people look for conspiracy explanations when there are natural
explanations I will never understand. Are conspiracy explanations more
interesting?

Samuel

May 9, 2004, 6:39:21 PM

"Bengt Larsson" <bengt...@telia.NOSPAMcom> wrote in message
news:jtat90paet5l02li3...@text.giganews.com...

> I never bought the reasoning that NetBurst was high-frequency for
> marketing. There were technical papers that showed a performance
> advantage up to 50 stages (the P4 had 20). If Intel wanted to they
> could have marketed NetBurst as capable of adding integers at 6.4 GHz,
> but they never bothered. It wouldn't have worked as marketing. It
> would have been true, in a sense, but obviously not for any real
> applications - and people would have noticed.

I'm sure frequency was not the ONLY target of the P4 design; they also wanted
to get performance out of driving frequency, and they thought they could scale
frequency better with the P4 and its follow-ups than they could with a P3-like
architecture. However, it's hard to deny that there was an obvious frequency
war in the desktop market, and the average consumer only looks at the
megahertz numbers when shopping for a new PC for Christmas. It seemed to me and to
many people at the time that they were playing this as a marketing tool.

> AMD bypassed the whole thing with their xxxx+ marketing anyway.

That was a marketing genius, wasn't it?

> Why people look for conspiracy explanations when there are natural
> explanations I will never understand. Are conspiracy explanations more
> interesting?

I guess I'm a conspiracy theorist Intel hater. LOL


Bengt Larsson

May 9, 2004, 6:54:56 PM
"Samuel" <sam...@austin.rr.com> wrote:

>"Bengt Larsson" <bengt...@telia.NOSPAMcom> wrote in message
>news:jtat90paet5l02li3...@text.giganews.com...

>I guess I'm a conspiracy theorist Intel hater. LOL

You said it.

Stephen Sprunk

May 9, 2004, 7:17:18 PM
"Bengt Larsson" <bengt...@telia.NOSPAMcom> wrote in message
news:jtat90paet5l02li3...@text.giganews.com...
> I never bought the reasoning that NetBurst was high-frequency for
> marketing. There were technical papers that showed a performance
> advantage up to 50 stages (the P4 had 20). If Intel wanted to they
> could have marketed NetBurst as capable of adding integers at 6.4 GHz,
> but they never bothered. It wouldn't have worked as marketing. It
> would have been true, in a sense, but obviously not for any real
> applications - and people would have noticed.

Did you miss when Intel demonstrated 10GHz ALUs last year? The P4 strategy
was always based on marketing GHz over performance, but their recent
setbacks in increasing clock speed have caused them to kill the entire
product.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Stefan Monnier

May 9, 2004, 7:35:47 PM
> Did you miss when Intel demonstrated 10GHz ALUs last year? The P4 strategy
> was always based on marketing GHz over performance, but their recent
> setbacks in increasing clock speed have caused them to kill the entire
> product.

There's no question that Intel's marketing has played the GHz song pretty
heavily. But this newsgroup is not about marketing, so the real question
is whether the marketing drove the microarchitecture or not.

I personally don't believe it did. The P4 is a pretty good performer if
you ask me, so there seem to have been valid technical reasons to go
this route.


Stefan

Yousuf Khan

May 9, 2004, 8:13:40 PM
Bengt Larsson <bengt...@telia.NOSPAMcom> wrote:
> It's true that Intel would have a monopoly on IA-64 but there are two
> comments one can make on that:
>
> 1. Intel want to compete with IBM on processors for the high end, and
> IBM show no signs of stopping, so there will be competition.

Everybody else gave up the ghost at least five years ago, at the mere
thought of having to compete against IA64 before there was even a working
IA64, and instead embraced it wholeheartedly -- bye-bye MIPS, Alpha,
PA-RISC, etc. What you're merely saying is that we don't have to worry about
lack of competition because there were at least a few corporations that were
not stupid enough to give up their own processor architectures. What if they
had _all_ decided to give up at the mere mention of competition from Intel?

> 2. If Intel were to go too far and milk their monopoly too much there
> is a very simple remedy: force them to license IA-64 to someone. The
> Anti-trust remedy writes itself, unlike cases vs. Microsoft and IBM.

It's that simple, huh? Intel has managed to make life very difficult for its
x86 competitors to sell their processors to OEMs, and yet Intel manages to
keep away from the anti-trust authorities, because it never ever writes down
its threats.

> I have nothing against PowerPC, but it's domineered/dominated by IBM.
> There will never realistically be competition on PowerPC-based systems
> vs IBM.

There was Motorola too.

> The ideal would be processors from Intel and AMD, systems from other
> people (like Dell, HP, IBM...), operating systems from yet other
> people (like Linux) etc. All to promote competition.

Yes, that would be the ideal. However, that's more or less what's happening
already, if in a lopsided fashion.

Yousuf Khan


Yousuf Khan

May 9, 2004, 8:03:39 PM
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> Yousuf Khan wrote:
>> You're right, they shouldn't be much of a problem either. I was
>> actually thinking of those x86 operand-embedded-in-instruction style
>> instructions.
>
> Huh?
>
> Which opcodes would that be?
>
> Those with immediate data? Implicit registers.

Yes, the immediate data would be the one. Things such as:

mov eax, 0x00000001

Where that final 32-bit number "1" would occupy a full 4 bytes in the
instruction stream.

But remind me, which instruction forms are the implicit registers?

> The only thing that comes close afaik would seem to be a couple of the
> MMX/SSE permute operations where the actual operation to perform can
> be a runtime variable?

I wasn't really talking about the SIMD instructions, just the good old
fashioned x86 ones. Not familiar enough with the newer instructions. I used
to program in assembly back in the 386 days.

Yousuf Khan


Yousuf Khan

May 9, 2004, 8:23:42 PM
Samuel <sam...@austin.rr.com> wrote:
> "Yousuf Khan" <news.tal...@spamgourmet.com> wrote in message
>> Actually the main reason that PowerPC didn't win in the desktop
>> market was not because of performance, but because it simply didn't
>> run x86 software,
>
> Sure there is no arguing that, but there was a time when G4 PowerPC
> was lagging WAY behind in Frequency and Spec Int performance,
> particularly with the G3 and the G4. At one time the Pentium4 was
> 2-3X the frequency of the G4 which got Steve Jobs very worried. Then
> Jobs pulled the G5 development from Motorola granted it to IBM.

Jobs needn't have worried. Just as there was no way PC people were ever
going to switch to a Macintosh processor, no matter what the performance,
there was no way that Macintosh people would've ever switched to a
PC processor. You're just stuck in the environment that you're stuck in.

> Anyhow, with IBM clocking the PPC976 @ 3.5 GHz and Intel falling back
> on a 2 GHz core, I see Apple would hardly be worried about frequency
> wars in the next couple of year at least. Adding to that the Power6
> core will be a speed demon which is a 180 degree shift from the
> Power4, Power5 approach, Apple would be even more comfortable with
> frequency in the future (assuming IBM don't screw up with Power6 like
> Intel screwed up with NetBUST)

IBM isn't at 3.5 GHz yet, and it would seem rather optimistic that they'll
touch that speed even with 90nm. What is the pipeline length of the PPC
97x? About 10 stages? I'd say AMD would be closer to 3.5 GHz with its 12
stage pipeline than the PPC. AMD is already nearing 2.5 GHz with a 130nm
process.

Sure, clock frequency isn't everything, but if you are going to make it to a
certain frequency, the pipeline has to accommodate it.

Yousuf Khan


Yousuf Khan

May 9, 2004, 8:23:42 PM
Bengt Larsson <bengt...@telia.NOSPAMcom> wrote:
> AMD bypassed the whole thing with their xxxx+ marketing anyway.

Which was fortunate for them. Previous attempts at equating true
performance against an Intel processor ended up hurting the manufacturers
that tried it (i.e. Cyrix, and even AMD to a certain extent).

Yousuf Khan


Yousuf Khan

May 9, 2004, 8:23:43 PM
Samuel <sam...@austin.rr.com> wrote:
> PIIIs were great at some point, probably even before the first Athlon
> came out. The PIIIs were competing very well with K6 and the four
> stage pipelined G3 PPC. When the Athlon came out it took the lead in
> Spec int performance and the PIII cores started to show their age.
> Today it's a different world, AMD is doing a great job and they are
> ahead with multi core design with cores that are superior to the good
> old PIIIs.
>
> I don't really see how the "Bach to the Future" approach would make
> Intel processors competative in performance with AMD.

Well, it's all Intel has got at the moment. And it's not as if it is a
completely unmodified P3 core; it's got all of those wonderful
power-saving features.

Yousuf Khan


Douglas Siebert

May 10, 2004, 12:01:05 AM
"Stephen Sprunk" <ste...@sprunk.org> writes:

>Google doesn't turn up much on FBDRAM, but it appears it'll have the same
>pincount as DDR2, so I doubt we'll be getting past two channels (per chip)
>any time soon, regardless of how many cores we can cram into a die.


Search under "fully buffered DRAM" and you'll probably find more...

FBDRAM uses far fewer pins than DDR2, but the plan is that it will use the
DDR2 socket (at least initially) for cost/compatibility reasons. FBDRAM
uses existing DRAM chips. It could use DDR2 or some future thing like
DDR3 without changing the memory controller -- great for on-die memory
controllers! A FBDRAM DIMM would look the same as a regular DIMM, with
the addition of a single chip (the buffer). It is probably
doable to set things up so that a motherboard that supported FBDRAM
could detect if you plugged regular DIMMs in and use them
instead.

IIRC there are only 69 pins required per FBDRAM controller, so you can
support more than twice the channels with FBDRAM, and it allows for a
larger number of DIMMs per channel.

--
Douglas Siebert dsie...@excisethis.khamsin.net

When hiring, avoid unlucky people, they are a risk to the firm. Do this by
randomly tossing out 90% of the resumes you receive without looking at them.

Douglas Siebert

May 10, 2004, 12:10:10 AM
"Samuel" <sam...@austin.rr.com> writes:

>Does anyone have the performance numbers for the Pentium M? It just doesn't
>make sense to me to use a core that is meant for low-end mobile computing
>in high-end server applications, unless they no longer care about SPECint
>numbers and focus on TPC-C numbers, much like IBM and Sun; that way they can
>connect 4 cores and above and support SMT per core (yes, it's called SMT, not
>HyperThreading). If that is the case, what is going to happen to technical
>applications and games that need the powerful SPECint performance when
>everyone goes the other route?


Intel doesn't want to sell you a desktop CPU for stuff that needs
"powerful SPECint performance"; they want you to buy Itanium for that.

Douglas Siebert

May 10, 2004, 12:16:48 AM
"Felger Carbon" <fms...@jfoops.net> writes:

>"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
>news:c7jfmg$ah0$1...@narsil.avalon.net...
>>
>> I haven't heard anything remotely concrete about an on-die memory
>> controller for Intel, other than claims about it being the real reason
>> for the 775 pins in the new socket. For the dual Pentium M, there's no
>> reason they'd have to do that, certainly not in their first iteration.
>> After all, Pentium Ms run pretty well with a 1600 MB/s FSB in today's
>> laptops. Give them the 6.4 GB/s FSB in today's P4s, or by the end of
>> 2005 8 GB/s or even 9.6 GB/s for DDR2-533 or DDR2-667, and I think those
>> two cores, even if they were twice as fast as today's top end Pentium
>> Ms, would be quite well fed memory wise. They don't need the bandwidth
>> a P4 does since they don't have the long pipeline and extra high clock
>> rates.

>I believe the main point of on-die memory controllers is reduced
>latency, not improved bandwidth.


I believe the post I was responding to was implying that a dual core
Pentium M would demand twice as much bandwidth and therefore suffer
performance-wise. But you are correct: an on-die controller reduces
latency, though that reduced latency does give a small bandwidth benefit
as part of the deal.


>> I think on die memory controllers get more interesting for Intel
>if/when
>> they get FBDRAM going to allow for plenty of memory channels for
>those
>> quad core CPUs they will probably be making in 65nm, without needing
>2000
>> pin packages.

>You seem to believe that each core on the quad-core chip will have an
>independent memory controller/channel. While I have no definite
>information to the contrary, it seems unlikely. Am I missing
>something here?


No, I don't believe that at all. The cores will share the same memory
controllers. But since a P4 performs better with an 800 MHz FSB than
with a 400 MHz one, and an A64 performs better with dual channels rather
than one, it stands to reason that dual cores sharing X amount of memory
bandwidth will perform worse than a single core with X amount of memory
bandwidth to itself. And it only gets worse with quad cores. Clearly it
wouldn't be cost effective to provide quad-channel DDR (let alone the
even more pin-hungry DDR2, which will be mainstream when 65 nm stuff
comes out in 2006). But with FBDRAM, you need fewer pins for quad
channels than dual-channel DDR requires today.
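As a back-of-the-envelope check on the pin argument: the post gives 69 pins per FBDRAM channel, but the DDR per-channel signal-pin count below is a rough hypothetical placeholder used only for comparison, not a sourced figure.

```python
# Pin-budget sketch. FBD_PINS_PER_CHANNEL comes from the post above;
# DDR_PINS_PER_CHANNEL is a hypothetical assumption for illustration.
FBD_PINS_PER_CHANNEL = 69    # from the post
DDR_PINS_PER_CHANNEL = 150   # hypothetical placeholder, not a sourced number

quad_fbd = 4 * FBD_PINS_PER_CHANNEL   # four FBDRAM channels
dual_ddr = 2 * DDR_PINS_PER_CHANNEL   # dual-channel DDR
print(quad_fbd, dual_ddr)  # prints: 276 300
```

Under that assumption, four FBDRAM channels still come in under two DDR channels, which is the shape of the claim being made.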

Bill Todd

May 10, 2004, 1:19:34 AM

"Yousuf Khan" <news.tal...@spamgourmet.com> wrote in message
news:vFznc.4839$n7P1...@twister01.bloor.is.net.cable.rogers.com...

... Things such as:

>
> mov eax, 0x00000001
>
> Where that final 32-bit number "1" would occupy a full 4 bytes in the
> instruction stream.

Oh, my - it's been a *long* time. But though the details have faded from
memory, ISTR that the x86 instruction set provides mechanisms for
compressing immediate operands that fit into 1 or 2 bytes, zero- or
sign-extending them to full destination width (leaving aside explicit
mechanisms such as MOVZX and MOVSX - which may not take immediate source
operands - multi-instruction sequences, and creative use of instructions
such as LEA).

- bill
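The compression Bill remembers is easy to see in the raw encodings. Here is a small sketch using hand-assembled IA-32 byte sequences from the standard opcode map; only the lengths matter. Note that ADD has a sign-extended imm8 form, while MOV r32, imm32 does not, which is why Yousuf's "mov eax, 1" always carries a full 4-byte immediate:

```python
# Hand-assembled IA-32 encodings (standard opcode map).
add_eax_1_imm8  = bytes([0x83, 0xC0, 0x01])                    # add eax, 1  (imm8, sign-extended)
add_eax_1_imm32 = bytes([0x81, 0xC0, 0x01, 0x00, 0x00, 0x00])  # add eax, 1  (full imm32)
mov_eax_1       = bytes([0xB8, 0x01, 0x00, 0x00, 0x00])        # mov eax, 1  (imm32 only; no imm8 form)

print(len(add_eax_1_imm8), len(add_eax_1_imm32), len(mov_eax_1))  # prints: 3 6 5
```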

Andi Kleen

May 10, 2004, 2:00:26 AM
"Norbert Juffa" <ju...@earthlink.net> writes:

> It's not clear to me what "for x87 code it is true too" refers to.
> Could you clarify please?

x87 code is much shorter than SSE2 code, especially 64bit SSE2 code
which has additional REX prefixes.

What I attempted to say was that you should not compare 32bit-with-x87
to 64bit-with-SSE2, but 32bit-with-SSE2 to 64bit-with-SSE2.
And Greg's experiences with heavy SSE2 floating point code are somewhat
of an exceptional case for code length comparisons.

Or rather, if you do such comparisons, compare both: one applies to
optimized software and the other to packaged software, where shipping
x87 code will probably be the norm for some more years until all the
non-SSE2 x86s are thoroughly obsolete. It probably does not make that
much difference, because floating point heavy code is usually rare
(and when it is not, you are more likely to work with "optimized"
instead of "generic" code). The Intel compiler actually has options to
generate paths for both, but comparing to that would be really unfair.

> When I last checked several years ago, 32-bit integer code on x86
> took up approximately 3.8 bytes/instruction. IIRC, x87 intensive

I get 3.2 bytes/instruction with gcc 3.2 (and 3.4 bytes for x86-64
with gcc 3.3-hammer, but with fewer instructions). The gcc 3.2 code was
mostly optimized for the P6 core, the 64bit code was optimized for
the K8 (including big function prologues). It probably depends on
the compiler a lot.

> code had a _shorter_ average instruction length due to its tight
> encoding, where one register operand (ST0) is implicit. Also, x87
> code in general did not need prefixes (e.g. 0x0f, 0x66 etc). FWIW,
> 16-bit x86 integer code (ca 1995) ran about 2.8 bytes/instruction.

Thanks for the information.

-Andi
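For anyone wanting to reproduce bytes/instruction figures like these, one rough approach is to parse `objdump -d` style output and average the encoded lengths. The disassembly fragment below is a hypothetical sample for illustration, not output from either compiler discussed:

```python
# Average bytes/instruction from objdump-style disassembly text.
# Each line is: address <TAB> hex bytes <TAB> mnemonic.
sample = """\
 8048354:\t55                   \tpush   %ebp
 8048355:\t89 e5                \tmov    %esp,%ebp
 8048357:\tb8 01 00 00 00       \tmov    $0x1,%eax
 804835c:\tc9                   \tleave
 804835d:\tc3                   \tret
"""

lengths = []
for line in sample.splitlines():
    fields = line.split("\t")
    if len(fields) >= 2:
        # Field 1 holds the hex bytes; counting them gives the length.
        lengths.append(len(fields[1].split()))

avg = sum(lengths) / len(lengths)
print(round(avg, 1))  # prints: 2.0
```

Running the same loop over `objdump -d` output from a real binary would give numbers comparable to the 2.8-3.8 bytes/instruction range quoted above.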

Yousuf Khan

May 10, 2004, 2:00:06 AM

Yes, obviously you could've replaced that entire "mov eax, ..." stuff with a
"mov al, 0x01" and that would've compressed it down nicely. But that's not
really the point I was trying to make. The point I was trying to make was
that immediate values would clog up the instruction cache, not the data
cache. And a big 32-bit immediate would clog up an Icache more than an
old-fashioned 8-bit immediate.

If we're worried about the sizes of variable-length instructions occupying
too much room in Icaches, an instruction with an immediate value would be
one of the largest instructions available.

Yousuf Khan


Terje Mathisen

May 10, 2004, 2:25:53 AM
Stefan Monnier wrote:

May I suggest you all take the 70-90 minutes required to watch Bob
Colwell (+ Andy 'Crazy' Glew at one point) explain all this stuff in a
Stanford lecture?

http://stanford-online.stanford.edu/courses/ee380/040218-ee380-100.asx

_Very_ short version: Yes, the P4 was intentionally a GHz speed demon,
but tempered with the need to deliver some actual performance that the
engineers could be comfortable with.

Bob also makes the same argument that I made in a conference
presentation last year: The P4 is quite brittle, i.e. it is too easy to
get stuck with very non-optimal performance for too long.

Terje
PS. Thanks to RM for sending me the link!

Terje Mathisen

May 10, 2004, 4:13:00 AM
Yousuf Khan wrote:

> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>>Yousuf Khan wrote:
>>
>>>You're right, they shouldn't be much of a problem either. I was
>>>actually thinking of those x86 operand-embedded-in-instruction style
>>>instructions.
>>
>>Huh?
>>
>>Which opcodes would that be?
>>
>>Those with immediate data? Implicit registers.
>
>
> Yes, the immediate data would be the one. Things such as:
>
> mov eax, 0x00000001
>
> Where that final 32-bit number "1" would occupy a full 4 bytes in the
> instruction stream.

Actually, it would not: Values from -128 to +127 are encoded as a single
byte, making the instruction two or three bytes shorter.

However, what's the problem???

Instructions with immediate data are a staple of pretty much every cpu
architecture afaik!


>
> But remind me, which instruction forms are the implicit registers?

MUL/DIV/SH*/SAR/R*R/R*L/LOOP*/CBW/LODS/STOS/MOVS/IN/OUT/...
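As a concrete illustration of why the implicit-register forms Terje lists are so compact, here are a few more hand-assembled IA-32 encodings (standard opcode map; again, the lengths are the point):

```python
# Implicit operands keep these encodings to one or two bytes.
lodsd   = bytes([0xAD])        # lodsd: load from [esi] into eax, both implicit
stosd   = bytes([0xAB])        # stosd: store eax to [edi], both implicit
mul_ecx = bytes([0xF7, 0xE1])  # mul ecx: edx:eax = eax * ecx, eax/edx implicit

print(len(lodsd), len(stosd), len(mul_ecx))  # prints: 1 1 2
```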

Grumble

May 10, 2004, 4:37:31 AM
Stephen Sprunk wrote:

> Smaller caches provide lower latency, which was supposedly
> the justification for the anemic L1D and trace caches in
> Willamette/Northwood. Prescott doubles the L1D size and the
> associativity, but at the cost of increasing the latency from
> one (two?) cycles to four cycles. Based on performance results
> to date, this appears to be a wash.

IA-32 Optimization Reference Manual
http://intel.com/design/pentium4/manuals/24896610.pdf
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters

Northwood
L1 = 8 KB, 4-way, 2/9 cycles INT/FP latency, write-through
L2 = 512 KB, 8-way, 9/16 cycles INT/FP latency, write-back

Prescott
L1 = 16 KB, 8-way, 4/12 cycles INT/FP latency, write-through
L2 = 1 MB, 8-way, 22/30 cycles INT/FP latency, write-back


L2 latency took a hit too :-)

Nick Maclaren

May 10, 2004, 4:56:04 AM