Hammer multiply


Joe Keane

Jan 13, 2002, 7:14:19 AM
Here's my report on Hammer multiply.

In short, the multiply support is the same as x86, except you have
prefixes to specify 64-bit operands and use more registers.

There are several forms that give single-word results, plus a couple
that give a double-word result in RDX:RAX.
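
A minimal sketch of what the double-word forms compute, written in Python rather than Hammer assembly: an unsigned 64 x 64 multiply whose 128-bit product is split into a high word and a low word, which is how the RDX:RAX pair is laid out. The name `mul_wide` is made up for illustration, and this models only the arithmetic, not the instruction encoding or flags.

```python
MASK64 = (1 << 64) - 1  # all ones in a 64-bit word

def mul_wide(a, b):
    """Unsigned 64x64 -> 128-bit multiply, returned as (high, low) words.

    The high word plays the role of RDX and the low word the role of RAX.
    """
    product = (a & MASK64) * (b & MASK64)
    return product >> 64, product & MASK64

high, low = mul_wide(MASK64, MASK64)
print(hex(high), hex(low))
```

A single-word form would keep only the low word and drop the rest.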

I approve.

Providing a double-word result is of course critical for some things.
Some architectures don't have that, because they're dumb.

The additional registers are nice. When writing x86 code, i sometimes
think i really could use just one or two extra registers.

I don't know about the instruction timing, but seeing how the Athlon
beats everything except an Alpha, i bet it's fairly good.

--
Joe Keane, amateur mathematician

McCalpin

Jan 13, 2002, 12:45:50 PM
In article <vSe08.2608$gf.6...@sjc-read.news.verio.net>,

Joe Keane <j...@jgk.org> wrote:
>
>I don't know about the instruction timing, but seeing how the Athlon
>beats everything except an Alpha, i bet it's fairly good.

Excuse me....

Where does an Athlon beat a POWER4 in performance?
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Niels Jørgen Kruse

Jan 13, 2002, 5:36:48 PM
In article <a1sh4e$9ei$1...@ausnews.austin.ibm.com>,
mcca...@gmp246.austin.ibm.com (McCalpin) wrote:

> In article <vSe08.2608$gf.6...@sjc-read.news.verio.net>,
> Joe Keane <j...@jgk.org> wrote:
>>
>>I don't know about the instruction timing, but seeing how the Athlon
>>beats everything except an Alpha, i bet it's fairly good.
>
> Excuse me....
>
> Where does an Athlon beat a POWER4 in performance?

I think he was talking about multiple precision multiply. What are the
latencies and throughput for the POWER4 multiplier?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Sander Vesik

Jan 13, 2002, 6:10:10 PM
McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> In article <vSe08.2608$gf.6...@sjc-read.news.verio.net>,
> Joe Keane <j...@jgk.org> wrote:
>>
>>I don't know about the instruction timing, but seeing how the Athlon
>>beats everything except an Alpha, i bet it's fairly good.

> Excuse me....

> Where does an Athlon beat a POWER4 in performance?

Do you sell Power4 desktops single/dual cpu machines yet?

And I bet it beats Power4 in the performance you get per unit of money 8-)

> --
> John D. McCalpin, Ph.D. mcca...@austin.ibm.com
> Senior Technical Staff Member IBM POWER Microprocessor Development
> "I am willing to make mistakes as long as
> someone else is willing to learn from them."

--
Sander

+++ Out of cheese error +++

Brannon Batson

Jan 14, 2002, 2:00:54 AM
Sander Vesik <san...@haldjas.folklore.ee> wrote in message news:<10109634...@haldjas.folklore.ee>...

> McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> > In article <vSe08.2608$gf.6...@sjc-read.news.verio.net>,
> > Joe Keane <j...@jgk.org> wrote:
> >>
> >>I don't know about the instruction timing, but seeing how the Athlon
> >>beats everything except an Alpha, i bet it's fairly good.
>
> > Excuse me....
>
> > Where does an Athlon beat a POWER4 in performance?
>
> Do you sell Power4 desktops single/dual cpu machines yet?

Let's see. We already had an Alpha mention. Now we have someone
claiming that Power4 doesn't matter because it doesn't sell to the
desktop market. This has got to be the shortest thread to ever hit
these two comp.arch laws (are there more?).

How about I just call you all Nazis and we end this thread right here?

Brannon

> [snip]

Bruce Hoult

Jan 14, 2002, 2:14:43 AM
In article <4495ef1f.02011...@posting.google.com>,
Brannon...@yahoo.com (Brannon Batson) wrote:

> > Do you sell Power4 desktops single/dual cpu machines yet?
>
> Let's see. We already had an Alpha mention. Now we have someone
> claiming that Power4 doesn't matter because it doesn't sell to the
> desktop market.

Last week I watched the MacWorld keynote with a friend who owns an Apple
VAR, and we happened to talk about the Power4. While not being aware of
exactly how much IBM sells them for, he seemed pretty sure that he could
sell a good number of higher powered Macs even if they were much more
expensive, provided that they ran Photoshop or Maya or whatever faster.

He thought the downside to Apple rebadging IBM hardware (with OS X on
it) if they didn't want to do the R&D themselves would be zero, and the
mere existence in the catalogue of something like that as an upgrade
path would make it easier to sell G4s in some quarters.

-- Bruce

Erik Corry

Jan 14, 2002, 4:51:51 AM
Bruce Hoult <br...@hoult.org> wrote:

> Last week I watched the MacWorld keynote with a friend who owns an Apple
> VAR, and we happened to talk about the Power4. While not being aware of
> exactly how much IBM sells them for, he seemed pretty sure that he could
> sell a good number of higher powered Macs even if they were much more
> expensive, provided that they ran Photoshop or Maya or whatever faster.

Is the lack of Altivec a problem?

--
Erik Corry er...@arbat.com
Interviewer: "Real programmers use cat as their editor."
Bill Joy: "That's right! There you go! It is too much trouble to say ed,
because cat's smaller and only needs two pages of memory."

Sander Vesik

Jan 14, 2002, 5:35:03 AM
Brannon Batson <Brannon...@yahoo.com> wrote:
> Sander Vesik <san...@haldjas.folklore.ee> wrote in message news:<10109634...@haldjas.folklore.ee>...
>> McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>> > In article <vSe08.2608$gf.6...@sjc-read.news.verio.net>,
>> > Joe Keane <j...@jgk.org> wrote:
>> >>
>> >>I don't know about the instruction timing, but seeing how the Athlon
>> >>beats everything except an Alpha, i bet it's fairly good.
>>
>> > Excuse me....
>>
>> > Where does an Athlon beat a POWER4 in performance?
>>
>> Do you sell Power4 desktops single/dual cpu machines yet?

> Let's see. We already had an Alpha mention. Now we have someone
> claiming that Power4 doesn't matter because it doesn't sell to the
> desktop market. This has got to be the shortest thread to ever hit
> these two comp.arch laws (are there more?).

I didn't say it doesn't matter. If Power4 does not sell to the workstation
market it is trivially easy to beat it there - a 486 is sufficient.

> Brannon

Ketil Z Malde

Jan 14, 2002, 8:36:45 AM
Bruce Hoult <br...@hoult.org> writes:

> Last week I watched the MacWorld keynote with a friend who owns an Apple
> VAR, and we happened to talk about the Power4. While not being aware of
> exactly how much IBM sells them for, he seemed pretty sure that he could
> sell a good number of higher powered Macs even if they were much more
> expensive, provided that they ran Photoshop or Maya or whatever faster.

Apparently, a good share of the machines sold for personal use are
high end. Most places I've worked seem to insist on just
about the most full-featured equipment they can get.

Basically, even a very high end PC is cheap compared to labor costs,
so I suppose there's a market for even faster computers.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Jan C. Vorbrüggen

Jan 14, 2002, 8:41:56 AM
> There are several forms that give single-word results, plus a couple
> that give a double-word result in RDX:RAX.

The Alpha architecture doesn't have those on purpose: it destroys instruction
regularity, in particular by requiring extra decoding
interdependencies and at least one extra register write port. On
Alpha, you get to choose whether you want the lower or the upper
word of the result instead.
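
To illustrate the Alpha approach, here is a rough Python model of two separate operations, one returning the low 64 bits of the product and one the high 64 bits (Alpha calls these MULQ and UMULH), used to multiply a multi-word number by a single word. The helper names are hypothetical and the code models only the arithmetic.

```python
MASK64 = (1 << 64) - 1

def mul_low(a, b):
    """Low 64 bits of the unsigned product."""
    return (a * b) & MASK64

def mul_high(a, b):
    """High 64 bits of the unsigned product."""
    return (a * b) >> 64

def mul_limbs_by_word(limbs, w):
    """Multiply a little-endian list of 64-bit limbs by one 64-bit word."""
    out = []
    carry = 0
    for limb in limbs:
        lo = mul_low(limb, w)
        hi = mul_high(limb, w)
        total = lo + carry
        out.append(total & MASK64)
        carry = hi + (total >> 64)  # cannot overflow 64 bits
    out.append(carry)
    return out

def to_int(limbs):
    """Reassemble little-endian limbs into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

print(mul_limbs_by_word([MASK64, 1], 2))
```

Each limb costs two multiply operations here instead of one double-word multiply, which is the trade-off under discussion.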

Jan

Bernd Paysan

Jan 14, 2002, 9:47:44 AM
"Jan C. Vorbrüggen" wrote:

However, nothing prevents AMD from decoding a double-word mul into two
operations.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bruce Hoult

Jan 14, 2002, 7:47:44 PM
In article <a1u9nn$u94$1...@news.net.uni-c.dk>, Erik Corry
<er...@arbat.com> wrote:

> Bruce Hoult <br...@hoult.org> wrote:
>
> > Last week I watched the MacWorld keynote with a friend who owns
> > an Apple VAR, and we happened to talk about the Power4. While
> > not being aware of exactly how much IBM sells them for, he
> > seemed pretty sure that he could sell a good number of higher
> > powered Macs even if they were much more expensive, provided
> > that they ran Photoshop or Maya or whatever faster.
>
> Is the lack of Altivec a problem?

Not if a Power4 (or 3) is faster than a G4 even without it :-)

IBM had the chance to adopt Altivec, but appeared to think they could
build faster machines without it. Whether that meant faster on vector
codes or only faster on scalar I don't know.

If the Power machines have huge memory and disk bandwidth then that
might make them worthwhile on a lot of Photoshop-type work even without
Altivec. Not everything is gaussian blur or rotate.

There aren't yet many programs that *require* Altivec. iDVD requires a
G4, and I assume that's because of Altivec.


Even without Photoshop and Sorenson encoding and so forth, if Apple is
going to break into the corporate market then they need some real server
hardware with all the good stuff that Suns and Intel-based servers have.
IBM has such machines which presumably shouldn't be *too* hard to get OS
X going on.


I really don't know whether Power3 or Power4 Macs would be any faster
than G4 Macs. I sure hope someone at Apple *does* know and the
non-existence of them is for good reasons rather than because no one
checked.

-- Bruce

Christopher Brian Colohan

Jan 14, 2002, 9:14:03 PM
Bruce Hoult <br...@hoult.org> writes:
> Not if a Power4 (or 3) is faster than a G4 even without it :-)
>
> IBM had the chance to adopt Altivec, but appeared to think they could
> build faster machines without it. Whether that meant faster on vector
> codes or only faster on scalar I don't know.

Remember, Power4 is currently geared towards database and scientific
workload folks. Neither of these markets use the machines to play
DVDs or edit 2d images. I would guess that multimedia workloads don't
even appear on the designer's radar for these first designs.

If a version of Power4 is created for the consumer/artist market, I
would presume IBM would make sure multimedia performance was decent.
Perhaps they will bolt an AltiVec (or AltiVec-like) unit onto the side
of the design...

Chris
--
Chris Colohan Email: ch...@colohan.ca PGP: finger col...@cs.cmu.edu
Web: www.cs.cmu.edu/~colohan Phone: (412)268-4751

Bruce Hoult

Jan 14, 2002, 9:37:01 PM
In article <ucln0zg...@gs138.sp.cs.cmu.edu>, Christopher Brian
Colohan <colo...@cs.cmu.edu> wrote:

> Bruce Hoult <br...@hoult.org> writes:
> > Not if a Power4 (or 3) is faster than a G4 even without it :-)
> >
> > IBM had the chance to adopt Altivec, but appeared to think they could
> > build faster machines without it. Whether that meant faster on vector
> > codes or only faster on scalar I don't know.
>
> Rememeber, Power4 is currently geared towards database and scientific
> workloads folks.

Right. I would have thought that the scientific workload folks were
interested in Altivec -- or at least the ones for whom single precision
is enough.


> Neither of these markets use the machines to play
> DVDs or edit 2d images. I would guess that multimedia workloads don't
> even appear on the designer's radar for these first designs.

While there are some things in Photoshop which are CPU limited, I would
have thought that there was a lot more that were memory bandwidth
limited and which would therefore benefit from the kinds of machines IBM
makes.


> If a version of Power4 is created for the consumer/artist market, I
> would presume IBM would make sure multimedia performance was decent.

But do we know that it's bad at the moment?

-- Bruce

Hugh Fisher

Jan 15, 2002, 4:31:52 AM
In article <bruce-9317EB....@news.paradise.net.nz>, Bruce Hoult
<br...@hoult.org> wrote:

> IBM had the chance to adopt Altivec, but appeared to think they could
> build faster machines without it. Whether that meant faster on vector
> codes or only faster on scalar I don't know.

There were rumours at the time (sorry, no actual evidence) that
IBM didn't want to build AltiVec chips because their designers
thought AltiVec would be difficult to implement at high clock
speeds. With the IBM G3s now hitting 1Ghz, that looks a pretty
good call in hindsight.

The new Nintendo has an IBM PPC variant in it. Presumably this
has some kind of SIMD instruction set (if only for bragging
rights compared to Sony and XBox). Anybody know anything about
it?

> I really don't know whether Power3 or Power4 Macs would be any faster
> than G4 Macs. I sure hope someone at Apple *does* know and the
> non-existence of them is for good reasons rather than because no one
> checked.

Too small a market niche probably. Apple did in the past
rebadge some IBM servers, but they were never really popular
and eventually Apple gave up on them. One of the clone makers
(when Mac clones were allowed) tried to build massive crunchers
for the graphics market, but that wasn't viable either.

Presumably it is too expensive, and takes too long, to design
a specialised box that will sell in sufficient numbers before
the next iteration of the regular desktop machines overtakes
it.

Hugh Fisher

Martin Høyer Kristiansen

Jan 15, 2002, 3:38:10 AM
Hugh Fisher wrote:
> The new Nintendo has an IBM PPC variant in it. Presumably this
> has some kind of SIMD instruction set (if only for bragging
> rights compared to Sony and XBox). Anybody know anything about
> it?

It packs two floats into one 64bit FP register, much like AMD's 3Dnow!.
With multiply-add this results in 4 floating point operations per cycle,
~1.8 GFLOPS in all.
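
The arithmetic behind those figures can be checked in a few lines. This is only a back-of-envelope sketch: the clock it derives is implied by the quoted ~1.8 GFLOPS peak, not a published specification.

```python
floats_per_register = 2    # two single-precision values per 64-bit FP register
ops_per_multiply_add = 2   # a multiply-add counts as one multiply plus one add
flops_per_cycle = floats_per_register * ops_per_multiply_add

quoted_peak_flops = 1.8e9  # the "~1.8 GFLOPS" figure quoted above
implied_clock_hz = quoted_peak_flops / flops_per_cycle

print(flops_per_cycle, "flops/cycle,", implied_clock_hz / 1e6, "MHz implied")
```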

Cheers
Martin

McCalpin

Jan 15, 2002, 7:56:09 AM
In article <bruce-9317EB....@news.paradise.net.nz>,
Bruce Hoult <br...@hoult.org> wrote:
>I really don't know whether Power3 or Power4 Macs would be any faster
>than G4 Macs. I sure hope someone at Apple *does* know and the
>non-existence of them is for good reasons rather than because no one
>checked.

Although I have no dealings with Apple (other than as a long-time
Mac user), I suspect that the current POWER4 chips have some
features that are disadvantages from Apple's point of view.

The POWER4 chips dissipate a lot of heat. I don't know what we
have disclosed in public, but it is certainly in excess of 100
watts for the dual-cpu chip.

The POWER4 chips are large (over 400 mm^2) and are therefore not
cheap.

Of course, future process technologies will make the chips smaller,
cheaper, and cooler.

Jonathan Thornburg

Jan 15, 2002, 9:41:48 AM
In article <bruce-51D19B....@news.paradise.net.nz>,

Bruce Hoult <br...@hoult.org> wrote:
>I would have thought that the scientific workload folks were
>interested in Altivec -- or at least the ones or whom single precision
>is enough.

Most scientific computing is done, and most scientific code is written,
in 64-bit floating-point. ("Real supercomputing" used to define 64-bit
fp as "single precision" and 128-bit as "double precision", but this
usage is less common today.)

That's not to say there aren't a few areas where 32-bit fp suffices.
But they're relatively uncommon even within the number-crunching world.
So alas, Altivec isn't much use for "normal" (64-bit fp) scientific
work.
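
A small sketch of why 32-bit fp often isn't enough: summing 0.1 a hundred thousand times, once with every intermediate rounded to single precision and once in double. `to_single` and `accumulate` are illustrative helpers, and the rounding is emulated with Python's `struct` module rather than run on real vector hardware.

```python
import struct

def to_single(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, single):
    """Add `step` to a running total `count` times.

    With single=True, the step and every partial sum are rounded to 32 bits.
    """
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)
    return total

single_sum = accumulate(0.1, 100000, single=True)
double_sum = accumulate(0.1, 100000, single=False)
print("single:", single_sum, "double:", double_sum)
```

The exact answer is 10000; the gap between the two printed sums is the kind of accumulated error that pushes most number-crunching codes to 64-bit fp.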

--
-- Jonathan Thornburg <jth...@aei.mpg.de>
Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut),
Golm, Germany http://www.aei.mpg.de/~jthorn/home.html
"Washing one's hands of the conflict between the powerful and the
powerless means to side with the powerful, not to be neutral."
-- quote by Freire / poster by Oxfam

ttk_ciar

Jan 15, 2002, 1:05:53 PM

Once upon a time, Bruce Hoult <br...@hoult.org> said:
> Date: Tue, 15 Jan 2002 13:47:44 +1300

>>
>> Is the lack of Altivec a problem?
>
>Not if a Power4 (or 3) is faster than a G4 even without it :-)

Once upon a time, IBM was sticking PowerPC 604 processors into
some of their boxes and running POWER binaries on them. They'd
handle the missing instructions by simulating them on invalid
opcode trap, which was very slow, but since the 604 was running
most instructions so much faster than the POWER implementations
of the time (might have been "POWER implementations of similar
price bracket"; I don't recall exactly), the net performance was
very good.

It seems to me that apple might do the same thing with POWER4,
handling the AltiVec instructions via trap, if the resulting
system was much faster for most applications (ie, those that did
not rely very heavily on AltiVec).
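
Whether trap-and-emulate pays off comes down to simple arithmetic. The sketch below is a toy cost model, not a measurement: the trap cost and speedup plugged in are invented numbers, and the function names are hypothetical.

```python
def relative_runtime(trap_fraction, trap_cost, speedup):
    """Runtime on the faster machine relative to the slower native one.

    trap_fraction: share of executed instructions that must be emulated
    trap_cost:     cost of one emulated instruction, in units of an
                   ordinary instruction on the faster machine
    speedup:       how much faster the new machine runs ordinary instructions
    """
    cost_per_instruction = (1 - trap_fraction) + trap_fraction * trap_cost
    return cost_per_instruction / speedup

def break_even_fraction(trap_cost, speedup):
    """Trap fraction at which both machines take the same time."""
    return (speedup - 1) / (trap_cost - 1)

# Invented example: a trap costs 500 ordinary instructions, machine is 2x faster.
print(break_even_fraction(500, 2))
print(relative_runtime(0.0001, 500, 2))
```

With those made-up numbers the faster machine still wins as long as fewer than about 0.2% of executed instructions trap.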

-- TTK

Peter Boyle

Jan 15, 2002, 3:20:18 PM

On 15 Jan 2002, TTK Ciar wrote:

>
> Once upon a time, Bruce Hoult <br...@hoult.org> said:

> >> Is the lack of Altivec a problem?
> >
> >Not if a Power4 (or 3) is faster than a G4 even without it :-)
>

[Snipped emulate missing instruction]

> It seems to me that apple might do the same thing with POWER4,
> handling the AltiVec instructions via trap, if the resulting
> system was much faster for most applications (ie, those that did
> not rely very heavily on AltiVec).

And those that do will have G3 alternative code paths in any case.
Peter

> -- TTK
>
>

Paul Wallich

Jan 15, 2002, 3:27:48 PM

Ideally, of course, one would trap to a dynamic recompilation
routine of appropriate complexity...

paul

Niels Jørgen Kruse

Jan 15, 2002, 4:49:59 PM
In article <a218t9$fv4$1...@ausnews.austin.ibm.com>,
mcca...@gmp246.austin.ibm.com (McCalpin) wrote:

> The POWER4 chips are large (over 400 mm^2) and are therefore not
> cheap.
>
> Of course, future process technologies will make the chips smaller,
> cheaper, and cooler.

I have been thinking that the low CPU count / lower margin systems are
waiting on a shrink of the POWER4.

Bruce Hoult

Jan 15, 2002, 5:01:05 PM
In article <pw-150102...@192.168.1.100>, p...@panix.com (Paul
Wallich) wrote:

There's no need, since most applications containing AltiVec code have
AltiVec and non-AltiVec code which they select at runtime so they can
run on all those iMacs and iBooks.
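
The runtime selection described here might look roughly like this. It is a sketch of the dispatch pattern only: Python has no AltiVec, so `scale_vector` merely stands in for a vectorized routine, and `has_altivec` is a hypothetical flag that a real application would obtain from the operating system.

```python
def scale_scalar(samples, gain):
    """Plain path: one element at a time."""
    return [s * gain for s in samples]

def scale_vector(samples, gain):
    """Stand-in for a vector path: walks the data four elements per step."""
    out = []
    for i in range(0, len(samples), 4):
        chunk = samples[i:i + 4]
        out.extend(s * gain for s in chunk)
    return out

def pick_scale(has_altivec):
    """Choose the implementation once, at startup."""
    return scale_vector if has_altivec else scale_scalar

scale = pick_scale(has_altivec=False)  # e.g. on a machine without a vector unit
print(scale([1.0, 2.0, 3.0], 2.0))
```

Because both paths give the same answers, a machine without the vector unit simply takes the slower branch and no trap is needed.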

-- Bruce

Niels Jørgen Kruse

Jan 15, 2002, 5:48:54 PM
In article <laranzu-1501...@1cust23.tnt3.cbr1.da.uu.net>,
lar...@ozemail.deletethis.com.au (Hugh Fisher) wrote:

> There were rumours at the time (sorry, no actual evidence) that
> IBM didn't want to build AltiVec chips because their designers
> thought AltiVec would be difficult to implement at high clock
> speeds. With the IBM G3s now hitting 1Ghz, that looks a pretty
> good call in hindsight.

What is so difficult about dividing a (vector) pipeline into more stages?

Trying to keep more instructions in flight (than the G3) without splitting a
stage is another matter.

Douglas Siebert

Jan 15, 2002, 7:40:13 PM
"Niels Jørgen Kruse" <nj_k...@get2net.dk> writes:

>I have been thinking that the low CPU count / lower margin systems are
>waiting on a shrink of the POWER4.


If so, they'll have to wait a while. IIRC, it's already made in .13u, .10u
is going to be a while. They are probably saving up POWER4 dies where one
core tested good and one tested bad to be used in low end desktops.

--
Douglas Siebert dsie...@excisethis.khamsin.net

He who laughs last thinks slowest.

del cecchi

Jan 15, 2002, 9:01:40 PM

"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:a22i5d$rsk$1...@sword.avalon.net...

> "Niels Jørgen Kruse" <nj_k...@get2net.dk> writes:
>
> >I have been thinking that the low CPU count / lower margin systems are
> >waiting on a shrink of the POWER4.
>
>
> If so, they'll have to wait a while. IIRC, its already made in .13u, .10u
> is going to be a while. They are probably saving up POWER4 dies where one
> core tested good and one tested bad to be used in low end desktops.

Already selling them for HPC. Besides we got SSTAR and 630 for that.

aaron spink

Jan 16, 2002, 1:19:38 AM

"Douglas Siebert" <dsie...@excisethis.khamsin.net> wrote in message
news:a22i5d$rsk$1...@sword.avalon.net...
> If so, they'll have to wait a while. IIRC, its already made in .13u, .10u
> is going to be a while. They are probably saving up POWER4 dies where one
> core tested good and one tested bad to be used in low end desktops.
>
IIRC, Power4 is done in IBM's CMOS 8 SOI process which is a 180nm nominal
critical dimension SOI substrate, copper metal process. I do not believe
that IBM is shipping 130nm Power4 cpus but plans to sometime around the end
of the year.

Aaron Spink
speaking for Myself Inc.


Anil T Maliyekke

Jan 16, 2002, 2:58:49 AM
McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> Although I have no dealings with Apple (other than as a long-time
> Mac user), I suspect that the current POWER4 chips have some
> features that are disadvantages from Apple's point of view.

> The POWER4 chips dissipate a lot of heat. I don't know what we
> have disclosed in public, but it is certainly in excess of 100
> watts for the dual-cpu chip.

> The POWER4 chips are large (over 400 mm^2) and are therefore not
> cheap.

> Of course, future process technologies will make the chips smaller,
> cheaper, and cooler.
> --
> John D. McCalpin, Ph.D. mcca...@austin.ibm.com


Plus there is that issue with cache line size and Apple's use of dcbz


McCalpin

Jan 16, 2002, 7:32:33 AM
In article <a22i5d$rsk$1...@sword.avalon.net>,

Douglas Siebert <dsie...@excisethis.khamsin.net> wrote:
>"Niels Jørgen Kruse" <nj_k...@get2net.dk> writes:
>
>>I have been thinking that the low CPU count / lower margin systems are
>>waiting on a shrink of the POWER4.
>
>If so, they'll have to wait a while. IIRC, its already made in .13u, .10u
>is going to be a while. They are probably saving up POWER4 dies where one
>core tested good and one tested bad to be used in low end desktops.

YRI. (You recall incorrectly.) ;-)

POWER4 is made in the IBM 8S3 process, which is 0.18 microns.

The 0.13 micron process (IBM 9S) is operational, and we have POWER4
(shrink) parts in the lab produced in this process, but we have
not publicly announced any of the products that contain these
parts. The CMOS 9S process was first announced in December, 2000.
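
For a rough sense of what the shrink buys, assume an ideal optical shrink in which both die dimensions scale with the feature size. Real shrinks rarely scale this cleanly, and 400 mm^2 is only the "over 400" lower bound mentioned earlier in the thread, so treat the result as an estimate.

```python
def ideal_shrink_area(area_mm2, old_feature_um, new_feature_um):
    """Die area after an ideal shrink: both dimensions scale with feature size."""
    scale = new_feature_um / old_feature_um
    return area_mm2 * scale * scale

shrunk = ideal_shrink_area(400.0, 0.18, 0.13)
print(round(shrunk, 1), "mm^2")
```

Under that assumption the die comes out at a little over half its original area.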

Carlie Coats

Jan 16, 2002, 8:08:38 AM
In article <ucln0zg...@gs138.sp.cs.cmu.edu>,
Christopher Brian Colohan <colo...@cs.cmu.edu> wrote:
> Bruce Hoult <br...@hoult.org> writes:
> > Not if a Power4 (or 3) is faster than a G4 even without it :-)
> >
> > IBM had the chance to adopt Altivec, but appeared to think they could
> > build faster machines without it. Whether that meant faster on vector
> > codes or only faster on scalar I don't know.
>
> Rememeber, Power4 is currently geared towards database and scientific
> workloads folks. Neither of these markets use the machines to play
> DVDs or edit 2d images. I would guess that multimedia workloads don't
> even appear on the designer's radar for these first designs.

On the other hand, multimedia processing has a *lot* more in common
with scientific workloads than it does with office-style apps. And the
POWER architecture *does* have a streaming load-store unit that helps a
lot with these sorts of things (it works better for multimedia than for
some scientific workloads, actually: the latter are too likely to
have too many non-stride-1 streams for it to handle well).

Given my experience optimizing scientific loads on it (and on other
systems), I strongly suspect they *can* do at least as well on simpler
vector codes with their existing architecture as with Altivec.


Carlie J. Coats, Jr. co...@emc.mcnc.org
MCNC-Environmental Modeling Center phone: (919)248-9241
North Carolina Supercomputing Center fax: (919)248-9245
3021 Cornwallis Road P. O. Box 12889
Research Triangle Park, N. C. 27709-2889 USA
"My opinions are my own, and I've got *lots* of them!"


Holger Bettag

Jan 16, 2002, 8:48:13 AM
"Niels Jørgen Kruse" <nj_k...@get2net.dk> writes:

>
> In article <laranzu-1501...@1cust23.tnt3.cbr1.da.uu.net>,
> lar...@ozemail.deletethis.com.au (Hugh Fisher) wrote:
>
> > There were rumours at the time (sorry, no actual evidence) that
> > IBM didn't want to build AltiVec chips because their designers
> > thought AltiVec would be difficult to implement at high clock
> > speeds. With the IBM G3s now hitting 1Ghz, that looks a pretty
> > good call in hindsight.
>
> What is so difficult about dividing a (vector) pipeline into more stages?
>

I guess nothing. Compare the vector ALUs of MPC7400 and MPC7450.

Holger

Christian Bau

Jan 17, 2002, 3:01:12 AM

I would just be curious to know how many bytes a dcbz will set to zero
on a POWER4. I wouldn't be surprised at all if the number is 32, same as
on all PowerPC processors ever built, no matter what the cache line size
is. Unfortunately I have never been able to find a POWER4 instruction
manual.

Holger Bettag

Jan 17, 2002, 9:10:20 AM
Christian Bau <christ...@cbau.freeserve.co.uk> writes:

[...]


> I would just be curious to know how many bytes a dcbz will set to zero
> on a POWER4. I wouldn't be surprised at all if the number is 32, same as
> on all PowerPC processors ever built, no matter what the cache line size
> is. Unfortunately I have never been able to find a POWER4 instruction
> manual.

I could be mistaken, but I believe there exist some embedded PPCs where dcbz
is either unimplemented or clears only 16 bytes (the respective cache line
size).

Holger

McCalpin

Jan 17, 2002, 10:42:06 AM
In article <3C4684C9...@cbau.freeserve.co.uk>,

Christian Bau <christ...@cbau.freeserve.co.uk> wrote:
>Anil T Maliyekke wrote:
>>
>> McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>> > Although I have no dealings with Apple (other than as a long-time
>> > Mac user), I suspect that the current POWER4 chips have some
>> > features that are disadvantages from Apple's point of view.
>>
>> > The POWER4 chips dissipate a lot of heat. I don't know what we
>> > have disclosed in public, but it is certainly in excess of 100
>> > watts for the dual-cpu chip.
>>
>> > The POWER4 chips are large (over 400 mm^2) and are therefore not
>> > cheap.
>>
>> > Of course, future process technologies will make the chips smaller,
>> > cheaper, and cooler.
>> > --
>> > John D. McCalpin, Ph.D. mcca...@austin.ibm.com
>>
>> Plus there is that issue with cache line size and Apple's use of dcbz
>
>I would just be curious to know how many bytes a dcbz will set to zero
>on a POWER4.

The number is 128.
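
To see why a 128-byte dcbz could matter to software tuned for 32-byte lines, here is a toy simulation. It assumes code that clears a buffer by issuing one dcbz every 32 bytes; whether Apple's code actually does this is not stated in the thread. Memory is modelled as a Python bytearray and all names are illustrative.

```python
def dcbz(memory, address, line_size):
    """Zero the whole cache line that contains `address`."""
    start = address - (address % line_size)
    memory[start:start + line_size] = bytes(line_size)

def clear_buffer(memory, start, length, line_size, assumed_line=32):
    """Clear [start, start+length) the way code written for 32-byte lines would."""
    for address in range(start, start + length, assumed_line):
        dcbz(memory, address, line_size)

def surviving_neighbours(line_size):
    """Count bytes outside the 64-byte buffer that keep their original value."""
    memory = bytearray(b"\xff" * 256)
    clear_buffer(memory, start=32, length=64, line_size=line_size)
    outside = memory[:32] + memory[96:]
    return outside.count(0xFF)

print(surviving_neighbours(32), surviving_neighbours(128))
```

With 32-byte lines every neighbouring byte survives; with 128-byte lines the same loop also wipes the data sharing the line with the buffer.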


--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com

Andrew Mobbs

Jan 17, 2002, 12:16:39 PM
McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>
>The POWER4 chips dissipate a lot of heat. I don't know what we
>have disclosed in public, but it is certainly in excess of 100
>watts for the dual-cpu chip.

Some IBM sales people were telling us a few weeks ago how energy
efficient and cheap Regatta was compared to Sun and IA64 servers. They
had multi-coloured graphs to prove it.

--
Andrew Mobbs - http://www.chiark.greenend.org.uk/~andrewm/

McCalpin

Jan 17, 2002, 1:53:53 PM
In article <R6s*9g...@news.chiark.greenend.org.uk>,

Andrew Mobbs <and...@chiark.greenend.org.uk> wrote:
>McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>>
>>The POWER4 chips dissipate a lot of heat. I don't know what we
>>have disclosed in public, but it is certainly in excess of 100
>>watts for the dual-cpu chip.
>
>Some IBM sales people were telling us a few weeks ago how energy
>efficient and cheap Regatta was compared to Sun and IA64 servers. They
>had multi-coloured graphs to prove it.

And their multi-colored graphs are correct, too!

What they are showing is the amount of power required to deliver a
certain level of performance on benchmarks for large SMP systems.
POWER4 performance is quite high per processor, and since there are
two processors per chip, the power per processor is high, but not
unreasonable. In combination, the power required for high fixed
levels of performance is very good.

But this analysis would not be very helpful in determining an
appropriate cpu for a laptop computer, for example, or even
a computer for an office environment with strict limits on
fan loudness.

Del Cecchi

Jan 17, 2002, 2:27:27 PM
In article <R6s*9g...@news.chiark.greenend.org.uk>,

And you are claiming that these claims are inconsistent? How many watts/TPM-C
are the respective boxen?
--

Del Cecchi
cecchi@rchland

Toon Moene

Jan 17, 2002, 3:15:46 PM
McCalpin wrote:

> The 0.13 micron process (IBM 9S) is operational, and we have POWER4
> (shrink) parts in the lab produced in the this process, but we have
> not publicly announced any of the products that contain these
> parts. The CMOS 9S process was first announced in December, 2000.

Which makes me think - I still need a 64bit, big-endian system to
complete the set:

32-bit little-endian Intel Pentium II
64-bit little-endian Alpha 21264
32-bit big-endian PowerPC G4
64-bit big-endian IBM Z-series

When's the 64-bit S/390 laptop planned ?
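
For anyone wondering what such a set is good for: code that silently depends on byte order behaves differently across these machines. A quick sketch with Python's `struct` module shows the same 64-bit value laid out both ways.

```python
import struct
import sys

value = 0x0102030405060708

little = struct.pack("<Q", value)  # little-endian layout, as on x86 or Alpha
big = struct.pack(">Q", value)     # big-endian layout, as on S/390

print("little:", little.hex())
print("big:   ", big.hex())
print("this host is", sys.byteorder, "endian")

# Reading little-endian bytes as if they were big-endian scrambles the value.
misread = struct.unpack(">Q", little)[0]
print(hex(misread))
```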

--
Toon Moene - mailto:to...@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
Join GNU Fortran 95: http://g95.sourceforge.net/ (under construction)

Andrew Duane USG

Jan 17, 2002, 3:44:03 PM
In article <R6s*9g...@news.chiark.greenend.org.uk> Andrew Mobbs <and...@chiark.greenend.org.uk> writes:
>McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>>The POWER4 chips dissipate a lot of heat. I don't know what we
>>have disclosed in public, but it is certainly in excess of 100
>>watts for the dual-cpu chip.
>
>Some IBM sales people were telling us a few weeks ago how energy
>efficient and cheap Regatta was compared to Sun and IA64 servers. They
>had multi-coloured graphs to prove it.

CPU power != system power

this applies to throughput as well as watts dissipated :-)

--

Andrew L. Duane (JOT-7) du...@zk3.dec.com
Compaq Computer Corporation (603)-884-1294
110 Spit Brook Road
M/S ZKO3-3/U14
Nashua, NH 03062-2698


Pete Zaitcev

Jan 17, 2002, 3:45:30 PM
> Which makes me think - I still need a 64bit, big-endian system to
> complete the set:
>
> 32-bit little-endian Intel Pentium II
> 64-bit little-endian Alpha 21264
> 32-bit big-endian PowerPC G4
> 64-bit big-endian IBM Z-series
>
> When's the 64-bit S/390 laptop planned ?

For the moment, you can use a SPARC laptop. They are not
cheap, but they do exist. Unfortunately, no USIII based
laptop is out there yet.

-- Pete

Greg Lindahl

Jan 17, 2002, 3:51:36 PM
In article <3C4730F2...@moene.indiv.nluug.nl>,
Toon Moene <to...@moene.indiv.nluug.nl> wrote:

>Which makes me think - I still need a 64bit, big-endian system to
>complete the set:
>
>32-bit little-endian Intel Pentium II
>64-bit little-endian Alpha 21264
>32-bit big-endian PowerPC G4
>64-bit big-endian IBM Z-series

Almost all HPC 64-bit systems are big endian, so buy yourself a Cray
vector machine used and be done with it...

g

Chris Morgan

Jan 17, 2002, 4:55:31 PM
zai...@yahoo.com (Pete Zaitcev) writes:

> For the moment, you can use a SPARC laptop. They are not
> cheap, but they do exist. Unfortunately, no USIII based
> laptop is out there yet.

If it's $8k for an UltraSPARC-IIe based laptop, I figure an
UltraSPARC-III based one would have to be $80k

Chris

-only half kidding
--
Chris Morgan

Peter Boyle

unread,
Jan 17, 2002, 5:01:50 PM1/17/02
to

Hi,

There was a thread on this a few months ago, and John McCalpin
kindly posted the following, which I have lifted:

Peter

Cache line sizes, in bytes:

          Icache    Dcache       L2 cache   L3 cache
------------------------------------------------------
POWER1    64        64           N/A        N/A
POWER2    128       64/128/256   ???        N/A

PPC 601   64(32) (unified)       external   N/A
PPC 603   32        32           external   N/A

POWER3    128       128          128        N/A
RS64      128       128          128        N/A
POWER4    128(32)   128          128        512(128)
------------------------------------------------------
Notes:
1. The notation "128(32)" means: 128 byte line with 32 byte sectors
2. The notation 64/128/256 means that various models
were available with each of the indicated Dcache line sizes.
3. The PowerPC 601 had a unified L1 cache.
4. I was unable to find documentation on the L2 line size
for the POWER2 machines with L2 caches. I never happened
to own one myself, and don't have any handy right now, either....
5. The notation "external" means that any cache at this level
is on the processor's memory bus and is not defined by the
processor architecture (and may not even be visible to the
processor).
6. All of the processors except POWER1 and POWER2 are PowerPC
compliant.
------------------------------------------------


Peter Boyle pbo...@physics.gla.ac.uk

Toon Moene

unread,
Jan 17, 2002, 6:06:25 PM1/17/02
to
Greg Lindahl wrote:

> I wrote:
>
> >Which makes me think - I still need a 64bit, big-endian system to
> >complete the set:
> >
> >32-bit little-endian Intel Pentium II
> >64-bit little-endian Alpha 21264
> >32-bit big-endian PowerPC G4
> >64-bit big-endian IBM Z-series
>
> Almost all HPC 64-bit systems are big endian, so buy yourself a Cray
> vector machine used and be done with it...

Doh ! I didn't say "laptop" for nothing - obviously I need something
with a reasonable power consumption ...

McCalpin

unread,
Jan 17, 2002, 5:59:52 PM1/17/02
to
>McCalpin wrote:
>
>> The 0.13 micron process (IBM 9S) is operational, and we have POWER4
>> (shrink) parts in the lab produced in this process, but we have
>> not publicly announced any of the products that contain these
>> parts. The CMOS 9S process was first announced in December, 2000.
>
>Which makes me think - I still need a 64bit, big-endian system to
>complete the set:
>
>32-bit little-endian Intel Pentium II
>64-bit little-endian Alpha 21264
>32-bit big-endian PowerPC G4
>64-bit big-endian IBM Z-series

POWER3 systems are 64-bit big-endian, and you can get one of
those for a bit less than a Z-series machine.

If you don't want to pay for a new one, I just saw five of
them for sale on eBay....


(For that matter, SGI MIPS R10000/R12000 machines are 64-bit
big-endian, and those are really cheap on eBay.)

Brannon Batson

unread,
Jan 17, 2002, 6:46:38 PM1/17/02
to
mcca...@gmp246.austin.ibm.com (McCalpin) wrote in message news:<a276k1$ihg$1...@ausnews.austin.ibm.com>...

> In article <R6s*9g...@news.chiark.greenend.org.uk>,
> Andrew Mobbs <and...@chiark.greenend.org.uk> wrote:
> >McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> >>
> >>The POWER4 chips dissipate a lot of heat. I don't know what we
> >>have disclosed in public, but it is certainly in excess of 100
> >>watts for the dual-cpu chip.
> >
> >Some IBM sales people were telling us a few weeks ago how energy
> >efficient and cheap Regatta was compared to Sun and IA64 servers. They
> >had multi-coloured graphs to prove it.
>
> And their multi-colored graphs are correct, too!
>
> What they are showing is the amount of power required to deliver a
> certain level of performance on benchmarks for large SMP systems.
> POWER4 performance is quite high per processor, and since there are
> two processors per chip, the power per processor is high, but not
> unreasonable. In combination, the power required for high fixed
> levels of performance is very good.

Who cares about energy efficiency of the system? Does the cost of
electrical power really differentiate total cost of ownership?

It seems to me that power really only matters (to a customer) in terms
of density and heat dissipation. For example, a lower-power system
can be packed more densely, which means that it can take up less space
in a server room. Also, a lower-power system may require less
expensive cooling facilities. If neither of these apply in the
specific situation for the customer, then all you are really doing is
lowering their power bill by some modest percentage. If you get to
this point in comparing systems, then you are probably off in the
weeds somewhere.

> [snip]

Brannon
not speaking for Intel

Skipper Smith

unread,
Jan 17, 2002, 6:52:06 PM1/17/02
to

Sorry, Christian, but the 620 dcbz zeroed out 64 bytes, and the 860 and
its derivatives all do 16 bytes.

--
Skipper Smith Helpful Knowledge Consulting
Worldwide Microprocessor Architecture Training
PowerPC, ColdFire, 68K, CPU32 Hardware and Software

/* Remove no-spam. from the reply address to send mail directly */

Rupert Pigott

unread,
Jan 17, 2002, 9:04:17 PM1/17/02
to
Brannon Batson <Brannon...@yahoo.com> wrote in message
news:4495ef1f.02011...@posting.google.com...
[SNIP]

> Who cares about energy efficiency of the system? Does the cost of
> electrical power really differentiate total cost of ownership?
>
> It seems to me that power really only matters (to a customer) in terms
> of density and heat dissipation. For example, a lower-power system

Certainly makes a difference if you're buying a large number of
units, OR building a new facility. Hell, even if you have a
crowded site, a machine which doesn't need lots of floor space
or cooling gives you more options on where you can put it.

If it shaves a few pennies off the electricity bill that's a
bonus. :)

Cheers,
Rupert


Bill Todd

unread,
Jan 17, 2002, 9:53:43 PM1/17/02
to

"Brannon Batson" <Brannon...@yahoo.com> wrote in message
news:4495ef1f.02011...@posting.google.com...

...

> Who cares about energy efficiency of the system? Does the cost of
> electrical power really differentiate total cost of ownership?

I'll assume you're not interested in laptop power issues...

>
> It seems to me that power really only matters (to a customer) in terms
> of density and heat dissipation. For example, a lower-power system
> can be packed more densely, which means that it can take up less space
> in a server room. Also, a lower-power system may require less
> expensive cooling facilities. If neither of these apply in the
> specific situation for the customer, then all you are really doing is
> lowering their power bill by some modest percentage. If you get to
> this point in comparing systems, then you are probably off in the
> weeds somewhere.

It's hardly difficult to make at least an order-of-magnitude approximation.
A 100W processor costs about a penny per hour to run (at my local electric
rates). If your facility is running a heat surplus, add another couple of
cents per hour for cooling costs. That's something like
$250/year/processor: neither dramatic nor insignificant in lower-end
systems (e.g., Google-style server farms), but probably down in the noise
level for higher-end installations.
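The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the $0.10/kWh rate is implied by "a penny per hour" for 100 W, the $0.02/hour cooling figure is one reading of "a couple of cents", and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the processor runs around the clock


def annual_cost(watts, dollars_per_kwh, cooling_dollars_per_hour=0.0):
    """Yearly cost of running one processor, optionally with cooling."""
    power_dollars_per_hour = watts / 1000 * dollars_per_kwh
    return (power_dollars_per_hour + cooling_dollars_per_hour) * HOURS_PER_YEAR


# 100 W at an assumed $0.10/kWh, plus an assumed $0.02/hour for cooling
print(round(annual_cost(100, 0.10, 0.02)))
```

With those assumptions the result lands a little above the $250/year/processor quoted, which is within the order-of-magnitude spirit of the estimate.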

- bill

Douglas Siebert

unread,
Jan 17, 2002, 11:37:58 PM1/17/02
to
mcca...@gmp246.austin.ibm.com (McCalpin) writes:

>In article <3C4730F2...@moene.indiv.nluug.nl>,
>Toon Moene <to...@moene.indiv.nluug.nl> wrote:
>>McCalpin wrote:
>>
>>> The 0.13 micron process (IBM 9S) is operational, and we have POWER4
>>> (shrink) parts in the lab produced in this process, but we have
>>> not publicly announced any of the products that contain these
>>> parts. The CMOS 9S process was first announced in December, 2000.
>>
>>Which makes me think - I still need a 64bit, big-endian system to
>>complete the set:
>>
>>32-bit little-endian Intel Pentium II
>>64-bit little-endian Alpha 21264
>>32-bit big-endian PowerPC G4
>>64-bit big-endian IBM Z-series

>POWER3 systems are 64-bit big-endian, and you can get one of
>those for a bit less than a Z-series machine.

>If you don't want to pay for a new one, I just saw five of
>them for sale on eBay....


>(For that matter, SGI MIPS R10000/R12000 machines are 64-bit
>big-endian, and those are really cheap on eBay.)


PA-RISC 64 bit machines also -- bet a C160 workstation would sell for a
couple hundred bucks these days. It'll run Linux in addition to HP-UX,
both 64 bit.

Tom

unread,
Jan 17, 2002, 11:45:27 PM1/17/02
to
> I would just be curious to know how many bytes a dcbz will set to zero
> on a POWER4. I wouldn't be surprised at all if the number is 32, same as
> on all PowerPC processors ever built, no matter what the cache line size
> is. Unfortunately I have never been able to find a POWER4 instruction
> manual.

Definitely the PPC403 (and others in the 4xx family?) has a 16-byte
cache line. And the high-end POWER processors have lines larger than
32 bytes (128 bytes, if memory serves).

The cacheline size is not an architected PowerPC constant.

Tom Gooding

Ketil's local user

unread,
Jan 18, 2002, 3:19:11 AM1/18/02
to
Brannon...@yahoo.com (Brannon Batson) writes:

> Who cares about energy efficiency of the system? Does the cost of
> electrical power really differentiate total cost of ownership?

ISTR some IBM protagonist discussing the advantages of a (I assume
fully configured) z900 or similar, replacing two thousand NT boxes.
Or, in other words, 15kW IBM iron replacing about 1MW of PCs.

I don't know if that's a realistic scenario, but (using local prices
for electricity at about 10c/kWh) the cost differential is about
$100/h (roughly 985 kW saved at $0.10/kWh). How it compares to the TCO
of those installations, I have no
idea. :-)
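A quick check of that arithmetic, using only the figures quoted in this post (1 MW of PCs, 15 kW of mainframe, 10c/kWh). The function name is illustrative, and the yearly figure assumes continuous operation.

```python
def hourly_cost(kilowatts, dollars_per_kwh):
    """Electricity cost per hour for a load of the given size."""
    return kilowatts * dollars_per_kwh


RATE = 0.10  # dollars per kWh, as quoted in the post

pcs = hourly_cost(1000, RATE)      # about 1 MW of PCs
mainframe = hourly_cost(15, RATE)  # about 15 kW of IBM iron
difference = pcs - mainframe

print(f"per hour: ${difference:.2f}")
print(f"per year: ${difference * 24 * 365:,.0f}")
```

That works out to a differential on the order of $100 per hour, or somewhat under a million dollars a year if everything runs continuously.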

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Terje Mathisen

unread,
Jan 18, 2002, 2:08:35 AM1/18/02
to
"Timothy A. Seufert" wrote:
> Another possible candidate for a critical path would be the vector
> permute unit. I'm not sure how it's any different from a barrel
> shifter, though. Just like a barrel shifter, it should be
> straightforward to pipeline if necessary.

The 5-bit in-register lookup function could be another speed path, since
this function allows a fully general any-to-any lookup/shift function.

(I don't _know_ that this is hard to do, it is just that from my point
of view it is such an _obviously_ useful function, but I don't know any
other architecture which has it.)

> Anybody who knows anything real about AltiVec implementation issues,
> please comment if you're free to do so. :)

Right!

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Andrew Mobbs

unread,
Jan 18, 2002, 4:44:49 AM1/18/02
to
Del Cecchi <dce...@vnet.ibm.com> wrote:
>In article <R6s*9g...@news.chiark.greenend.org.uk>,
> Andrew Mobbs <and...@chiark.greenend.org.uk> writes:
>|> McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>|> >
>|> >The POWER4 chips dissipate a lot of heat. I don't know what we
>|> >have disclosed in public, but it is certainly in excess of 100
>|> >watts for the dual-cpu chip.
>|>
>|> Some IBM sales people were telling us a few weeks ago how energy
>|> efficient and cheap Regatta was compared to Sun and IA64 servers. They
>|> had multi-coloured graphs to prove it.
>
>And you are claiming that these claims are inconsistent? How many watts/TPM-C
>are the respective boxen?

Not at all. I was mostly pleased to see energy efficiency becoming
something the marketeers focus on. However, the couple of hundred disks
that go with such a system mean relative CPU efficiency doesn't quite
give as marked a difference to overall power consumption as the graphs
superficially indicated.

I'll reserve judgement on the exact relative efficiency until after I've
benchmarked our application on Regatta.

Holger Bettag

unread,
Jan 18, 2002, 5:44:26 AM1/18/02
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

>
> "Timothy A. Seufert" wrote:
> > Another possible candidate for a critical path would be the vector
> > permute unit. I'm not sure how it's any different from a barrel
> > shifter, though. Just like a barrel shifter, it should be
> > straightforward to pipeline if necessary.
>
> The 5-bit in-register lookup function could be another speed path, since
> this function allows a fully general any-to-any lookup/shift function.
>

You are talking about the same thing. :-)

And yes, it can obviously be pipelined, because that's one of the
differences between MPC7400 (latency 1) and MPC7450 (latency 2,
throughput 1).
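To see what "latency 2, throughput 1" costs in practice, here is a deliberately idealized cycle-count model. It assumes a fully pipelined unit that can start one operation per cycle and ignores every other part of the machine; the function name and the figure of 100 operations are made up for illustration.

```python
def cycles(n_ops, latency, dependent):
    """Cycles to finish n_ops on a fully pipelined unit (one issue per cycle).

    dependent=True means each op needs the previous op's result,
    so the pipeline cannot overlap them.
    """
    if n_ops == 0:
        return 0
    if dependent:
        return n_ops * latency
    return latency + (n_ops - 1)


for name, latency in (("latency 1", 1), ("latency 2", 2)):
    print(name,
          "independent:", cycles(100, latency, dependent=False),
          "dependent:", cycles(100, latency, dependent=True))
```

Under this model the extra cycle of latency is almost free for independent permutes and doubles the time only for a chain where each permute feeds the next.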

Holger

Terje Mathisen

unread,
Jan 18, 2002, 6:07:54 AM1/18/02
to
Holger Bettag wrote:
>
> Terje Mathisen <terje.m...@hda.hydro.com> writes:
>
> >
> > "Timothy A. Seufert" wrote:
> > > Another possible candidate for a critical path would be the vector
> > > permute unit. I'm not sure how it's any different from a barrel
> > > shifter, though. Just like a barrel shifter, it should be
> > > straightforward to pipeline if necessary.
> >
> > The 5-bit in-register lookup function could be another speed path, since
> > this function allows a fully general any-to-any lookup/shift function.
> >
> You are talking about the same thing. :-)

Yeah, I should probably have realized that. :-(

Anyway, it's the lookup nature of the AltiVec version that makes it so
much more interesting than the MMX and SSE permute opcodes:

This is just like the difference between a variable (register-specified)
and a fixed (immediate constant as part of the opcode) shift:

You only _need_ one of them, but having that be the variable instead of
the fixed version is obviously advantageous.
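For readers who have not used it, the register-controlled lookup can be modelled in a few lines of Python. This is a sketch of the semantics of the AltiVec vperm instruction only (the low 5 bits of each control byte select one of the 32 bytes in the two source registers); the Python function and the sample data are illustrative.

```python
def vperm(a: bytes, b: bytes, control: bytes) -> bytes:
    """Model of vperm: each control byte picks one byte out of a followed by b."""
    assert len(a) == len(b) == len(control) == 16
    table = a + b  # 32 candidate source bytes
    return bytes(table[sel & 0x1F] for sel in control)


a = bytes(range(0, 16))
b = bytes(range(16, 32))

# A shift across the register pair by 3 bytes is just one particular control vector.
shifted = vperm(a, b, bytes(range(3, 19)))

# A byte reversal of a is another; the control vector is ordinary run-time data.
reversed_a = vperm(a, b, bytes(range(15, -1, -1)))

print(list(shifted))
print(list(reversed_a))
```

Because the control vector lives in a register, the same instruction covers shifts, reversals, and table lookups without a separate opcode or immediate for each pattern.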

> And yes, it can obviously be pipelined, because that's one of the
> differences between MPC7400 (latency 1) and MPC7450 (latency 2,
> throughput 1).

Which makes perfect sense, since it was the speed bump caused by the
need to forward the controlling register content to the permute unit
that caused me to guess that it could be a critical path! :-)

Terje

PS. I really hate how the P4 has destroyed all the shift/multiply
opcodes, to the point where I haven't really written any significant
piece of P4-optimized asm. :-(

Sander Vesik

unread,
Jan 18, 2002, 6:12:55 PM1/18/02
to
Toon Moene <to...@moene.indiv.nluug.nl> wrote:
> McCalpin wrote:

>> The 0.13 micron process (IBM 9S) is operational, and we have POWER4
>> (shrink) parts in the lab produced in this process, but we have
>> not publicly announced any of the products that contain these
>> parts. The CMOS 9S process was first announced in December, 2000.

> Which makes me think - I still need a 64bit, big-endian system to
> complete the set:

> 32-bit little-endian Intel Pentium II
> 64-bit little-endian Alpha 21264
> 32-bit big-endian PowerPC G4
> 64-bit big-endian IBM Z-series

> When's the 64-bit S/390 laptop planned ?

You should be able to get an UltraSPARC laptop if what you want is a
64bit big-endian system...

> --
> Toon Moene - mailto:to...@moene.indiv.nluug.nl - phoneto: +31 346 214290
> Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
> Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
> Join GNU Fortran 95: http://g95.sourceforge.net/ (under construction)

--
Sander

+++ Out of cheese error +++

Paul Hsieh

unread,
Jan 19, 2002, 7:24:45 AM1/19/02
to
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> "Timothy A. Seufert" wrote:
> > Another possible candidate for a critical path would be the vector
> > permute unit. I'm not sure how it's any different from a barrel
> > shifter, though. Just like a barrel shifter, it should be
> > straightforward to pipeline if necessary.
>
> The 5-bit in-register lookup function could be another speed path, since
> this function allows a fully general any-to-any lookup/shift function.
>
> (I don't _know_ that this is hard to do, it is just that from my point
> of view it is such an _obviously_ useful function, but I don't know any
> other architecture which has it.)

The Chromatic MPACT had a kind of "pseudo register lookup" mechanism.
Technically, you can consider the IA64's auto register rotation a kind
of register indexing scheme as well.

For any general renaming scheme, you kind of have to know which
register to rename obviously -- but that means that you need to know
the real value of this index register before you even work out the
dependencies for any instruction following it. It might just be that
there is no clean way to design these instructions without introducing
a scheduler-to-execution latency penalty into these instructions (much
like the partial register stall in the P6.) Alternatively, perhaps
they could somehow make the instruction dependent on all AltiVec
registers (thus making it a kind of fence instruction for AltiVec
instructions), but that might not be any easier to implement.

> > Anybody who knows anything real about AltiVec implementation issues,
> > please comment if you're free to do so. :)
>
> Right!

I don't know much about implementation, but intellectual property
alone may be reason enough for IBM to steer clear of it. What would
be the point of implementing AltiVec if it meant they had to pay
Motorola $5 for every CPU they sold?

Unlike Apple/Motorola, it appears as though IBM still wants to run and
look good on Spec CPU. They probably figured that special multimedia
instructions are not terribly useful for that benchmark.

Tim Olson

unread,
Jan 19, 2002, 9:15:35 AM1/19/02
to
In article <796f488f.02011...@posting.google.com>,
q...@pobox.com (Paul Hsieh) wrote:

| I don't know much about implementation, but intellectual property
| alone may be reason enough for IBM to steer clear of it. What would
| be the point of implementing AltiVec if it meant they had to pay
| Motorola $5 for every CPU they sold?

IBM was just as involved in the definition of AltiVec (more so, as I
remember) as Motorola and Apple were, so I don't think there would be
any licencing issue here.

--

-- Tim Olson

Christian Bau

unread,
Jan 19, 2002, 10:21:17 AM1/19/02
to
Peter Boyle wrote:
>
> On Thu, 17 Jan 2002, Christian Bau wrote:
>
> > I would just be curious to know how many bytes a dcbz will set to zero
> > on a POWER4. I wouldn't be surprised at all if the number is 32, same as
> > on all PowerPC processors ever built, no matter what the cache line size
> > is. Unfortunately I have never been able to find a POWER4 instruction
> > manual.
>
> Hi,
>
> There was a thread on this a few months ago, and John McCalpin
> kindly posted the following, which I have lifted:

I do remember that. However, it discussed cache line size, which is
invisible to the user except for performance, and the wording in the
Motorola documents at least for dcbz uses a slightly different term for
whatever dcbz sets to 0. And it certainly makes sense not to change the
behaviour of dcbz just because the cache line changes (for example,
changing the behaviour of dcbz will break some code). And in that thread
a while ago there were proposals, for example if a processor changes
cache line size then it could introduce a new (efficient) instruction
that clears an amount of memory equal to the new, larger cache line
size, and the original dcbz is unchanged but less efficient. Or even
traps.

In that thread the question of how many bytes dcbz erases was never
answered. However, John McCalpin did post that on POWER4, it erases 128
bytes.
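To make the compatibility worry concrete, here is a small Python simulation, not real PowerPC code. It models dcbz as zeroing the aligned block containing an address, and shows what happens to code that assumes one dcbz clears exactly 32 bytes when the block is really 16, 64, or 128 bytes (the sizes mentioned in this thread). All function names are hypothetical.

```python
def dcbz(mem: bytearray, addr: int, block: int) -> None:
    """Model of dcbz: zero the aligned block-byte region containing addr."""
    start = addr - addr % block
    mem[start:start + block] = bytes(block)


def clear_32_bytes(mem: bytearray, addr: int, block: int) -> None:
    """Code written on the assumption that one dcbz clears exactly 32 bytes."""
    dcbz(mem, addr, block)


def side_effects(block: int) -> tuple[int, int]:
    """Return (bytes wrongly zeroed, bytes left uncleared) for a 32-byte clear."""
    mem = bytearray(b"\xff" * 256)
    start, length = 32, 32
    clear_32_bytes(mem, start, block)
    target = mem[start:start + length]
    outside = mem[:start] + mem[start + length:]
    return outside.count(0), length - target.count(0)


for block in (16, 32, 64, 128):
    print(block, side_effects(block))
```

In this model a smaller block leaves part of the target untouched, while a larger block silently wipes neighbouring data, which is the kind of breakage the paragraph above has in mind.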

Christian Bau

unread,
Jan 19, 2002, 10:26:45 AM1/19/02
to
Terje Mathisen wrote:
>
> "Timothy A. Seufert" wrote:
> > Another possible candidate for a critical path would be the vector
> > permute unit. I'm not sure how it's any different from a barrel
> > shifter, though. Just like a barrel shifter, it should be
> > straightforward to pipeline if necessary.

Seems to be speed critical, because on the 7450 the latency has
increased from one cycle to two, while all the other vector instructions
are unchanged.

Niels Jørgen Kruse

unread,
Jan 19, 2002, 1:02:59 PM1/19/02
to
I artiklen <3C48020A...@hda.hydro.com> , Terje Mathisen
<terje.m...@hda.hydro.com> skrev:

> PS. I really hate how the P4 has destroyed all the shift/multiply
> opcodes, to the point where I haven't really written any significant
> piece of P4-optimized asm. :-(

Slow (integer) multipliers seem to be an Intel specialty, at least in the
IA32 line. Do they optimize for different instruction mixes, I wonder?

A short pipeline can artificially make multiply cheap relative to unit
latency instructions, but where is the limit (to the speed of multiply) in
the other direction?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Bruce Hoult

unread,
Jan 19, 2002, 5:23:16 PM1/19/02