POWER4 integer ALU latencies

Anil T Maliyekke

Sep 2, 2002, 6:03:17 PM

I've seen a couple of PowerPoint presentations on the web that state
that the integer add latency for POWER4 is 2 cycles. Is this true
and is it true for all operations dependent on the add result or just
those that will execute in a different integer unit? Also what
are the latencies for other common integer operations?

Anil

McCalpin

Sep 3, 2002, 5:33:54 AM

In article <al0n75$rdh$1...@newsx.cc.uic.edu>,

Throughput is single-cycle for almost all operations in the
integer ALUs, but latency is two cycles for dependent
operations. The two cycle latency holds for operations in the
same ALU or in the other ALU.
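
To make the throughput/latency distinction above concrete, here is a minimal Python sketch. It assumes an idealized, fully pipelined ALU that accepts one operation per cycle; the function names are hypothetical and the model is only an illustration, not a description of the actual POWER4 pipeline.

```python
def independent_ops_cycles(n_ops: int, latency: int) -> int:
    """Cycles until the last of n independent ops completes on one
    fully pipelined ALU (assumed: one new op can issue every cycle)."""
    if n_ops == 0:
        return 0
    # One op issues per cycle; the last one still needs `latency` cycles.
    return (n_ops - 1) + latency


def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for a chain in which each op consumes the previous result,
    so the next op cannot issue until the full latency has elapsed."""
    return n_ops * latency


if __name__ == "__main__":
    for latency in (1, 2):
        print(
            f"latency={latency}: "
            f"independent={independent_ops_cycles(100, latency)} cycles, "
            f"dependent chain={dependent_chain_cycles(100, latency)} cycles"
        )
```

Under these assumptions the second cycle of latency barely matters for independent operations, but it doubles the time taken by a pure dependency chain.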

If I recall correctly, the Alpha EV6 added the extra cycle only
if the dependent operations were in different ALU's. We did not
see enough benefit to make that optimization (not to mention that
it would be harder on POWER4 because of the higher frequency).
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Jan C. Vorbrüggen

Sep 3, 2002, 10:19:26 AM

> If I recall correctly, the Alpha EV6 added the extra cycle only
> if the dependent operations were in different ALU's.

If we're talking about the same issue, then this is because each ALU set
has a register set associated with it, and there's a one-cycle latency for
writes to one set to propagate to the other. OTOH, the issue unit tries,
IIRC, to avoid scheduling dependent instructions to different sets.

Jan

Paul DeMone

Sep 3, 2002, 11:51:12 AM

McCalpin wrote:

> Throughput is single-cycle for almost all operations in the
> integer ALUs, but latency is two cycles for dependent
> operations. The two cycle latency holds for operations in the
> same ALU or in the other ALU.
>
> If I recall correctly, the Alpha EV6 added the extra cycle only
> if the dependent operations were in different ALU's.

Extra cycle if crossing cluster boundary, not if from ALU within
the same cluster. The way code is scheduled for EV6 and instructions
issued to ALUs ensures a low occurrence of stall cycles from cross
cluster dependencies. This is all described pretty well in the EV6
compiler writer's guide.

> We did not
> see enough benefit to make that optimization (not to mention that
> it would be harder on POWER4 because of the higher frequency).

You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
bulk?

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.

Thomas Womack

Sep 3, 2002, 12:19:23 PM

In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:

> You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> bulk?

The 1.25GHz EV6 is in *0.18um* ?? Somehow, the phrase "talk about
misplaced effort" comes to mind; I was fairly sure Samsung had a
mature 0.13u process they could use. Then again, I've been expecting
the process shrink since the 833MHz model -- or was that a
tour-de-force of 0.25u?

Tom

Anil T Maliyekke

Sep 3, 2002, 2:35:55 PM

McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> Throughput is single-cycle for almost all operations in the
> integer ALUs, but latency is two cycles for dependent
> operations. The two cycle latency holds for operations in the
> same ALU or in the other ALU.

> If I recall correctly, the Alpha EV6 added the extra cycle only
> if the dependent operations were in different ALU's. We did not
> see enough benefit to make that optimization (not to mention that
> it would be harder on POWER4 because of the higher frequency).

POWER4's frequency isn't that much higher, and other CPU vendors
have gotten higher clockspeeds while maintaining 1 cycle latencies
for integer operations. If there aren't any instructions that can provide
results to dependent instructions in one cycle, it seems to me that
you could pretty much consider POWER4 a 650MHz 10-way decode, 16-way
issue processor. I'm surprised that POWER4 is competitive on integer
benchmarks. POWER4 OoO capabilities must be allowing the CPU to exploit
significant ILP.

Anil

Douglas Siebert

Sep 3, 2002, 4:47:45 PM

It does sound rather like the P4's double clocked ALU in reverse, doesn't
it? Imagine the fun the PC versus Mac advocacy groups will have if Apple
ships machines using the IBM single chip POWER4 variant, with the PC crowd
claiming that if the POWER4 is claimed to run at 1.3GHz, the P4 is really
running at 5.6GHz if you measure in the same way, and the Mac people coming
back with arguments about a 4x better efficiency per cycle :)

--
Douglas Siebert dsie...@excisethis.khamsin.net

A good friend will help you move, a true friend will help you move a body.

Anton Ertl

Sep 4, 2002, 3:45:50 AM

Douglas Siebert <dsie...@excisethis.khamsin.net> writes:
[Power 4 has two-cycle latency on ALU operations]

>It does sound rather like the P4's double clocked ALU in reverse, doesn't
>it? Imagine the fun the PC versus Mac advocacy groups will have if Apple
>ships machines using the IBM single chip POWER4 variant, with the PC crowd
>claiming that if the POWER4 is claimed to run at 1.3GHz, the P4 is really
>running at 5.6GHz if you measure in the same way,

AFAIK the Pentium 4 can do two sets of two _dependent_ adds per cycle,
so if you just take a chain of dependent adds as metric, it's 0.65GHz
vs. 5.6GHz (or, in 0.18u, 4GHz).
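
The arithmetic behind those figures can be sketched as follows. This is a rough illustration: the 1.3 GHz POWER4 clock and two-cycle latency come from this thread, while the 2.8 GHz and 2.0 GHz Pentium 4 clocks and the half-cycle add latency are assumptions inferred from the 5.6GHz and 4GHz numbers above. The function name is hypothetical.

```python
def dependent_add_rate_ghz(clock_ghz: float, latency_cycles: float) -> float:
    """Billions of back-to-back dependent adds per second along one chain."""
    return clock_ghz / latency_cycles


# POWER4: 1.3 GHz clock, two-cycle add latency (figures from the thread).
power4 = dependent_add_rate_ghz(1.3, 2.0)

# Pentium 4: double-pumped ALU, so an assumed half-cycle add latency.
p4_013u = dependent_add_rate_ghz(2.8, 0.5)  # assumed 0.13u clock
p4_018u = dependent_add_rate_ghz(2.0, 0.5)  # assumed 0.18u clock

if __name__ == "__main__":
    print(f"POWER4:            {power4:.2f} G dependent adds/s")
    print(f"Pentium 4 (0.13u): {p4_013u:.2f} G dependent adds/s")
    print(f"Pentium 4 (0.18u): {p4_018u:.2f} G dependent adds/s")
```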

> and the Mac people coming
>back with arguments about a 4x better efficiency per cycle :)

Right: the Power4 does 64-bit operations, and the Pentium4
only 32-bit operations:-)

I remember reading a posting (or paper) where the author reported
leaving out result forwarding to reach the 25MHz target speed for
a single-issue in-order CPU; without forwarding all ALU instructions
had two cycles latency. He commented that the CPU would have run
faster with forwarding (and single-cycle ALU instruction latency)
at 20MHz (which was the other alternative, but the contract specified
25MHz).

For a highly superscalar processor, I would expect latency to cost
even more (there is only so much ILP to go around). It would be
interesting to see a paper on the options the Power4 people looked at,
and what their simulations predicted for them. Or maybe I should just
check out how much SimpleScalar slows down with an ALU latency of 2:-).
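
Short of running SimpleScalar, a toy model gives a feel for the question. The sketch below is not SimpleScalar and not a POWER4 model: it assumes an idealized out-of-order core with an unlimited window, a fixed issue width, and a single latency for every instruction. The function name and the example workloads are hypothetical.

```python
def schedule_cycles(deps, latency, width):
    """Greedy issue of a dependency graph on an idealized OoO core.

    deps[i] lists the earlier instructions that instruction i reads from.
    Up to `width` ready instructions issue per cycle, oldest first.
    Returns the cycle at which the last result becomes available.
    """
    n = len(deps)
    if n == 0:
        return 0
    done = [None] * n  # cycle at which each result is available
    remaining = n
    cycle = 0
    while remaining:
        slots = width
        for i in range(n):
            if slots == 0:
                break
            if done[i] is not None:
                continue  # already issued
            # Ready only if every producer's result is available by now.
            if all(done[d] is not None and done[d] <= cycle for d in deps[i]):
                done[i] = cycle + latency
                slots -= 1
                remaining -= 1
        cycle += 1
    return max(done)


# One serial chain of four adds: nothing can hide the latency.
chain = [[], [0], [1], [2]]

# Four interleaved chains of three adds each: instruction k depends on k-4.
four_chains = [[k - 4] if k >= 4 else [] for k in range(12)]

if __name__ == "__main__":
    for name, graph in (("serial chain", chain), ("four chains", four_chains)):
        for latency in (1, 2):
            total = schedule_cycles(graph, latency, width=2)
            print(f"{name}, latency {latency}: {total} cycles")
```

In this toy model the serial chain takes twice as long at latency 2, while the four interleaved chains lose only a single cycle, because there is always independent work to fill the issue slots. How close real code is to either extreme is exactly what a proper simulation would have to answer.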

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anil T Maliyekke

Sep 4, 2002, 12:50:57 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Douglas Siebert <dsie...@excisethis.khamsin.net> writes:
> [Power 4 has two-cycle latency on ALU operations]
>>It does sound rather like the P4's double clocked ALU in reverse, doesn't
>>it? Imagine the fun the PC versus Mac advocacy groups will have if Apple
>>ships machines using the IBM single chip POWER4 variant, with the PC crowd
>>claiming that if the POWER4 is claimed to run at 1.3GHz, the P4 is really
>>running at 5.6GHz if you measure in the same way,

> AFAIK the Pentium 4 can do two sets of two _dependent_ adds per cycle,
> so if you just take a chain of dependent adds as metric, it's 0.65GHz
> vs. 5.6GHz (or, in 0.18u, 4GHz).

I just have to wonder what did Intel gain from these extremely fast
ALUs. Pentium 4 performance, while being the performance leader on
SPECInt2000 (though that is probably because no one else besides AMD is
shipping 0.13u parts), just doesn't seem that high when you consider the
fast ALUs/high clock speeds and the fast, low latency L1s.

Anil

David T. Wang

Sep 4, 2002, 1:42:02 PM

Anil T Maliyekke <ama...@icarus.cc.uic.edu> wrote:

> I just have to wonder what did Intel gain from these extremely fast
> ALUs. Pentium 4 performance, while being the performance leader on
> SPECInt2000 (though that is probably because no one else besides AMD is
> shipping 0.13u parts), just doesn't seem that high when you consider the
> fast ALUs/high clock speeds and the fast, low latency L1s.

There's a few more tens of points of performance if you run the same piece
of silicon in SMT mode. It won't help the SPEC INT score, but it does
suggest that there are more than enough idle cycles through the execution
units including the ALU to sustain the execution of a second thread. So
perhaps the fast ALU did contribute to better throughput, even if it didn't
contribute to higher (single threaded) peak performance.

--

ddaavveewwaanngg@@wwaamm..uummdd..eedduu

Paul DeMone

Sep 4, 2002, 2:17:32 PM

I have seen data to suggest that the half cycle integer ALUs in
the P4 increase performance on SPECint95 and SPECint2k by about
2.5 to 3% compared to a standard one cycle latency design.

Other studies suggest the POWER4's two cycle integer latency costs
it about 4 to 5% in performance on a SPECint2k type workload
compared to one cycle latency, although that would likely be sensitive
to how well the compiler schedules code.

Peter Dickerson

Sep 5, 2002, 4:28:33 AM

"Paul DeMone" <pde...@igs.net> wrote in message
news:3D764E3C...@igs.net...

Is this at the same clock speed or does it include the possible reduction in
clock speed required to achieve single cycle latency (more bypassing, single
cycle adders etc)?
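
The trade-off behind this question can be sketched with the usual first-order model, performance proportional to IPC times clock frequency. The model and the function names are assumptions for illustration; the 4 to 5% IPC figure is the one quoted earlier in the thread, and 1.3 GHz is the POWER4 clock mentioned there.

```python
def relative_performance(ipc_ratio: float, clock_ratio: float) -> float:
    """First-order model: performance scales as IPC times clock frequency."""
    return ipc_ratio * clock_ratio


def break_even_clock_ghz(clock_ghz: float, ipc_gain: float) -> float:
    """Clock at which a design with `ipc_gain` (e.g. 0.05 for 5%) more IPC
    merely matches the original design running at `clock_ghz`."""
    return clock_ghz / (1.0 + ipc_gain)


if __name__ == "__main__":
    for gain in (0.04, 0.05):
        f = break_even_clock_ghz(1.3, gain)
        print(f"IPC gain {gain:.0%}: single-cycle design must reach "
              f"{f:.3f} GHz to break even with 1.3 GHz")
```

In this model, if single-cycle latency costs more clock frequency than the IPC it gains, the two-cycle design comes out ahead.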

Doesn't the P4 gain and lose by its double pumped ALU? Dependent adds may
have single cycle throughput but multibit shifts are slower than
single-cycle (1.5 cycles?)?

Peter

McCalpin

Sep 5, 2002, 9:14:47 AM

In article <al2veb$t7r$1...@newsx.cc.uic.edu>,

Anil T Maliyekke <ama...@icarus.cc.uic.edu> wrote:

>McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>> Throughput is single-cycle for almost all operations in the
>> integer ALUs, but latency is two cycles for dependent
>> operations. The two cycle latency holds for operations in the
>> same ALU or in the other ALU.
>
>[...] I'm surprised that POWER4 is competitive on integer
>benchmarks. POWER4 OoO capabilities must be allowing the CPU to exploit
>significant ILP.

Although integer functional unit dependency chains do exist, they
are a small part of the performance equation for most applications.

McCalpin

Sep 5, 2002, 9:12:00 AM

In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
>
>McCalpin wrote:
>
>> We did not
>> see enough benefit to make that optimization (not to mention that
>> it would be harder on POWER4 because of the higher frequency).
>
>You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
>bulk?

I mean 1.3 GHz in volume in December 2001 for the POWER4,
compared to 1.25 GHz in August 2002 for the EV6. Eight months
is a big deal in this business.

Of course Paul knew that, but he does not seem to worry about
little details like reality when promoting his beloved Alpha.

Peter Boyle

Sep 5, 2002, 11:29:37 AM

On Thu, 5 Sep 2002, McCalpin wrote:

> In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
> >
> >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> >bulk?
>
> I mean 1.3 GHz in volume in December 2001 for the POWER4,
> compared to 1.25 GHz in August 2002 for the EV6. Eight months
> is a big deal in this business.

Not for the Alpha. That was the problem! :(

Peter

> Of course Paul knew that, but he does not seem to worry about
> little details like reality when promoting his beloved Alpha.
>
> --
> John D. McCalpin, Ph.D. mcca...@austin.ibm.com
> Senior Technical Staff Member IBM POWER Microprocessor Development
> "I am willing to make mistakes as long as
> someone else is willing to learn from them."
>

Peter Boyle pbo...@physics.gla.ac.uk

Rupert Pigott

Sep 5, 2002, 4:39:38 PM

"Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
news:Pine.GSO.4.33.020905...@holyrood.ed.ac.uk...

>
> On Thu, 5 Sep 2002, McCalpin wrote:
>
> > In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
> > >
> > >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> > >bulk?
> >
> > I mean 1.3 GHz in volume in December 2001 for the POWER4,
> > compared to 1.25 GHz in August 2002 for the EV6. Eight months
> > is a big deal in this business.
>
> Not for the Alpha. That was the problem! :(

I had a scan for 21264 papers... 97/98 500MHz... All us fanboys
had such high hopes too (fed by our imaginations rather than Digital
Err Compaq to be fair).

*cries into his Duvel*

Cheers,
Rupert

Paul DeMone

Sep 5, 2002, 6:01:41 PM

McCalpin wrote:
>
> In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
> >
> >McCalpin wrote:
> >
> >> We did not
> >> see enough benefit to make that optimization (not to mention that
> >> it would be harder on POWER4 because of the higher frequency).
> >
> >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> >bulk?
>
> I mean 1.3 GHz in volume in December 2001 for the POWER4,
> compared to 1.25 GHz in August 2002 for the EV6. Eight months
> is a big deal in this business.

Must have been their choice of foundry. :-)

Just kidding. Compaq was showing shmoo's of first pass Cu EV68s
running over 1.4 GHz more than 18 months ago. Why the lag to
market? Who knows but I bet Bill has a theory.

>
> Of course Paul knew that, but he does not seem to worry about
> little details like reality when promoting his beloved Alpha.

LOL, you have mistaken me for someone else. I fully accept
that Elvis has left the building...

Jan C. Vorbrüggen

Sep 6, 2002, 5:27:23 AM

> >> We did not
> >> see enough benefit to make that optimization (not to mention that
> >> it would be harder on POWER4 because of the higher frequency).
> >
> >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> >bulk?
>
> I mean 1.3 GHz in volume in December 2001 for the POWER4,
> compared to 1.25 GHz in August 2002 for the EV6. Eight months
> is a big deal in this business.

Your standards of discourse are dropping, John. What does the time of
market introduction of the EV68-1250 (which is based only to a small
extent on technical factors) have to do with the decision by the POWER4
microarchitects that was being discussed, and your argument for that
decision? Nothing that I can see.

Jan

Del Cecchi

Sep 6, 2002, 8:44:51 AM

In article <3D7874FB...@mediasec.de>,

The time of market introduction is determined by the technology chosen for the
chip and the time to design said chip. In the case of POWER4 it was pretty much
simultaneous in that the design got done and verified at about the same time the
process folk were able to make enough to be able to sell them.

These two limits went together with the performance characteristics of the
selected process to drive the decisions, or at least that is the way it is
normally done. So the decision could have been done because it was easier to
make the desired timing, or because it would have limited the frequency in the
chosen process, and the performance impact was small.

It is hard to swallow sometimes when a chip that one's team has worked on is
dissed by unfair comparison to a later chip or to vapor. BTW I had nothing to do
with POWER4 chip except as a spectator.

I have heard so much over the years about Alpha and wondrous performance that I
lose track. What Alpha processors is HPQ actually shipping in systems to
customers in 2002? I know I could go look at the web site but it's Friday
Morning and I have a meeting coming up and I'm too lazy.

del cecchi
--

Del Cecchi
cec...@us.ibm.com
Personal Opinions Only

patte...@skipthis.sbcglobal.net

Sep 7, 2002, 1:36:42 AM

> Doesn't the P4 gain and lose by its double pumped ALU? Dependent adds may
> have single cycle throughput but multibit shifts are slower than
> single-cycle (1.5 cycles?)?

Willamette's basic bitslice approach,
which has been discussed in Microprocessor Report,
might be able to support
left shifts at 1-cycle latency --- although shifters are big,
and the more stuff you stick in there, the slower it gets.

Right shifts flow against the bitslicing, so are intrinsically
slower on such a datapath.

Deciding to make them all uniformly slower may just have
been a design simplification.

---

If you use redundant arithmetic to get fast ALU dependencies,
shifts are intrinsically slower by a factor related to the shift width.

Robert Harley

Sep 8, 2002, 7:58:09 AM

Anil T Maliyekke <ama...@icarus.cc.uic.edu> writes:

> [...] Pentium 4 performance, while being the performance leader on
> SPECInt2000 [...]

The highest SPECint2000 score reported at www.specbench.org is 928 for
the 1250 MHz Alpha.

Bye,
Rob.

David T. Wang

Sep 8, 2002, 12:09:03 PM

Robert Harley <har...@estephe.inria.fr> wrote:

It's not on SPEC's web site, but Intel is reporting 976 for the 2.8 GHz
P4 on its own web site.

http://www.intel.com/performance/resources/desktop/charts.htm#SPECint

--

ddaavveewwaanngg@@wwaamm..uummdd..eedduu

Norbert Juffa

Sep 13, 2002, 1:19:56 PM

"David T. Wang" <f...@bar.invalid> wrote in message news:alfsmv$sc1$1...@grapevine.wam.umd.edu...

> Robert Harley <har...@estephe.inria.fr> wrote:
[...]
> > The highest SPECint2000 score reported at www.specbench.org is 928 for
> > the 1250 MHz Alpha.
>
> It's not on SPEC's web site, but Intel is reporting 976 for the 2.8 GHz
> P4 on its own web site.
>
> http://www.intel.com/performance/resources/desktop/charts.htm#SPECint

The official SPEC results for the 2.8GHz Intel P4 are now at

http://www.spec.org/osg/cpu2000/results/res2002q3/

SPEC has a 2 week review process for all submissions which no doubt caused
the delay between the publication on Intel's website and the SPEC website.

-- Norbert