Anil
Throughput is single-cycle for almost all operations in the
integer ALUs, but latency is two cycles for dependent
operations. The two cycle latency holds for operations in the
same ALU or in the other ALU.
If I recall correctly, the Alpha EV6 added the extra cycle only
if the dependent operations were in different ALU's. We did not
see enough benefit to make that optimization (not to mention that
it would be harder on POWER4 because of the higher frequency).
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."
If we're talking about the same issue, then this is because each ALU set
has a register set associated with it, and there's a one-cycle latency for
writes to one set to propagate to the other. OTOH, the issue unit tries,
IIRC, to avoid scheduling dependent instructions to different sets.
Jan
McCalpin wrote:
> Throughput is single-cycle for almost all operations in the
> integer ALUs, but latency is two cycles for dependent
> operations. The two cycle latency holds for operations in the
> same ALU or in the other ALU.
>
> If I recall correctly, the Alpha EV6 added the extra cycle only
> if the dependent operations were in different ALU's.
Extra cycle if crossing cluster boundary, not if from ALU within
the same cluster. The way code is scheduled for EV6 and instructions
issued to ALUs ensures a low occurrence of stall cycles from cross
cluster dependencies. This is all described pretty well in the EV6
compiler writer's guide.
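
A minimal sketch of the difference being described, under a deliberately simplified model: one-cycle base ALU latency plus one extra cycle whenever a dependent operation issues in the other cluster (EV6-style), versus a uniform two-cycle latency (POWER4-style). The function names and cluster assignments are hypothetical, chosen only for illustration.

```python
def ev6_chain_cycles(clusters):
    """Cycles until the last result of a dependent chain is available.

    clusters[i] is the cluster that op i issues to. Each op takes one cycle,
    and each hop to a different cluster adds one cycle (simplifying assumption).
    """
    crossings = sum(1 for a, b in zip(clusters, clusters[1:]) if a != b)
    return len(clusters) + crossings


def uniform_chain_cycles(n_ops, latency=2):
    """Same chain when every dependent op sees the same fixed latency."""
    return n_ops * latency


same_cluster = ev6_chain_cycles([0, 0, 0, 0])   # chain kept in one cluster
ping_pong = ev6_chain_cycles([0, 1, 0, 1])      # worst case: every hop crosses
power4_like = uniform_chain_cycles(4)           # uniform two-cycle latency

print(same_cluster, ping_pong, power4_like)
```

In this toy model the cross-cluster penalty only matters when the issue logic fails to keep a chain inside one cluster.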
> We did not
> see enough benefit to make that optimization (not to mention that
> it would be harder on POWER4 because of the higher frequency).
You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
bulk?
--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
pde...@igs.net architectures with MIPSed results but ALPHA's well
that ends well.
> You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> bulk?
The 1.25GHz EV6 is in *0.18um* ?? Somehow, the phrase "talk about
misplaced effort" comes to mind; I was fairly sure Samsung had a
mature 0.13u process they could use. Then again, I've been expecting
the process shrink since the 833MHz model -- or was that a
tour-de-force of 0.25u?
Tom
> If I recall correctly, the Alpha EV6 added the extra cycle only
> if the dependent operations were in different ALU's. We did not
> see enough benefit to make that optimization (not to mention that
> it would be harder on POWER4 because of the higher frequency).
POWER4's frequency isn't that much higher, and other CPU vendors
have gotten higher clockspeeds while maintaining 1 cycle latencies
for integer operations. If there aren't any instructions that can provide
results to dependent instructions in one cycle, it seems to me that
you could pretty much consider POWER4 a 650MHz 10-way decode, 16-way
issue processor. I'm surprised that POWER4 is competitive on integer
benchmarks. POWER4 OoO capabilities must be allowing the CPU to exploit
significant ILP.
Anil
It does sound rather like the P4's double clocked ALU in reverse, doesn't
it? Imagine the fun the PC versus Mac advocacy groups will have if Apple
ships machines using the IBM single chip POWER4 variant, with the PC crowd
claiming that if the POWER4 is claimed to run at 1.3GHz, the P4 is really
running at 5.6GHz if you measure in the same way, and the Mac people coming
back with arguments about a 4x better efficiency per cycle :)
--
Douglas Siebert dsie...@excisethis.khamsin.net
A good friend will help you move, a true friend will help you move a body.
AFAIK the Pentium 4 can do two sets of two _dependent_ adds per cycle,
so if you just take a chain of dependent adds as metric, it's 0.65GHz
vs. 5.6GHz (or, in 0.18u, 4GHz).
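
The arithmetic behind those figures can be sketched as follows, assuming a chain of dependent adds completes one add per ALU latency. The 2.8GHz and 2.0GHz Pentium 4 clocks are assumptions inferred from the 5.6GHz and 4GHz figures with a half-cycle add latency; the function name is hypothetical.

```python
def dependent_chain_rate_ghz(clock_ghz, latency_cycles):
    """Dependent adds completed per nanosecond along a single chain."""
    return clock_ghz / latency_cycles


power4 = dependent_chain_rate_ghz(1.3, 2.0)    # two-cycle ALU latency
p4_013u = dependent_chain_rate_ghz(2.8, 0.5)   # half-cycle latency, assumed 2.8GHz part
p4_018u = dependent_chain_rate_ghz(2.0, 0.5)   # half-cycle latency, assumed 2.0GHz part

print(f"POWER4 {power4:.2f} GHz, P4 {p4_013u:.1f} GHz, P4 (0.18u) {p4_018u:.1f} GHz")
```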
> and the Mac people coming
>back with arguments about a 4x better efficiency per cycle :)
Right: the Power4 does 64-bit operations, and the Pentium4
only 32-bit operations:-)
I remember reading a posting (or paper) where the author reported
leaving out result forwarding to reach the 25MHz target speed for
a single-issue in-order CPU; without forwarding, all ALU instructions
had two cycles of latency. He commented that the CPU would have run
faster with forwarding (and single-cycle ALU instruction latency)
at 20MHz (which was the other alternative, but the contract specified
25MHz).
For a highly superscalar processor, I would expect latency to cost
even more (there is only so much ILP to go around). It would be
interesting to see a paper on the options the Power4 people looked at,
and what their simulations predicted for them. Or maybe I should just
check out how much SimpleScalar slows down with an ALU latency of 2:-).
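
As a stand-in for that experiment, here is a toy model (not SimpleScalar, and not a model of any real pipeline): a greedy scheduler that issues each ALU op in the earliest cycle where its operands are ready and an issue slot is free. The function name, the issue width of 2, and the two example dependency graphs are all assumptions for illustration.

```python
def schedule_length(deps, latency, width):
    """Cycles until the last result is available.

    deps[i] lists the earlier ops whose results op i needs.
    Assumes an unlimited scheduling window and `width` issue slots per cycle.
    """
    ready_at = []          # cycle at which each op's result becomes available
    issued_in_cycle = {}   # cycle -> number of ops already issued in it
    for producers in deps:
        cycle = max((ready_at[p] for p in producers), default=0)
        while issued_in_cycle.get(cycle, 0) >= width:
            cycle += 1     # slot taken, try the next cycle
        issued_in_cycle[cycle] = issued_in_cycle.get(cycle, 0) + 1
        ready_at.append(cycle + latency)
    return max(ready_at, default=0)


# One chain of 8 dependent ops: no ILP at all.
chain = [[]] + [[i - 1] for i in range(1, 8)]
# Four interleaved, independent chains of 4 ops each: plenty of ILP.
four_chains = [[] if i < 4 else [i - 4] for i in range(16)]

for name, graph in (("single chain", chain), ("four chains", four_chains)):
    one = schedule_length(graph, latency=1, width=2)
    two = schedule_length(graph, latency=2, width=2)
    print(f"{name}: {one} cycles at latency 1, {two} cycles at latency 2")
```

In this model the single chain doubles in length with two-cycle latency, while the four interleaved chains lose only one cycle, so the cost depends almost entirely on how much independent work is available.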
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
> AFAIK the Pentium 4 can do two sets of two _dependent_ adds per cycle,
> so if you just take a chain of dependent adds as metric, it's 0.65GHz
> vs. 5.6GHz (or, in 0.18u, 4GHz).
I just have to wonder what did Intel gain from these extremely fast
ALUs. Pentium 4 performance, while being the performance leader on
SPECInt2000 (though that is probably because no one else besides AMD is
shipping 0.13u parts), just doesn't seem that high when you consider the
fast ALUs/high clock speeds and the fast, low latency L1s.
Anil
> I just have to wonder what did Intel gain from these extremely fast
> ALUs. Pentium 4 performance, while being the performance leader on
> SPECInt2000 (though that is probably because no one else besides AMD is
> shipping 0.13u parts), just doesn't seem that high when you consider the
> fast ALUs/high clock speeds and the fast, low latency L1s.
There's a few more tens of points of performance if you run the same piece
of silicon in SMT mode. It won't help the SPEC INT score, but it does
suggest that there are more than enough idle cycles through the execution
units including the ALU to sustain the execution of a second thread. So
perhaps the fast ALU did contribute to better throughput, even if it didn't
contribute to higher (single threaded) peak performance.
--
ddaavveewwaanngg@@wwaamm..uummdd..eedduu
I have seen data to suggest that the half cycle integer ALUs in
the P4 increase performance on SPECint95 and SPECint2k by about
2.5 to 3% compared to a standard one cycle latency design.
Other studies suggest the POWER4's two cycle integer latency costs
it about 4 to 5% lower performance on a SPECint2k type workload
compared to one cycle latency although it would likely be sensitive
to how well the compiler schedules code.
Is this at the same clock speed or does it include the possible reduction in
clock speed required to achieve single cycle latency (more bypassing, single
cycle adders etc)?
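
A rough sketch of the trade-off in question, assuming the 4 to 5% figure is measured at equal clock speed and that performance scales as clock times per-cycle throughput. Both are assumptions, and the function name is hypothetical.

```python
def break_even_clock_ratio(two_cycle_loss):
    """Clock ratio f_single / f_two at which both designs perform equally.

    two_cycle_loss is the fractional per-clock performance lost to
    two-cycle ALU latency (for example 0.04 for 4%).
    """
    # two-cycle design: f_two * (1 - loss); single-cycle design: f_single * 1
    return 1.0 - two_cycle_loss


for loss in (0.04, 0.05):
    ratio = break_even_clock_ratio(loss)
    print(f"{loss:.0%} loss: single-cycle design must reach {ratio:.0%} of the clock")
```

Under these assumptions the single-cycle design only pays off if it costs less than about 4 to 5% in clock speed.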
Doesn't the P4 gain and lose by its double pumped ALU? Dependent adds may
have single cycle throughput but multibit shifts are slower than
single-cycle (1.5 cycles?)?
Peter
Although integer functional unit dependency chains do exist, they
are a small part of the performance equation for most applications.
I mean 1.3 GHz in volume in December 2001 for the POWER4,
compared to 1.25 GHz in August 2002 for the EV6. Eight months
is a big deal in this business.
Of course Paul knew that, but he does not seem to worry about
little details like reality when promoting his beloved Alpha.
> In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
> >
> >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> >bulk?
>
> I mean 1.3 GHz in volume in December 2001 for the POWER4,
> compared to 1.25 GHz in August 2002 for the EV6. Eight months
> is a big deal in this business.
Not for the Alpha. That was the problem! :(
Peter
> Of course Paul knew that, but he does not seem to worry about
> little details like reality when promoting his beloved Alpha.
>
> --
> John D. McCalpin, Ph.D. mcca...@austin.ibm.com
> Senior Technical Staff Member IBM POWER Microprocessor Development
> "I am willing to make mistakes as long as
> someone else is willing to learn from them."
>
Peter Boyle pbo...@physics.gla.ac.uk
I had a scan for 21264 papers... 97/98 500MHz... All us fanboys
had such high hopes too (fed by our imaginations rather than Digital
Err Compaq to be fair).
*cries into his Duvel*
Cheers,
Rupert
McCalpin wrote:
>
> In article <3D74DA70...@igs.net>, Paul DeMone <pde...@igs.net> wrote:
> >
> >McCalpin wrote:
> >
> >> We did not
> >> see enough benefit to make that optimization (not to mention that
> >> it would be harder on POWER4 because of the higher frequency).
> >
> >You mean 1.3 GHz in 0.18 um SOI as opposed to 1.25 GHz in 0.18 um
> >bulk?
>
> I mean 1.3 GHz in volume in December 2001 for the POWER4,
> compared to 1.25 GHz in August 2002 for the EV6. Eight months
> is a big deal in this business.
Must have been their choice of foundry. :-)
Just kidding. Compaq was showing shmoo's of first pass Cu EV68s
running over 1.4 GHz more than 18 months ago. Why the lag to
market? Who knows but I bet Bill has a theory.
>
> Of course Paul knew that, but he does not seem to worry about
> little details like reality when promoting his beloved Alpha.
LOL, you have mistaken me for someone else. I fully accept
that Elvis has left the building...
Your standards of discourse are dropping, John. What does the time of
market introduction of the EV68-1250 (which is based only to a small
extent on technical factors) have to do with the decision by the POWER4
microarchitects that was being discussed, and your argument for that
decision? Nothing that I can see.
Jan
The time of market introduction is determined by the technology chosen for the
chip and the time to design said chip. In the case of POWER4 it was pretty much
simultaneous in that the design got done and verified at about the same time the
process folk were able to make enough to be able to sell them.
These two limits went together with the performance characteristics of the
selected process to drive the decisions, or at least that is the way it is
normally done. So the decision could have been done because it was easier to
make the desired timing, or because it would have limited the frequency in the
chosen process, and the performance impact was small.
It is hard to swallow sometimes when a chip that one's team has worked on is
dissed by unfair comparison to a later chip or to vapor. BTW I had nothing to do
with POWER4 chip except as a spectator.
I have heard so much over the years about Alpha and wondrous performance that I
lose track. What Alpha processors is HPQ actually shipping in systems to
customers in 2002? I know I could go look at the web site but it's Friday
Morning and I have a meeting coming up and I'm too lazy.
del cecchi
--
Del Cecchi
cec...@us.ibm.com
Personal Opinions Only
Willamette's basic bitslice approach,
which has been discussed in Microprocessor Report,
might be able to support
left shifts at 1-cycle latency --- although shifters are big,
and the more stuff you stick in there, the slower it gets.
Right shifts flow against the bitslicing, so are intrinsically
slower on such a datapath.
Deciding to make them all uniformly slower may just have
been a design simplification.
---
If you use redundant arithmetic to get fast ALU dependencies,
shifts are intrinsically slower by a factor related to the shift width.
The highest SPECint2000 score reported at www.specbench.org is 928 for
the 1250 MHz Alpha.
Bye,
Rob.
It's not on SPEC's web site, but Intel is reporting 976 for the 2.8 GHz
P4 on its own web site.
http://www.intel.com/performance/resources/desktop/charts.htm#SPECint
--
ddaavveewwaanngg@@wwaamm..uummdd..eedduu
The official SPEC results for the 2.8GHz Intel P4 are now at
http://www.spec.org/osg/cpu2000/results/res2002q3/
SPEC has a 2 week review process for all submissions which no doubt caused
the delay between the publication on Intel's website and the SPEC website.
-- Norbert