The question is simple: as IBM gets more aggressive with POWER4 and
its successors, is there any bottleneck inherent in the POWER
architecture that prevents it from matching or exceeding the Alpha
architecture in ISA and CPU-core elegance, scalability, ultimate
speed, etc.? I don't want answers like "well, Alpha is not that fast
these days anyway", since that has more to do with Compaq gradually
choking its funding all these years...
In particular, I'd like to see dual EV8-like cores with combined
SMT/DMT and CMP in the future POWER5, as well as those famed Herman
(population count) instructions that seem to benefit Alpha in quite a
few supercomputing apps...
Before that, POWER4+ could maybe get more direct links (like EV7) to
connect directly to at least four other POWER4+ CPUs to enable up to
128-way SMP...
Novba
The first POWER4 chips have direct chip-to-chip links on the 4-chip MCM
to provide an 8-processor SMP on a single MCM. Check out the
Microprocessor Forum and Hot Chips presentations.
del cecchi
> The first POWER4 chips have direct chip-to-chip links on the 4-chip MCM
> to provide an 8-processor SMP on a single MCM. Check out the
> Microprocessor Forum and Hot Chips presentations.
What is the interconnect between MCMs? How is I/O hooked up?
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Check out http://mdronline.com/mpr/h/2000/1120/144703.html.
Each MCM obviously has 4 MCM-to-MCM 64-bit switches/buses operating at
~4 GB/s, allowing up to 4 MCMs to be interconnected in a "ring topology"
for a total of 32 CPUs.
On top of that, each chip has its own 128-bit ~10 GB/s full-duplex memory
bus designed to interface with the special external L3 cache and main
memory. It is unclear from the referenced document whether that bus is
shared among the 4 chips in an MCM or not.
For I/O, each chip is also fitted with a 32-bit ~2 GB/s full-duplex "GX"
bus, used for I/O and also for clustering with external systems in a NUMA
arrangement.
--
Christer Palm
I can't complain about that!
>The question is simple: as IBM gets more aggressive with POWER4 and
>its successors, is there any bottleneck inherent in the POWER
>architecture that prevents it from matching or exceeding the Alpha
>architecture in ISA and CPU-core elegance, scalability, ultimate
>speed, etc.?
The 64-bit PowerPC architecture is a bit richer than the Alpha
ISA, which means that high-performance implementations require
more effort. In POWER4, we have used an implementation route
that allows us to optimize for the simple op-codes and handle
the more complex opcodes with either "cracking" (in-line expansion
into a small number of more primitive ops) or simple microcoding.
(This was disclosed in MicroProcessor Report Volume 13, Number 13,
October 6, 1999.) Once you move the more complex ops off of the
critical path, we have about the same challenges as a simpler ISA
at achieving high frequency operation while keeping the pipeline
as short as possible.
>In particular, I'd like to see dual EV8-like cores with combined
>SMT/DMT and CMP in the future POWER5, as well as those famed Herman
>(population count) instructions that seem to benefit Alpha in quite a
>few supercomputing apps...
I can't comment on POWER5 (except that it will be "real good"),
but I would be really surprised if pop count were useful in "quite
a few" supercomputing applications. I know of only one customer
that is very interested in that operation, and have only found
one use for it in my own experience..... (It is a very easy way
to determine if an integer is a power of two -- though I don't
usually care about how long it takes to make this determination.)
>Before that, POWER4+ could maybe get more direct links (like EV7) to
>connect directly to at least four other POWER4+ CPUs to enable up to
>128-way SMP...
The POWER4 "building block" is a 4-chip (8-cpu) module. We have
already disclosed in public that we will be putting up to four of
these modules together to make up to 32-cpu systems.
More good stuff is coming along as we continue development of the
product line -- some with POWER4, some with POWER5, and some with
POWER6. Of course, I can't talk about any of that here.....
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."
The GX bus in turn is converted into multiple RIO links, which can drive cables
to other towers. RIO is a 500 MB/second/direction source synchronous
interconnect. A link has a byte-wide path in each direction. The RIO links can
be bridged to the I/O of your choice, so you can probably get microchannel if you
want it. :-) Mostly PCI is expected.
--
Del Cecchi
cecchi@rchland
> The 64-bit PowerPC architecture is a bit richer than the Alpha
> ISA, which means that high-performance implementations require
> more effort. In POWER4, we have used an implementation route
> that allows us to optimize for the simple op-codes and handle
> the more complex opcodes with either "cracking" (in-line expansion
> into a small number of more primitive ops) or simple microcoding.
> (This was disclosed in MicroProcessor Report Volume 13, Number 13,
> October 6, 1999.) Once you move the more complex ops off of the
> critical path, we have about the same challenges as a simpler ISA
> at achieving high frequency operation while keeping the pipeline
> as short as possible.
Doesn't this mean that POWER/PPC as a RISC ISA (= one where every
instruction is speed critical) has failed when old CISC tricks
(microcoding; uops) are needed again?
-Andi
Perhaps. Examples of a successful RISC architecture would be
interesting :-)
Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679
>Doesn't this mean that POWER/PPC as a RISC ISA (= one where every
>instruction is speed critical) has failed when old CISC tricks
>(microcoding; uops) are needed again?
That's a funny definition of RISC. Regardless, POWER isn't the only
RISC ISA to have deleted/deprecated/no-longer-single-cycle
instructions. Doing them with uops or microcode is certainly faster
than emulating them in the OS or an approach like PALcode. Faster, and
more expensive in silicon.
g
> I can't comment on POWER5 (except that it will be "real good"),
> but I would be really surprised if pop count were useful in "quite
> a few" supercomputing applications. I know of only one customer
> that is very interested in that operation, and have only found
> one use for it in my own experience..... (It is a very easy way
> to determine if an integer is a power of two -- though I don't
> usually care about how long it takes to make this determination.)
It's not needed for that -- (x & (x-1)) is zero IFF x is a power of two
(or zero). It works by clearing the least significant set bit.
I'm sure you knew this.
-- Bruce
> Doesn't this mean that POWER/PPC as a RISC ISA (= one where every
> instruction is speed critical) has failed when old CISC tricks
> (microcoding; uops) are needed again?
With instructions such as load/store multiple, it clearly never even
tried for 100% purity.
That doesn't make it the moral equivalent of x86 or something though.
-- Bruce
> Perhaps. Examples of a successful RISC architecture would be
> interesting :-)
e.g. ARM
-Andi
Ouch -- that sort of bit twiddling makes my poor Fortran brain
hurt. I hate it when I run across coding like this.
>I'm sure you knew this.
I did not know it, but I am trying to forget it as fast as I can!
I can only hope that the PowerPC architecture fails as
spectacularly as the x86 architecture. (I am speaking
financially here, of course....)
> In article <bruce-AD55D9....@news.akl.ihug.co.nz>,
> Bruce Hoult <br...@hoult.org> wrote:
> >In article <9mgps0$12ee$1...@ausnews.austin.ibm.com>,
> >mcca...@austin.ibm.com wrote:
> >
> >> , and [I] have only found
> >> one use for [pop count] in my own experience..... (It is a very easy
> >> way
> >> to determine if an integer is a power of two -- though I don't
> >> usually care about how long it takes to make this determination.)
> >
> >It's not needed for that -- (x & (x-1)) is zero IFF x is a power of two
> >(or zero). It works by clearing the least significant set bit.
>
> Ouch -- that sort of bit twiddling makes my poor Fortran brain
> hurt. I hate it when I run across coding like this.
I wouldn't want to see it uncommented somewhere, but it's a perfectly
fine thing to have somewhere in an (inline) function called
IsPowerOfTwo().
-- Bruce
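For illustration, such an inline helper might look like the sketch below
(plain C99; the name IsPowerOfTwo and the explicit zero check are
illustrative choices, not anything specified in the thread):

    #include <stdbool.h>
    #include <stdint.h>

    /* x & (x - 1) clears the least significant set bit, so the result is
       zero exactly when x has at most one bit set; exclude x == 0. */
    static inline bool IsPowerOfTwo(uint32_t x)
    {
        return x != 0 && (x & (x - 1)) == 0;
    }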
Why???
Optimized code can indeed make use of a fast way to determine if a
number is a power of two; I've used that particular idiom several times.
popcount() is also useful, mostly when working with bitmaps.
A possible use would be for SQL runtime optimization, when joining
bitmaps to determine the best evaluation order.
I suspect that NSA could use it as a quick way to detect patterns that
significantly differ from purely random bitstreams?
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
Perhaps I should have put a smiley! There are lots of successful
so-called RISCs, but one thing that they all have in common is
heresy. There are lots of ways in which ARM is not a RISC
architecture, even excluding the floating-point (which, as usual,
is as CISC as CISC can be) -- such as not being load/store/operate.
As I have said before, good performance comes from good design.
RISC was initially a response to certain mistakes of the 1970s;
it then became a religion (complete with priests, dogma, schisms
and heresy), and is now a marketing label.
>popcount() is also useful, mostly when working with bitmaps.
>
>A possible use would be for SQL runtime optimization, when joining
>bitmaps to determine the best evaluation order.
>
>I suspect that NSA could use it as a quick way to detect patterns that
>significantly differ from purely random bitstreams?
>
>Terje
Wouldn't popcount(a xor b) give the Hamming distance between two code words?
-George
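For concreteness, a tiny self-contained illustration of that identity
(the loop-based popcount is only for clarity; a hardware popcount
instruction or the bitslice code discussed below would replace it):

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming distance = number of bit positions where a and b differ,
       i.e. popcount(a ^ b). */
    static unsigned hamming32(uint32_t a, uint32_t b)
    {
        uint32_t x = a ^ b;
        unsigned n = 0;
        while (x) {
            x &= x - 1;   /* clear the least significant set bit */
            n++;
        }
        return n;
    }

    int main(void)
    {
        printf("%u\n", hamming32(0xF0u, 0x0Fu));   /* prints 8 */
        return 0;
    }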
Right, this is a specific example of what I mentioned.
Terje
PS. My SQL example isn't really that good, because when working with
_big_ bitmaps, not just one or a few registers worth of bits, there are
algorithms which perform very well even without a bitcount() opcode.
I believe it was Robert Harley who first showed me how you could use
bitslice operations to combine 15 64-bit words into just 4, and then do
the popcount() on each of those in parallel.
The resulting code runs at about 1-2 cycles/word, depending upon how
many operations the cpu can handle per cycle.
So, for the popcount opcode to be a win, it must allow superscalar
execution and at least one result per cycle.
popcount() can also be used to implement count_trailing_zeros(), which in turn
is heavily used in the binary GCD (greatest common divisor) algorithm.
Assuming two's complement integer arithmetic:
count_trailing_zeros(n) == popcount((n & -n) - 1)
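Here is a hedged, self-contained sketch of how the identity above can be
used (illustrative code, not from the original post; assumes 32-bit
unsigned arithmetic):

    #include <stdint.h>
    #include <stdio.h>

    /* Conventional SWAR popcount of a 32-bit word. */
    static unsigned popcount32(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x + (x >> 4)) & 0x0f0f0f0fu;
        return (x * 0x01010101u) >> 24;   /* sum the four byte counts */
    }

    /* count_trailing_zeros via the identity above; undefined for n == 0. */
    static unsigned ctz32(uint32_t n)
    {
        return popcount32((n & (0u - n)) - 1);
    }

    static uint32_t binary_gcd(uint32_t a, uint32_t b)
    {
        unsigned shift;
        if (a == 0) return b;
        if (b == 0) return a;
        shift = ctz32(a | b);      /* common factors of two */
        a >>= ctz32(a);            /* make a odd */
        do {
            b >>= ctz32(b);        /* make b odd */
            if (a > b) { uint32_t t = a; a = b; b = t; }
            b -= a;                /* both odd, so b - a is even (or zero) */
        } while (b != 0);
        return a << shift;
    }

    int main(void)
    {
        printf("gcd(48, 180) = %u\n", binary_gcd(48, 180));   /* prints 12 */
        return 0;
    }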
---
Joe Leherbauer Leherbauer at telering dot at
"Somewhere something incredible is waiting to be known."
-- Isaac Asimov
Terje Mathisen wrote:
> I believe it was Robert Harley who first showed me how you could use
> bitslice operations to combine 15 64-bit words into just 4, and then do
> the popcount() on each of those in parallel.
>
> The resulting code runs at about 1-2 cycles/word, depending upon how
> many operations the cpu can handle per cycle.
| Do you have a detailed description of this algorithm handy?
First, Mea Culpa!
The speed should have been per byte of input, not per word. Ouch. :-(
The algorithm is based on parallel evaluation of full adders.
I.e. a full adder has 3 input bits (a, b, carry_in) and returns two
result bits (sum, carry_out).
This can be implemented with just a few basic logic operations:
temp = a xor b
sum = temp xor carry_in
carry_out = (a and b) or (temp and carry_in)
or in C:
temp = a ^ b;
sum = temp ^ carry_in;
carry_out = (a & b) | (temp & carry_in);
With a normal three-operand cpu, the code above translates into 5
opcodes, taking 2.5 cycles on a dual-issue machine.
So the way to do this using 64 or 128-bit registers is to define macros
to do these basic operations, and then combine them, i.e. to go from 7
inputs to 3 takes these operations:
(x1, x0) = full_add(i0, i1, i2);
(y1, y0) = full_add(i3, i4, i5);
(z1, s0) = full_add(x0, y0, i6);
(s2, s1) = full_add(x1, y1, z1);
This is a total of 4*5=20 instructions, plus 7 load operations.
At this point we have three result words (s2, s1, s0), where each set
bit in s2 is worth two bits in s1 and four in s0.
(Starting with 15 inputs would result in 4 result words, which makes the
final bit counting a little easier to parallelize.)
s0 = (s0 & MASK1) + ((s0 >> 1) & MASK1); // Max count = 2, MASK1 = 55555555...
s1 = (s1 & MASK1) + ((s1 >> 1) & MASK1);
s2 = (s2 & MASK1) + ((s2 >> 1) & MASK1);
12 ops
s0 = (s0 & MASK2) + ((s0 >> 2) & MASK2); // Max count = 4, MASK2 = 33333333...
s1 = (s1 & MASK2) + ((s1 >> 2) & MASK2);
s2 = (s2 & MASK2) + ((s2 >> 2) & MASK2);
12 ops
(There is another way to do these initial operations which is slightly
faster but harder to understand; I believe it dates back to the original
MIT hackers.)
s0 = (s0 + (s0 >> 4)) & MASK4_8; // Max count = 8, MASK4_8 = 0f0f0f0f...
s1 = (s1 + (s1 >> 4)) & MASK4_8;
s2 = (s2 + (s2 >> 4)) & MASK4_8;
9 ops
s0 = s0 + (s0 >> 8) +
     (s1 << 1) + (s1 >> 7) +
     (s2 << 2) + (s2 >> 6); // Max count = 16+32+64 = 112
10 ops
s0 = (s0 + (s0 >> 16)) & MASK8_32; // Max count = 224, MASK8_32 = 000000ff000000ff
3 ops
count = (s0 + (s0 >> 32)) & 511; // Max count = 448 (7 * 64)
3 ops.
The total seems to be 76 instructions to count the bits set in 56 bytes;
working with 120 input bytes gets even closer to 1 instruction/byte, but
this is still 2-4 times slower than a single-cycle popcount() opcode.
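For readers who want something they can compile, here is a self-contained
sketch of the same full-adder idea (illustrative code, not Terje's exact
routine: the three partial sums are simply run through a conventional SWAR
popcount and weighted 1/2/4, rather than merged with the masking steps
shown above):

    #include <stdint.h>
    #include <stdio.h>

    /* Conventional SWAR popcount of one 64-bit word. */
    static uint64_t popcount64(uint64_t x)
    {
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
        return (x * 0x0101010101010101ULL) >> 56;   /* sum the byte counts */
    }

    /* Bitwise full adder: in every bit position, sum + 2*carry = a + b + c. */
    static void full_add(uint64_t a, uint64_t b, uint64_t c,
                         uint64_t *sum, uint64_t *carry)
    {
        uint64_t t = a ^ b;
        *sum   = t ^ c;
        *carry = (a & b) | (t & c);
    }

    /* Count the set bits in exactly seven 64-bit words: reduce 7 -> 3. */
    static uint64_t popcount7(const uint64_t w[7])
    {
        uint64_t x0, x1, y0, y1, z1, s0, s1, s2;
        full_add(w[0], w[1], w[2], &x0, &x1);
        full_add(w[3], w[4], w[5], &y0, &y1);
        full_add(x0,   y0,   w[6], &s0, &z1);
        full_add(x1,   y1,   z1,   &s1, &s2);
        /* Each bit of s0 is worth 1, of s1 worth 2, of s2 worth 4. */
        return popcount64(s0) + 2 * popcount64(s1) + 4 * popcount64(s2);
    }

    int main(void)
    {
        uint64_t w[7] = { 0xFFULL, 1, 0x8000000000000000ULL,
                          0, 0xF0F0ULL, 3, ~0ULL };
        printf("%llu bits set\n", (unsigned long long)popcount7(w)); /* 84 */
        return 0;
    }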
> McCalpin wrote:
> >
> > In article <bruce-AD55D9....@news.akl.ihug.co.nz>,
> > Bruce Hoult <br...@hoult.org> wrote:
> > >In article <9mgps0$12ee$1...@ausnews.austin.ibm.com>,
> > >mcca...@austin.ibm.com wrote:
> > >
> > >> , and [I] have only found
> > >> one use for [pop count] in my own experience..... (It is a very easy way
> > >> to determine if an integer is a power of two -- though I don't
> > >> usually care about how long it takes to make this determination.)
> > >
> > >It's not needed for that -- (x & (x-1)) is zero IFF x is a power of two
> > >(or zero). It works by clearing the least significant set bit.
> >
> > Ouch -- that sort of bit twiddling makes my poor Fortran brain
> > hurt. I hate it when I run across coding like this.
> >
> > >I'm sure you knew this.
> >
> > I did not know it, but I am trying to forget it as fast as I can!
>
> Why???
>
> Optimized code can indeed make use of a fast way to determine if a
> number is a power of two, I've used that particular idiom several times.
I think the main thing is to ensure that things like this are wrapped in a
macro rather than written directly in the code. Not only does that make the
code much easier to read, but it allows one to restructure the macro if
appropriate for a new platform. (Unlikely for this example, but it can
matter in other situations, e.g. when one moves from big- to little-endian.)
Maynard
SPARC still needs OOE. Don't hold your breath... except that Sun can Market!
[Somebody asked for "real" RISCs]
>> SPARC, PowerPC ( macs, RS/6000s - pSeries , embedded apps ) ,
> SPARC still needs OOE.
Uh, is that really a necessity? Didn't all early RISC chips* come
without?
Or were you thinking in the context of the Subject?
Personally, I think IBM may grab a big part of the supercomputer stuff
and the VMS segment. But for the compute clusters and render farms, I
expect the spoils to go to AMD and Intel, they haven't been far behind
lately, anyway.
I think it's a shame it was Intel that got Alpha, and not AMD.
-kzm
*) Early as in after the terminology was coined, of course. Not as in
stuff predating the invention of microcode.
--
If I haven't seen further, it is by standing in the footprints of giants
Absolutely!
In my own code I've tried to use this kind of stuff inside a macro, i.e.
if (ispowerof2(n)) {
...
}
This also makes it much easier to replace the macro with a function,
something that would be needed if I later needed to switch to a
different (fp?) number format.
BTW, the same test on a single prec fp number is even easier, since it
just needs to verify that the fractional part of the mantissa is zero:
#define float_ispowerof2(n) (((*(t_uint32 *) &n) & 0x007fffff) == 0)
I really MUST remember to put smilies in when being jocular.
To remind people, here is the context:
>Doesn't this mean that POWER/PPC as a RISC ISA (= one where every
>instruction is speed critical) has failed when old CISC tricks
>(microcoding; uops) are needed again?
Perhaps. Examples of a successful RISC architecture would be
interesting :-)
The point is that, if PowerPC has failed as a RISC ISA, it is
unclear whether ANY others can be said to have succeeded. ARM
includes quite a few non-RISC features, for example.
Sure they did. Ever wonder why Power4 and EV6 were late, though?
Or were you thinking in the context of the Subject?
>
> Personally, I think IBM may grab a big part of the supercomputer stuff
> and the VMS segment. But for the compute clusters and render farms, I
> expect the spoils to go to AMD and Intel, they haven't been far behind
> lately, anyway.
>
> I think it's a shame it was Intel that got Alpha, and not AMD.
C'est true, and a Big Coup for IBM!
Is POWER4 late?
I have only been at IBM for 2 years, so I might have missed
some really early claims, but my understanding was that IBM
has said that POWER4 is/was intended to ship in 2H2001, and
(according to various recent public IBM statements) we still
expect to ship in that time frame.
: #define float_ispowerof2(n) (((*(t_uint32 *) &n) & 0x007fffff) == 0)
Ugh, doesn't it miss all denormalized powers of two?
/Serge.P
---
Home page: http://www.cobalt.chem.ucalgary.ca/ps/
Well, yes.
That's why you might need to replace the macro with a real function,
i.e. if you cannot guarantee that all numbers will be normalized.
One way to do so is to expand the float values to double; the test is
still nearly the same.
Going from double to extended precision isn't nearly so nice, however:
very unportable, and sometimes extremely slow. :-(
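One portable way to write such a "real function" is to lean on frexpf():
any finite, nonzero float -- denormals included -- is m * 2^e with
0.5 <= m < 1, so it is a power of two exactly when m == 0.5. A hedged C99
sketch (illustrative only, and slower than the bit-mask macro):

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool float_is_power_of_two(float x)
    {
        int e;
        if (x <= 0.0f || isinf(x) || isnan(x))  /* also rejects 0, -4.0, +inf */
            return false;
        return frexpf(x, &e) == 0.5f;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               float_is_power_of_two(8.0f),                /* 1 */
               float_is_power_of_two(6.0f),                /* 0 */
               float_is_power_of_two(ldexpf(1.0f, -140))); /* 1: a denormal */
        return 0;
    }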
That's not what some IBM prospects told us the IBM salesfolk
told them -- but of course, it's rather hard to ascertain
if they're telling the truth.
--
<these messages express my own views, not those of my employer>
Alexis Cousein Senior Systems Engineer
SGI Belgium and Luxemburg a...@brussels.sgi.com
I have only been at IBM 33 years and I am pretty sure that Power4 is on
schedule. Anybody want to buy some? They come in a really impressive
box.
del cecchi
> I have only been at IBM 33 years and I am pretty sure that Power4 is on
> schedule. Anybody want to buy some? They come in a really impressive
> box.
Pictures?
Chris
-always tries to get a trip to the machine room to see the boxes
--
Chris Morgan <cm at mihalis.net> http://www.mihalis.net
Temp sig. - Enquire within
IBM now actually ships POWER chips in any other form than for its own
equipment? Or is a *server* what you meant by an impressive box? 8-)
> del cecchi
--
Sander
+++ Out of cheese error +++
IBM does not typically sell its POWER processors apart
from systems.
The Regatta server is an impressive box.
How about the price? Is it impressive too?
If you are interested in PowerPC boxes and clustering, try:
http://www.terrasoftsolutions.com/products/briQ/
"del cecchi" <dce...@msn.com> wrote in message
news:Zbfl7.476$V5....@eagle.america.net...
Sorry, they don't let me into the lab with a camera.
--
Del Cecchi
cecchi@rchland
Except:
* none of the companies listed as European resellers appear to really
  be resellers (they only resell YDL)
* they don't sell a bare-bones version 8-(
Otherwise, it does look quite exciting.
That would be "photograph", which is only one kind of "picture".
Another kind is "ascii art" :-)
This was on CNN.com (edited to remove blather) (from august 28)
Feds buy IBM supercomputer to model climate
A U.S. government research laboratory said Monday it bought a faster
supercomputer from International Business Machines Corp. for $20 million
to $30 million, which should help it more accurately predict long-range
global and regional climate changes.
The new supercomputer for the Oak Ridge National Laboratory in Tennessee will be
four times faster than the lab's current fastest supercomputer, also from IBM.
Nicknamed Cheetah, the computer can make four trillion calculations per second, or
four teraflops
...
The new machine is powered by IBM's Power4 microprocessor, which will debut
in IBM servers, code-named Regatta, this fall.
The laboratory and IBM would not be more specific about the price of the
computer, which will be installed starting in September.
--
Del Cecchi
cecchi@rchland
It comes in an impressive box that functions as a server if you don't
want to open it and remove the contents. Of course if you just take the
processor chips out, it is kind of an expensive shipping container. :-)
del
...
> A U.S.
> government research laboratory said Monday it bought a faster
> supercomputer from International Business Machines Corp. for $20 million
> to $30 million, which should help it more accurately predict long-range global
> and regional climate changes.
> Nicknamed Cheetah, the computer can make four trillion calculations per second, or
> four teraflops
....
> The new machine is powered by IBM's Power4 microprocessor, which will debut
> in IBM servers, code-named Regatta, this fall.
>
> Del Cecchi
> cecchi@rchland
Strange... I wonder why they didn't buy from Compaq? :)
Peter Boyle
I don't think Sun'll continue on the POWER track either :-).
Whether SPARC is a good enough architecture to take in
the gigahertz range isn't the issue. If the 386 can go giga,
anything can. That leaves the question whether the current
Sun/TI combo can produce credible multi-giga processors.
Hmmm...
--
Stefaan
--
Please visit our Webster http://xxxxxxxx.xxxx.xxx, write or e-mail to X&x
promptly,if you are interested.And X&x shall be pleased to render you any
further services. -- Spam from China
The 386 certainly hasn't "gone giga" -- the fact that a design implementing
the same ISA has doesn't change that; I doubt the 386 is pipelined deeply
enough to be run at GHz rates, regardless of process.
The Pentium 4 article by Glenn Hinton, et al., in the Intel Technical
Journal, Q1, 2001 (http://developer.intel.com/technology/itj/q12001.htm)
indicates that the number of gate levels per pipeline stage stayed the
same from the 286 to the Pentium (Figure 2). The P6 core reduced this
number, and the P4P core reduced it further to run in the multiple GHZ
range.
(derived from figure)

        "relative frequency      approx. gate levels per pipe stage
         if same Si process"     (ignoring latch overhead, clock skew)
  286           1                              N
  386           1                              N
  486           1                              N
  P5            1                              N
  P6            1.5                          2/3 N
  P4P           2.5                          2/5 N (or perhaps 1/3 N + skew?)
I seem to recall a demo of a 4 GHz P4, so that would seem to indicate
that you could build a 1.6 GHz 386 ;-)
(seriously - perhaps someone knows the exact number of gate levels?)
--
Mark Smotherman, Computer Science Dept., Clemson University, Clemson, SC
http://www.cs.clemson.edu/~mark/homepage.html
Granted, but (as Intel acknowledges by abandoning the 386
model for their 64-bitter), the ISA isn't exactly optimal;
SPARC seems (at least to this programmer) a better design.
Maybe those who master the intricacies of turning ISAs into
silicon could comment on the merits or demerits of SPARC as
a gigahertz processor...
The register stack may give it the collywobbles.
--
`-_-' In hoc signo hack, Peter da Silva.
'U` "A well-rounded geek should be able to geek about anything."
-- nic...@esperi.org
Disclaimer: WWFD?