
ATTACK OF KILLER MICROS


John D. McCalpin

Oct 14, 1989, 3:07:25 PM
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov writes:

>ma...@mips.com pointed out some important considerations in the issue
>of whether supercomputers as we know them will survive. I thought
>that I would attempt to get a discussion started. Here is a simple
>fact for the mill, related to the question of whether or not machines
>delivering the fastest performance at any price have room in the
>market.

>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.

>bro...@maddog.llnl.gov, bro...@maddog.uucp

This much has been fairly obvious for a few years now, and was made
especially clear by the introduction of the MIPS R-3000 based machines
at about the beginning of 1989. I think that this point is irrelevant
to the more appropriate purpose of supercomputers, which is to run
long (or large), compute-intensive problems that happen to map well
onto available architectures.

Both factors (memory/time and efficiency) are important here. It is
generally not necessary to run short jobs on supercomputers, and it is
not cost-effective to run scalar jobs on vector machines. On the
other hand, I have several codes that run >100 times faster on the
ETA-10G relative to a 25 MHz MIPS R-3000. Since I need to run these
codes for hundreds of ETA-10G hours, the equivalent time on the
workstation is over one year.

The introduction of vector workstations (Ardent & Stellar) changes
these ratios substantially. The ETA-10G runs my codes only 20 times
faster than the new Ardent Titan. In this environment, the important
question is, "Can I get an average of more than 1.2 hours of
supercomputer time per day". If not, then the Ardent provides better
average wall-clock turnaround.
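
(A quick check of that 1.2-hour figure, as a small C sketch: 24 wall-clock
hours on a dedicated workstation divided by the 20x ratio quoted above. The
only number assumed here is the 24-hour day.)

    #include <stdio.h>

    /* Break-even turnaround: if the shared supercomputer runs a code
     * `speedup` times faster than a dedicated workstation, how many
     * supercomputer hours per day match the workstation's 24-hour day?
     * The 20x figure is the ETA-10G vs. Ardent Titan ratio quoted above. */
    int main(void)
    {
        double speedup = 20.0;            /* supercomputer / workstation */
        double workstation_hours = 24.0;  /* dedicated machine, all day  */
        double breakeven = workstation_hours / speedup;

        printf("Need more than %.1f supercomputer hours/day to win\n",
               breakeven);                /* prints 1.2                  */
        return 0;
    }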

It seems to me that the introduction of fast scalar and vector
workstations can greatly enhance the _important_ function of
supercomputers --- which is to allow the calculation of problems that
are otherwise too big to handle. By removing scalar jobs and vector
jobs of short duration from the machine, more resources can be
allocated to the large calculations that cannot proceed elsewhere.

Enough mumbling....
--
John D. McCalpin - mcca...@masig1.ocean.fsu.edu
mcca...@scri1.scri.fsu.edu
mcca...@delocn.udel.edu

bro...@maddog.llnl.gov

Oct 14, 1989, 4:58:54 PM

Robert Colwell

Oct 14, 1989, 10:12:04 PM
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov () writes:
>Fact number 1:
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.
>On scalar codes, commodity microprocessors ARE the fastest machines at
>any price and custom cpu architectures are doomed in this market.

I take my hat off to them, too, because that's no mean feat. But don't
forget that the supercomputers didn't set out to be the fastest machines
on scalar code. If they had, they'd all have data caches, non-interleaved
main memory, and no vector facilities. What the supercomputer designers
are trying to do is balance their machines to optimally execute a certain
set of programs, not the least of which are the LLL loops. In practice
this means that said machines have to do very well on vectorizable code,
while not falling down badly on the scalar stuff (lest Amdahl's law
come to call.)
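
(For reference, this is the Amdahl's law arithmetic behind "not falling down
badly on the scalar stuff." The 90% vectorizable fraction and 20x vector
speedup below are illustrative assumptions, not figures from this thread.)

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction f of the work is
     * accelerated by a factor s and the remaining (1 - f) runs at
     * scalar speed.  The numbers below are illustrative only.        */
    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        double f = 0.90;   /* assumed vectorizable fraction */
        double s = 20.0;   /* assumed vector-unit speedup   */

        printf("speedup = %.2f\n", amdahl(f, s));  /* about 6.9, not 20 */
        return 0;
    }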

So while it's ok to chortle at how the micros have caught up on the scalar
stuff, I think it would be an unwarranted extrapolation to imply that the
supers have been superseded unless you also specify the workload.
And by the way, it's the design constraints at the heavy-duty, high
parallelism, all functional-units-going-full-tilt-using-the-entire-memory-
bandwidth that make the price of the supercomputers so high, not the
constraints that predominate at the scalar end. That's why I conclude
that when the micro/workstation guys want to play in the supercomputer
sandbox they'll either have to bring their piggy banks to buy the
appropriate I/O and memory, or convince the users that they can live
without all that performance.

Bob Colwell ..!uunet!mfci!colwell
Multiflow Computer or col...@multiflow.com
31 Business Park Dr.
Branford, CT 06405 203-488-6090

Preston Briggs

Oct 14, 1989, 10:59:35 PM
>The best of the microprocessors now EXCEED supercomputers for scalar
>performance and the performance of microprocessors is not yet stagnant.

Is this a fair statement? I've played some with the i860 and
I can write (by hand so far) code that is pretty fast.
However, the programs where it really zooms are vectorizable.
That is, I can make this micro solve certain problems well;
but these are the same problems that vector machines handle well.

Getting good FP performance from a micro seems to require
pipelining. Keeping the pipe(s) full seems to require a certain amount
of parallelism and regularity. Vectorizable loops work wonderfully well.
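
(The canonical example of such a loop is DAXPY; a minimal C version is below.
Whether a given compiler keeps a micro's pipes full on it is exactly the open
question here.)

    /* DAXPY: y[i] = a*x[i] + y[i] -- the canonical vectorizable loop.
     * Independent iterations and unit-stride accesses are what let a
     * vector unit (or a pipelined FPU) keep its pipes full.          */
    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }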

Perhaps I've misunderstood your intent, though. Perhaps you meant
that an i860 (or Mips or whatever) can outrun a Cray (or Nec or whatever)
on some programs. I guess I'm still doubtful. Do you have examples
you can tell us about?

Thanks,
Preston Briggs

Donald Lindsay

Oct 14, 1989, 11:26:48 PM
Gordon Bell, in the September CACM (p.1095) says, "By the end of
1989, the performance of the RISC, one-chip microprocessor should
surpass and remain ahead of any available minicomputer or mainframe
for nearly every significant benchmark and computational workload.
By using ECL gate arrays, it is relatively easy to build processors
that operate at 200 MHz (5 ns. clock) by 1990." (For those who don't
know, Mr. Bell has his name on the PDP-11, the VAX, and the Ardent
workstation.)

The big iron is fighting back, and that involves reducing their chip
count. Once, a big cpu took ~10^4 chips: now it's more like 10^2. I
expect it will shortly be ~10 chips. Shorter paths, you know.

I see the hot micros and the big iron meeting in the middle. What
will distinguish their processors? Mainly, there will be cheap
systems. And then, there will be expensive ones, with liquid cooling,
superdense packaging, mongo buses, bad yield, all that stuff. Even
when no multichip processors remain, there will still be $1K systems
and $10M systems. Of course, there is no chance that the $10M system
will be uniprocessor.
--
Don D.C.Lindsay Carnegie Mellon Computer Science

Eugene Brooks

Oct 15, 1989, 2:20:48 PM
In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>So while it's ok to chortle at how the micros have caught up on the scalar
>stuff, I think it would be an unwarranted extrapolation to imply that the
>supers have been superseded unless you also specify the workload.
Microprocessor development is not ignoring vectorizable workloads. The
latest have fully pipelined floating point and are capable of pipelining
several memory accesses. As I noted, interleaving directly on the memory
chip is trivial and memory chip makers will do it soon. Micros now dominate
the performance game for scalar code and are moving on to vectorizable code.
After all, these little critters mutate and become more voracious every
6 months and vectorizable code is the only thing left for them to conquer.
No NEW technology needs to be developed, all the micro-chip and memory-chip
makers need to do is to decide to take over the supercomputer market.

They will do this with their commodity parts.


Supercomputers of the future will be scalable multiprocessors made of many
hundreds to thousands of commodity microprocessors. They will be commodity
parts because these parts will be the fastest around and they will be cheap.
These scalable machines will have hundreds of commodity disk drives ganged up
for parallel access. Commodity parts will again be used because of the
cost advantage leveraged into a scalable system using commodity parts.
The only custom logic will be the interconnect which glues the system together,
and error correcting logic which glues many disk drives together into a
reliable high performance system. The CM data vault is a very good model here.
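
(The "error correcting logic" gluing many drives together can be sketched, in
its simplest form, as striping with an XOR parity unit, as below. This is only
an illustration of the idea; the actual CM Data Vault design is more elaborate.)

    #include <stddef.h>

    /* Striping with a single parity unit: parity = XOR of the data
     * blocks.  If any one data block is lost, it can be rebuilt by
     * XOR-ing the parity with the surviving blocks.  Illustrative
     * sketch only, not any particular vendor's design.              */
    void compute_parity(const unsigned char **data, size_t ndrives,
                        size_t blocklen, unsigned char *parity)
    {
        size_t i, d;
        for (i = 0; i < blocklen; i++) {
            unsigned char p = 0;
            for (d = 0; d < ndrives; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }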


NOTHING WILL WITHSTAND THE ATTACK OF THE KILLER MICROS!

bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

Oct 15, 1989, 2:30:11 PM
In article <21...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:
>In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov () writes:
>>The best of the microprocessors now EXCEED supercomputers for scalar
>>performance and the performance of microprocessors is not yet stagnant.
>
>Is this a fair statement? I've played some with the i860 and
Yes, in the sense that a scalar dominated program has been compiled for
the i860 with a "green" compiler, no pun intended, and the same program
was compiled with a mature optimizing compiler on the XMP, and the 40 MHz
i860 is faster for this code. Better compilers for the i860 will open
up the speed gap relative to the supercomputers.

>I can write (by hand so far) code that is pretty fast.
>However, the programs where it really zooms are vectorizable.

Yes, this micro beats the super on scalar code, and is not too sloppy
for hand written code which exploits its cache and pipes well. The
compilers are not there yet for the vectorizable stuff on the i860.
Even if there were good compilers, the scalar-vector speed differential
is not as great on the i860 as it is on a supercomputer. Of course,
interleaved memory chips will arrive and microprocessors will use them.
Eventually the high performance micros will take the speed prize for
vectorizable code as well, but this will require another few years of
development.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

Oct 15, 1989, 2:39:09 PM
In article <65...@pt.cs.cmu.edu> lin...@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>Gordon Bell, in the September CACM (p.1095) says, "By the end of
>1989, the performance of the RISC, one-chip microprocessor should
>surpass and remain ahead of any available minicomputer or mainframe
>for nearly every significant benchmark and computational workload.
It has already happened for SOME workloads, those which hit cache well
and are scalar dominated. This was done without ECL parts. The ECL
parts will only make matters worse for custom processors; as Bell indicates,
they will come to dominate performance for all workloads.

>I see the hot micros and the big iron meeting in the middle. What
>will distinguish their processors?

Nothing.

>Mainly, there will be cheap
>systems. And then, there will be expensive ones, with liquid cooling,
>superdense packaging, mongo buses, bad yield, all that stuff. Even
>when no multichip processors remain, there will still be $1K systems
>and $10M systems. Of course, there is no chance that the $10M system
>will be uniprocessor.

The $10M systems will be scalable systems built out of the same microprocessor.
These systems will probably be based on coherent caches, the micros having
respectable on chip caches which stay in sync with very large off chip
caches. The off chip caches are kept coherent through scalable networks.
The "custom" value added part of the machine for the supercomputer vendor
to design is the interconnect and the I-O system. The supercomputer vendor
will still have a cooling problem on his hands because of the density of
heat sources in such a machine.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Mike Haertel

Oct 15, 1989, 3:24:01 PM
In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>I take my hat off to them, too, because that's no mean feat. But don't
>forget that the supercomputers didn't set out to be the fastest machines
>on scalar code. If they had, they'd all have data caches, non-interleaved
>main memory, and no vector facilities. What the supercomputer designers

Excuse me, non-interleaved main memory? I've always assumed that
interleaved memory could help scalar code too. After all, instruction
fetch tends to take place from successive addresses. Of course if
main memory is very fast there is no point to interleaving it, but
if all you've got is drams with slow cycle times, I would expect
that interleaving them would benefit even straight scalar code.
--
Mike Haertel <mi...@stolaf.edu>
``There's nothing remarkable about it. All one has to do is hit the right
keys at the right time and the instrument plays itself.'' -- J. S. Bach

Eric S. Raymond

Oct 15, 1989, 8:29:13 PM
In <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Yes. And though this is a recent development, an unprejudiced observer could
have seen it coming for several years. I did, and had the temerity to say so
in print way back in 1986. My reasoning then is still relevant; *speed goes
where the volume market is*, because that's where the incentive and development
money to get the last mw-sec out of available fabrication technology is
concentrated.

Notice that nobody talks about GaAs technology for general-purpose processors
any more? Or dedicated Lisp machines? Both of these got overhauled by silicon
microprocessors because commodity chipmakers could amortize their development
costs over such a huge base that it became economical to push silicon to
densities nobody thought it could attain.

You heard it here first:

The supercomputer crowd is going to get its lunch eaten the same way. They're
going to keep sinking R&D funds into architectural fads, exotic materials,
and the quest for ever more ethereal heights of floating point performance.
They'll have a lot of fun and generate a bunch of sexy research papers.

Then one morning they're going to wake up and discover that the commodity
silicon guys, creeping in their petty pace from day to day, have somehow
managed to get better real-world performance out of their little boxes. And
supercomputers won't have a separate niche market anymore. And the
supercomputer companies will go the way of LMI, taking a bunch of unhappy
investors with them. La di da.

Trust me. I've seen it happen before...
--
Eric S. Raymond = er...@snark.uu.net (mad mastermind of TMN-Netnews)

George Seibel

Oct 16, 1989, 7:00:42 AM
In <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov wrote:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Speaking of "commodities", I think a lot of people have lost sight
of, or perhaps never recognized, something about the vast majority of
supercomputers. They are shared. How often do you get a Cray processor
all to yourself? Not very often, unless you have lots of money, or Uncle
Sam is picking up the tab so you can design atomic bombs faster. As
soon as you have more than one job per processor, you're talking about
*commodity Mflops*. The issue is no longer performance at any cost, because
if it was you would order another machine at that point. The important
thing is Mflops/dollar for most people, and that's where the micros are
going to win in a lot of cases.

George Seibel, UCSF

Rob Peglar x615

Oct 16, 1989, 10:39:36 AM

Brooks is making a good point here. By "this market", I assume he means
the one defined above (as well as by mash) - to paraphrase, "the fastest
box at any price". I'll let go what "fastest" and "box" mean for sake
of easy discussion :-) Most of us, I hope, can fathom what price is.

Anyway, I agree with mash that there is - albeit small - a
market for the machine with the highest peak absolute performance
(pick your number, the most popular one recently seems to be Linpack
100x100 all Fortran, Dongarra's Table One). The national labs have proven
that point for almost a generation. I believe that it will take at least
one more generation - those who were weaned on machines from CDC, then CRI -
before a more reasonable approach to machine procurement comes to pass.
Thus, I disagree that there will *always* be a market for this sort of
thing. Status symbols may be OK in cars, but for machines purchased with
taxpayer dollars, the end is near. Hence, Brooks' "attack of the killer
micros".


However, I do believe that there will always be a market for various
types of processors and processor architectures. Killer scalar micros
are finding wide favor as above. Vector supers and their offspring, e.g.
the i860 and other 64-bit things, will always dominate codes which can
be easily vectorized and do not lend themselves well to parallel computation.
Medium-scale OTS-technology machines like Sequent will start (are starting)
to dominate OLTP and RDBMS work, perfect tasks for symmetric MP machines.
(Pyramid, too; hi Chris). Massively parallel machines will eventually
settle into production shops, perhaps running one and only one application,
but running it at speeds that boggle the mind.

It's up to the manufacturers to decide 1) which game they want to play
2) for what stakes 3) with what competition 4) for how long 5) etc.
etc.etc. That's what makes working for a manufacturer such fun and
terror at once.

Rob
------

John D. McCalpin

Oct 16, 1989, 2:16:56 PM
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene
Brooks) writes:

>Microprocessor development is not ignoring vectorizable workloads. The
>latest have fully pipeline floating point and are capable of pipelining
>several memory accesses. As I noted, interleaving directly on the memory

>chip is trivial and memory chip makers will do it soon. [ ... more
> stuff deleted ... ]


> They will do this with their commodity parts.

It is not at all clear to me that the memory bandwidth required for
running vector codes is going to be developed in commodity parts. To
be specific, a single 64-bit vector pipe requires a sustained
bandwidth of 24 bytes per clock cycle. Is an ordinary, garden-variety
commodity microprocessor going to be able to use 6 32-bit
words-per-cycle of memory bandwidth on non-vectorized code? If not,
then there is a strong financial incentive not to include that excess
bandwidth in commodity products....
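
(The 24-byte figure is two 64-bit operands in plus one 64-bit result out per
clock. A small C sketch of that arithmetic; the 40 MHz clock is an assumed
number, used only for scale.)

    #include <stdio.h>

    /* Sustained memory traffic of one 64-bit vector pipe: two operands
     * loaded and one result stored per clock = 3 * 8 = 24 bytes/cycle.
     * The 40 MHz clock below is an assumed figure, just for scale.    */
    int main(void)
    {
        double words_per_cycle = 3.0;     /* 2 loads + 1 store     */
        double bytes_per_word  = 8.0;     /* 64-bit words          */
        double clock_hz        = 40.0e6;  /* assumed 40 MHz clock  */

        double bytes_per_cycle = words_per_cycle * bytes_per_word;
        printf("%.0f bytes/cycle = %.0f MB/s sustained\n",
               bytes_per_cycle, bytes_per_cycle * clock_hz / 1.0e6);
        return 0;
    }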

In addition, the engineering/cost trade-off between memory bandwidth
and memory latency will continue to exist for the "KILLER MICROS" as
it does for the current generation of supercomputers. Some users will
be willing to sacrifice latency for bandwidth, and others will be
willing to do the opposite. Economies of scale will not eliminate
this trade-off, except perhaps by eliminating the companies that take
the less profitable position (e.g. ETA).

>Supercomputers of the future will be scalable multiprocessors made of
>many hundreds to thousands of commodity microprocessors. They will
>be commodity parts because these parts will be the fastest around and
>they will be cheap.

It seems to me that the experience in the industry is that
general-purpose processors are not usually very effective in
parallel-processing applications. There is certainly no guarantee
that the uniprocessors which are successful in the market will be
well-suited to the parallel supercomputer market -- which is not
likely to be a big enough market segment to have any control over what
processors are built....

The larger chip vendors are paying more attention to parallelism now,
but it appears to be in the context of 2-4 processor parallelism. It
is not likely to be possible to make these chips work together in
configurations of 1000's with the application of "glue" chips....

This is not to mention the fact that software technology for these
parallel supercomputers is depressingly immature. I think traditional
moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
existing scientific workloads better than 1000-processor parallel
machines for quite some time....

Wm E Davidsen Jr

Oct 16, 1989, 3:12:29 PM

There's more to supercomputing than scalar speed. One of the primary
things you can do on a supercomputer is run large programs quickly.
Virtual memory is nice, but some programs cause it to thrash. That's
when it's nice to have a real 4GB machine. The same thing can be said
about vector processing: some programs can be done using vector
processors (or lots of parallel processors) faster than scalar.

I don't see the death of the supercomputer, but a redefinition of
problems needing one. I have more memory on my home computer than all
the computers at this site when I started working here (hell the total
was <2 MB). Likewise CPU and even disk. The number of problems which I
can't solve on my home system is a lot smaller than it was back then.


However, that's the kicker: real problems are limited in size.
Someone said that the reason for micros catching up is that the
development cost could be spread over the users. For just that reason
the vector processors will stay expensive, because fewer users will need
(ie. buy) them. There will always be a level of hardware needed to solve
problems which are not shared by many users. While every problem has a
scalar portion, many don't need vectors, or even floating point.

I think this goes for word size, too. When I see that the Intel 586
will have a 64 bit word I fail to generate any excitement. The main
effect will be to break all the programs which assume that short==16
bits (I've ported to the Cray, this *is* a problem). If you tell me I
can have 64 bit ints, excuse me if I don't feel the need to run right
out and place an order. Even as memory gets cheaper I frequently need
1-2 million ints, and having them double in size is not going to help
keep cost down.

I think that the scalar market will continue to be micros, but I
don't agree with Eric that the demand for supercomputers will vanish,
or that micros will catch them for the class of problems which are
currently being run on supercomputers. The improving scalar performance
will reduce the need for vector processing, and keep them from getting
economies of scale. He may well be right that some of the companies
will fall, since the micros will be able to solve a lot of the problems
which are not massively vectorizable and do not inherently require a huge
addressing space.

--
bill davidsen (davi...@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

Hugh LaMaster

Oct 16, 1989, 3:29:59 PM

This article certainly generated some responses. Unfortunately, some responders
seemed to miss (or chose to ignore :-) the tongue-in-cheek nature of the title.

I used to argue, only a couple of years ago, that supercomputers produced
cheaper scalar computing cycles than "smaller" systems. That isn't true today.
However, supercomputers still produce cheaper floating point results on
vectorizable jobs. And, they produce memory bandwidth cheaper than other
systems. That may change, too.


Q: What will it take to replace a Cray with a bunch of micros?

A: (IMHO) : A "cheap" Multiport Interleaved Memory subsystem. In order to do
that, you need to provide a way to build such subsystems out of a maximum of 3
different chips, and be able to scale the number of processors and interleaving
up and down. A nice goal might be a 4-port/32-way-interleaved 64-bit-wide
subsystem cheap enough for a $100 K system. (That is only enough memory
bandwidth for a 1 CPU Cray-like system, or 4 micro based CPUs with only
1 word/cycle required, but it would sure be a big step
forward.) The subsystem needs to provide single level local-like memory,
like a Cray.
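
(A rough sizing of that proposal as a C sketch: four ports, each delivering
one 64-bit word per cycle. The 25 MHz clock below is an assumption for
illustration only.)

    #include <stdio.h>

    /* Rough peak bandwidth of a 4-port, 64-bit-wide interleaved
     * subsystem: each port delivers one word per cycle.  The 25 MHz
     * system clock is an assumed figure for illustration only.       */
    int main(void)
    {
        int    ports      = 4;
        double word_bytes = 8.0;     /* 64-bit wide           */
        double clock_hz   = 25.0e6;  /* assumed 25 MHz clock  */

        double mb_per_s = ports * word_bytes * clock_hz / 1.0e6;
        printf("peak = %.0f MB/s across %d ports\n", mb_per_s, ports);
        return 0;
    }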

[Or, show a way to make, in software, a truly distributed system as efficient
as a local memory system (PhD thesis material...- I am betting on hardware
solutions in the short run...)].

You also need to provide a reasonably reliable way for the memory to subsystem
connections to be made. This is hard, hardware-level engineering. For
example, you probably can't afford the space for 32 VME buses...

Does anyone have any suggestions on how the connections into and out of such
memory subsystems could be made without a Cray-sized bundle of connectors?


On the topic of the original posting, what I have seen is that
micro based workstations are eating away fast at the minicomputer market,
just on the basis of price performance, leaving only workstation clusters,
vector machines (Convex-sized to Cray-sized), and other big iron,
such as very large central storage servers. So, I wouldn't write off
big iron just yet, but obviously some companies will be selling a lot more
workstations and a lot fewer minicomputers than they were planning.


Quiz: Why does Cray use *8* way interleaving per memory *port* on the
Cray Y-MP?

Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lama...@ames.arc.nasa.gov
Moffett Field, CA 94035
Phone: (415)694-6117

Robert Colwell

Oct 16, 1989, 3:30:33 PM
In article <73...@thor.acc.stolaf.edu> mi...@thor.stolaf.edu () writes:
>In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>>I take my hat off to them, too, because that's no mean feat. But don't
>>forget that the supercomputers didn't set out to be the fastest machines
>>on scalar code. If they had, they'd all have data caches, non-interleaved
>>main memory, and no vector facilities. What the supercomputer designers
>
>Excuse me, non-interleaved main memory? I've always assumed that
>interleaved memory could help scalar code too. After all, instruction
>fetch tends to take place from successive addresses. Of course if
>main memory is very fast there is no point to interleaving it, but
>if all you've got is drams with slow cycle times, I would expect
>that interleaving them would benefit even straight scalar code.

I meant that as a shorthand way of putting across the idea that the usual
compromise is a trade-off among memory size, memory bandwidth, and memory latency.
For the canonical scalar code you don't need a very large memory, and
the bandwidth may not be as important to you as the latency (pointer
chasing is an example).

The point I was making was that the supercomputers have incorporated
design decisions, such as very large physical memory, and very high
bandwidth to and from that memory, so that their multiple functional
units can be kept usefully busy while executing 'parallel' code. Were
you to set out to design a machine which didn't (or couldn't) use those
multiple buses (pin limits on a single-chip micro for instance) then
that bandwidth isn't worth as much to you and you might be better off
with a flat, fast memory, which is what most workstations do (or used
to do, anyway).

Philip A. Naecker

Oct 16, 1989, 4:37:42 PM
> Fact number 1:
> The best of the microprocessors now EXCEED supercomputers for scalar
> performance and the performance of microprocessors is not yet stagnant.
> On scalar codes, commodity microprocessors ARE the fastest machines at
> any price and custom cpu architectures are doomed in this market.

Alas, I believe you have been sucked into the MIPS=Performance fallacy. There
is *not* a simple relationship between something as basic as scalar performance
and something as complex as overall application (or even routine) performance.

Case in point: The R2000 chipset implemented on the R/120 (mentioned by others
in this conversation) has, by all measures *excellent* scalar performance. One
would benchmark it at about 12-14 times a microVAX. However, in real-world,
doing-useful-work, not-just-simply-benchmarking situations, one finds that
actual performance (i.e., performance in very simple routines with very simple
algorithms doing simple floating point operations) is about 1/2 that expected.

Why? Because memory bandwidth is *not* as good on an R2000 as it is on other
machines, even machines with considerably "slower" processors. There are
several components to this, the most important being the cache implementation
on an R/120. Other implementations using the R2000/R3000/Rx000 chipsets might
well do much better, but only with considerable effort and cost, both of which
mean that those "better" implementations will begin to approach the price/
performance of the "big" machines that you argue will be killed by the
price/performance of commodity microprocessors.

I think you are to a degree correct, but one must always tailor such
generalities with a dose of real-world applications. I didn't, and I got bit
to the tune of a fine bottle of wine. :-(

Phil

_______________________________________________________________________________
Philip A. Naecker Consulting Software Engineer
Internet: p...@propress.com Suite 101
uunet!prowest!pan 1010 East Union Street
Voice: +1 818 577 4820 Pasadena, CA 91106-1756
FAX: +1 818 577 0073 Also: Technology Editor
DEC Professional Magazine
_______________________________________________________________________________

Hugh LaMaster

Oct 16, 1989, 4:54:16 PM
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks)
writes:

(Another amusing challenge:)

>After all, these little critters mutate and become more voracious every
>6 months and vectorizable code is the only thing left for them to conquer.

(I like the picture of fat computer vendors, or at least fat marketing depts,
hunched together in bunkers hiding from the killer micros. I have no doubt
that they are planning a software counterattack. Watch out for a giant MVS
robot built to save the day! :-)

>No NEW technology needs to be developed, all the micro-chip and memory-chip
>makers need to do is to decide to take over the supercomputer market.
>
> They will do this with their commodity parts.

The only problem I see with this is the interconnection technology. The
*rest* of it is, or will soon be, commodity market stuff.

>Supercomputers of the future will be scalable multiprocessors made of many
>hundreds to thousands of commodity microprocessors.

The appropriate interconnection technology for this has not, to my knowledge,
been determined. Perhaps you might explain how it will be done? The rest,
I agree, is doable at this point, though some of it is not trivial.

Hugh LaMaster

Oct 16, 1989, 6:01:41 PM
In article <1...@csinc.UUCP> rpe...@csinc.UUCP (Rob Peglar x615) writes:
>In article <35...@lll-winken.LLNL.GOV>, bro...@maddog.llnl.gov writes:

>that point for almost a generation. I believe that it will take at least
>one more generation - those who weaned on machines from CDC, then CRI -
>before a more reasonable approach to machine procurement comes to pass.

In my experience, gov't labs are very cost conscious. I could tell a lot of
stories on this. Suffice it to say that many people who have come to gov't labs
from private industry get frustrated with just how cost conscious the gov't can
be (almost an exact quote: "In my last company, if we needed another 10GBytes,
all we had to do was ask, and they bought it for us." That was when 10 GBytes
cost $300 K.) The reason supercomputers are used so much is that they get the
job done more cheaply. You may question whether or not new nuclear weapons
need to be designed, but I doubt if the labs doing it would use Crays
if that were not the cheapest way to get the job done. Private industry
concerns with the same kinds of jobs also use supercomputers the same way.
Oil companies, for example. At various times, oil companies have owned more
supercomputers than govt labs.

>Thus, I disagree that there will *always* be a market for this sort of
>thing. Status symbols may be OK in cars, but for machines purchased with
>taxpayer dollars, the end is near. Hence, Brooks' "attack of the killer
>micros".

I will make a reverse claim: People who want status symbols buy PC's for their
office. These PC's, the last time I checked, were only 1/1000th as cost
effective at doing scientific computations as supercomputers. Talk about
*waste*... :-)

Eugene Brooks

Oct 16, 1989, 9:02:25 PM
In article <33...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Supercomputers of the future will be scalable multiprocessors made of many
>>hundreds to thousands of commodity microprocessors.
>
>The appropriate interconnection technology for this has not, to my knowledge,
>been determined. Perhaps you might explain how it will be done? The rest,
>I agree, is doable at this point, though some of it is not trivial.
This is the stuff of research papers right now, and rapid progress is being
made in this area. The key issue is not having the components which establish
the interconnect cost much more than the microprocessors, their off chip caches,
and their main memory. We have been through message passing hypercubes and
the like, which minimize hardware cost while maximizing programmer effort.
I currently lean to scalable coherent cache systems which minimize programmer
effort. The exact protocols and hardware implementation which work best
for real applications are a current research topic. The complexity of the
situation is much too high for a vendor to just pick a protocol and build
without first running very detailed simulations of the system on real programs.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

Oct 16, 1989, 9:10:58 PM
In article <MCCALPIN.89...@masig3.ocean.fsu.edu> mcca...@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The larger chip vendors are paying more attention to parallelism now,
>but it appears to be in the context of 2-4 processor parallelism. It
>is not likely to be possible to make these chips work together in
>configurations of 1000's with the application of "glue" chips....
These microprocessors, for the most part, are being designed to work in
a small processor count coherent cache shared memory environment. This
is the reason why examining scalable coherent cache systems is so important.
The same micros, with their capability to lock a cache line for a while
to do an indivisible op, will work fine in the scalable systems.
I agree that they won't be optimal, but they will be within 90% of optimal
and that is all that is required. The MAJOR problem with current micros
in a scalable shared memory environment is their 32 bit addressing.
Unfortunately, no 4 processor system will ever need more than 32 bit
addresses, so we will have to BEG the micro vendors to put in bigger
pointer support.

>This is not to mention the fact that software technology for these
>parallel supercomputers is depressingly immature. I think traditional
>moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
>existing scientific workloads better than 1000-processor parallel
>machines for quite some time....

The software question is the really hairy one; that is why LLNL is
sponsoring the Massively Parallel Computing Initiative. We see
scalable machines being very cost effective and are making a substantial
effort in the application software area.

bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

Oct 16, 1989, 9:18:18 PM
In article <33...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>I will make a reverse claim: People who want status symbols buy PC's for their
>office. These PC's, the last time I checked, were only 1/1000th as cost
>effective at doing scientific computations as supercomputers. Talk about
>*waste*... :-)
A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost
effective for scalar codes, and we run a lot of those on our supercomputers
at LLNL, and about 3 to 7 times more cost effective for highly vectorized codes.
In fact, much to our computer center's dismay, research staff are voting with
their wallets and buying these "PC"s in droves. Our computer center is responding
by buying microprocessor powered machines, currently in bus based shared memory
multiprocessor form, but eventually in scalable shared memory multiprocessor form.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Michael Golan

Oct 16, 1989, 11:47:32 PM
This came from various people - the references are so confusing I removed them
so as not to put the wrong words in someone's mouth:

>>>Supercomputers of the future will be scalable multiprocessors made of many
>>>hundreds to thousands of commodity microprocessors.
>>

>This is the stuff of research papers right now, and rapid progress is being
>made in this area. The key issue is not having the components which establish

>the interconnect cost much more than the micros, their off chip caches,


>I currently lean to scalable coherent cache systems which minimize programmer
>effort. The exact protocols and hardware implementation which work best
>for real applications is a current research topic.

Last year, I took a graduate level course in parallel computing here at
Princeton. I would like to make the following comments, which are my *own*:

1) There is no parallel machine currently that works faster than non-parallel
machines for the same price. The "fastest" machines are also non-parallel -
these are vector processors.

2) A lot of research is going on - and went on for over 10 years now. As far
as I know, no *really* scalable parallel architecture with shared memory exists
that will scale far above 10 processors (i.e. 100). And it does not seem to
me this will be possible in the near future.
"A lot of research" does not imply any effective results - especially in CS -
just take a look at how many people write articles improving time from
O(N log log N) to O(N log log log N), which will never be practical for N<10^20
or so (the log log is just an example; you know what I mean).

3) Personally I feel parallel computing has no real future, as the single cpu
gets a 2-4 fold performance boost every few years, and parallel machine
construction just can't keep up with that. It seems to me that for at least
the next 10 years, non-parallel machines will still give the best performance
and the best performance/cost.

4) I think Cray-like machines will be here for a long long time. People talk
about Cray-sharing. This is true, but when an engineer needs a simulation to
run and it takes 1 day each time, if you run it on a machine with a 2 or 3 day turnaround, he
sits doing nothing for that time, which costs you a lot, i.e. it is turn-around
time that really matters. And while computers get faster, it seems software
complexity and the need for faster and faster machines are growing even more
rapidly.

Michael Golan
m...@princeton.edu

The opinions expressed above are my *own*. You are welcome not to like them.

Ken Leonard

Oct 17, 1989, 9:27:48 AM
In article <12...@cgl.ucsf.EDU> sei...@cgl.ucsf.edu (George Seibel) writes:
* In <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov wrote:
* > On scalar codes, commodity microprocessors ARE the fastest machines at
* > any price and custom cpu architectures are doomed in this market.
*
* Speaking of "commodities",...
* ...
* *commodity Mflops*. The issue is no longer performance at any cost, because
* if it was you would order another machine at that point. The important
* thing is Mflops/dollar for most people, and that's where the micros are
* going to win in a lot of cases.
---- well, first...
Maybe, even, the _commodity_ is _not_ _M_flops per dollar, but just
_flop_flops per dollar? That is, if the cycle time to "set up the problem",
"crunch the numbers", "get the plot/list/display" is under _whatever_ upper
limit fits with _my_ mode of "useful work", then I very likely _do_not_care_
if it gets any shorter (i.e. if the _flop_flops per second per dollar goes
higher). This becomes, IMHO, even more significant if my "useful" cycle time
is available to me _truly_ whenever _I_ darn well feel the need.
All of which works, again, to the advantage of microcrunchers.
---- and, second...
A non-trivial part of the demand for megacrunchers, IMHO, stems from
solution methods which have evolved from the days when _only_ "big"
machines were available for "big" jobs (any jobs?) and _just_had_to_be_
shared. For what _I_ do, anyhow, (and probably a _lot_ of other folk
somewhere out there), the "all-in-one-swell-foop" analyses/programs/techniques
are not the _only_ way to get to the _results_ needed to do the job--and
they may well _not_ be the "best" way. I often find that somewhat more of
somewhat smaller steps get me to my target faster than otherwise. That is,
if I can only get 1 or 2 or 10 passes per day through the megacruncher, the
job takes more work from me and more time on the calendar and more bucks from
whoever is paying the tab, than if I can make just as many as I need of smaller
passes.
---- also third...
And those smaller passes may well be easier (and thus faster) to program,
and more amenable to validation/assurance/etc.
And they may admit algorithms which work plenty fast on a dedicated machine
even if it is pretty small but would not work very fast at all on a shared
machine even if it is quite big (maybe especially because it is "big
architecture").
---- so, finally...
I believe in micros.
-------------
regardz,
Ken Leonard

Steve Warren

Oct 17, 1989, 10:15:25 AM
In article <33...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
[...]

>Does anyone have any suggestions on how the connections into and out of such
>memory subsystems could be made without a Cray-sized bundle of connectors?
[...]
Multiplexed optical busses driven by integrated receivers with the optics,
decoders, and logic-level drivers on the same substrate. It's the obvious
solution (one I think many companies are working on).


DISCLAIMER: This opinion is in no way related to my employment with
Convex Computer Corporation. (As far as I know we aren't
working on optical busses, but then I'm not in New Products).

--Steve
-------------------------------------------------------------------------
{uunet,sun}!convex!swarren; swa...@convex.COM

Steve Warren

Oct 17, 1989, 10:57:44 AM
In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:
>Last year, I took a graduate level course in parallel computing here at
>Princeton. I would like to make the following comments, which are my *own*:
>
>1) There is no parallel machine currently the works faster than non-parallel
>machines for the same price. The "fastest" machines are also non-parallel -
>these are vector processors.
>
The Cray XMP with one processor costs approx. $2.5M. The 4 processor Convex
C240S costs $1.5M. On typical scientific applications the performance of the
240S is about 140% of the single processor Cray XMP. (The 240S is the newest
model with enhanced performance CPUs).
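
(Taking those figures at face value, the price/performance ratio works out as
in the sketch below; the 140% and the two prices are the numbers quoted in
this post, not independent data.)

    #include <stdio.h>

    /* Price/performance using the figures quoted above: Cray X-MP/1
     * at ~$2.5M, Convex C240S at ~$1.5M running ~1.4x the X-MP/1 on
     * the applications cited.  A ratio > 1 favors the Convex.        */
    int main(void)
    {
        double cray_price   = 2.5e6, cray_perf   = 1.0;
        double convex_price = 1.5e6, convex_perf = 1.4;

        double ratio = (convex_perf / convex_price) /
                       (cray_perf   / cray_price);
        printf("Convex price/performance advantage: %.1fx\n", ratio);
        return 0;
    }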

Also, vector processors are technically nonparallel, but the implementation
involves parallel function units that are piped up so that at any one instant
in time there are multiple operations occurring. Vectors are a way of doing
parallel processing on a single stream of data.

These were the only points I would disagree with.

Rob Peglar x615

Oct 17, 1989, 11:09:33 AM
In article <33...@ames.arc.nasa.gov>, lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <1...@csinc.UUCP> rpe...@csinc.UUCP (Rob Peglar x615) writes:
>
> >that point for almost a generation. I believe that it will take at least
> >one more generation - those who weaned on machines from CDC, then CRI -
> >before a more reasonable approach to machine procurement comes to pass.
>
> In my experience, gov't labs are very cost conscious. I could tell a lot of
> stories on this. Suffice it to say that many people who have come to gov't labs
> from private industry get frustrated with just how cost conscious the gov't can
> be (almost an exact quote: "In my last company, if we needed another 10GBytes,
> all we had to do was ask, and they bought it for us." That was when 10 GBytes
> cost $300 K.) The reason supercomputer are used so much is that they get the
> job done more cheaply. You may question whether or not new nuclear weapons
> need to be designed, but I doubt if the labs doing it would use Crays
> if that were not the cheapest way to get the job done. Private industry
> concerns with the same kinds of jobs also use supercomputers the same way.
> Oil companies, for example. At various times, oil companies have owned more
> supercomputers than govt labs.

Good point. However, oil companies in particular are notorious for having
procurements follow the "biggest and baddest = best" philosophy. Hugh, you
know as well as I that supercomputer procurement is not a rational or
scientific process - it's politics, games, and who knows who. Cheap,
efficient, usable, etc.etc. - all take a back seat to politics. However,
if the "job" is defined as running one (or some small number of) code(s)
for hours then there is no question that only a super will do. The point
that Brooks doesn't make, but only implies, is that the *way* scientific
computing is being done changes all the time. One-job killer codes are
becoming less prevalent. The solutions must change as the workload
changes. Sure, there are always codes which cannot be run (Lincoln's
attributed quote, compressed: "supercomputer == only one generation
behind the workload"), but yesterday's killer code, needing 8 hours of
4 million 64-bit words, can now be done on the desktop. (see below)

>
> >Thus, I disagree that there will *always* be a market for this sort of
> >thing. Status symbols may be OK in cars, but for machines purchased with
> >taxpayer dollars, the end is near. Hence, Brooks' "attack of the killer
> >micros".
>
> I will make a reverse claim: People who want status symbols buy PC's for their

Please. Are you saying that NAS, LLNL, LANL, etc. don't compete for status
defined as big, bad hardware? Just the glorious battle between Ames and
Langley provides one with enough chuckles to last quite a while.

> office. These PC's, the last time I checked, were only 1/1000th as cost
> effective at doing scientific computations as supercomputers. Talk about
> *waste*... :-)
>
>

Look again. I'll give you a real live example. Buy any 33 MHz 386 machine,
with a reasonable cache (e.g. at least 128 kB) of fast SRAM, and 8 MB
or so of (slower) DRAM. Plug in a Mercury co-processor board, and use
Fortran (supplied by Mercury) to compile Dr. Dongarra's Table One
Linpack. Results on PC - 1.8Mflops. Using a coded BLAS, you get 4.7
Mflops. This is 64-bit math. Last time *I* checked, the Cray Y-MP
stood at 79 Mflops. Cost of Cray Y-MP? You and I know what that is.
Even discounting life cycle costing (which for any Cray machine, is
huge due to bundled maintenance, analysts, etc.etc.), the performance ratio
of Y to PC is 79/1.8 = 43.88. I'll bet my year's salary that the price
ratio is higher than that. To ballpark, price for the PC setup is
around $20K. Moving down all the time. Even if the Y-MP 1/32 was
only $2M (which it is not) that would be 100:1 price ratio.
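
(The same arithmetic as a C sketch, using only the numbers given above:
1.8 vs. 79 Mflops, $20K vs. the deliberately low $2M bound.)

    #include <stdio.h>

    /* Ratios from the figures quoted above: 100x100 Linpack, all
     * Fortran, PC + co-processor board vs. a one-CPU Y-MP.  The $2M
     * Y-MP price is the deliberately low bound used in the post.     */
    int main(void)
    {
        double pc_mflops  = 1.8,  pc_price  = 20.0e3;
        double ymp_mflops = 79.0, ymp_price = 2.0e6;

        printf("performance ratio: %.1f : 1\n", ymp_mflops / pc_mflops);
        printf("price ratio:       %.0f : 1\n", ymp_price / pc_price);
        printf("Mflops per $1K: PC %.3f vs Y-MP %.3f\n",
               pc_mflops / (pc_price / 1e3),
               ymp_mflops / (ymp_price / 1e3));
        return 0;
    }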

Of course, that is only one code. Truly, your mileage will vary. The
price/performance ratio of an overall system is dependent on many
variables.


After all that, Brooks' point is still valid. Micros using commodity
HW and cheap (sometimes free) software are closing the gap. They have
already smashed the price/performance barrier (for many codes), and the
slope of their absolute performance improvements over time is much larger
than any of the true super vendors (any==1 now, at least US)


The game is nearly over.

Rob


...uunet!csinc!rpeglar

Hugh LaMaster

Oct 17, 1989, 12:07:17 PM
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:

>A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost

A quick clarification: The "PC's" I was talking about are IBM PC's and clones
based on Intel 80x86 chips, *not* SGI or DEC machines based on R3000/R3010s.
"PC" may also be extended to Apple Mac and Mac II machines by some people.
Most of the "PC" boosters that I am thinking of, and from which we have heard
in this newsgroup recently, are also "offended" by the
"excessive" power and cost of MIPSCo based machines. Not me, obviously, but
most of these people do not consider an SGI 4D/25 a "PC".

>effective for scalar codes, and we run a lot of those on our supercomputers
>at LLNL, and about 3 to 7 times more cost effective for highly vectorized
> codes.

Well, I admit, I hadn't done a calculation for some months. Last time I did it,
I was somewhat disappointed by the inflated claims surrounding micro based
systems. I have been hearing "wolf!" for 15 years, so it is easy to be blasé
about it. But, this USENET discussion stimulated me
to look at it again. Another quick calculation shows a *big change*. It appears
to me, on the face of it, that cost/delivered FLOP is now about even. I don't
see the 3 -7 X advantage to the micros yet, but maybe you are looking at the
faster 60-100 MHz systems that will soon be arriving. I used SGI 4D/280's
as the basis of comparison, since that appears to be the most cost effective
of such systems that I have good pricing information on. Anyway, how long has
it taken Cray to shave a few ns off the clock? In less than a year we should
see systems based on the new micro chips. Yikes. It looks like the
ATTACK OF THE KILLER MICROS.

Peter Desnoyers

Oct 17, 1989, 1:08:54 PM
> > In my experience, gov't labs are very cost conscious. I could tell a
> > lot of stories on this. Suffice it to say that many people who have come
> > to gov't labs from private industry get frustrated with just how cost
> > conscious the gov't can be. (almost an exact quote: "In my last company,
> > if we needed another 10GBytes, all we had to do was ask, and they bought
> > it for us." That was when 10 GBytes cost $300 K.) The reason
> > supercomputer are used so much is that they get the job done more
> > cheaply.

From what I know of DOD procurement (my father works for a US Navy lab)
one factor may be that the time and effort needed to justify spending
$25,000 of Uncle Sam's money on a super-micro, along with the effort of
spec'ing it as sole-source or taking bids, is no doubt far more than
1/400th the effort needed to procure a $10M supercomputer.

Peter Desnoyers
Apple ATG
(408) 974-4469

Eugene Brooks

Oct 17, 1989, 1:39:13 PM
In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:
>1) There is no parallel machine currently the works faster than non-parallel
>machines for the same price. The "fastest" machines are also non-parallel -
>these are vector processors.
This is false. There are many counterexamples for specific applications.


>2) A lot of research is going on - and went on for over 10 years now. As far
>as I know, no *really* scalable parallel architecture with shared memory exists
>that will scale far above 10 processors (i.e. 100). And it does not seems to
>me this will be possible in the near future.

Again, this is wrong. Many scalable architectures exist in the literature
and some of them are well proven using simulation on real application codes.

>3) personally I feel parallel computing has no real future as the single cpu
>gets a 2-4 folds performance boost every few years, and parallel machines
>constructions just can't keep up with that. It seems to me that for at least
>the next 10 years, non-parallel machines will still give the best performance
>and the best performance/cost.

Massively parallel computing has a future because the performance increases
are 100 or 1000 fold. I agree with the notion that using 2 processors, if
the software problems are severe, is not worth it because next year's micro
will be twice as fast. Next year's supercomputer, however, will not be twice
as fast.

>4) I think Cray-like machines will be here for a long long time. People talk
>about Cray-sharing. This is true, but when an engineer needs a simulation to
>run and it takes 1 day each time, if you run it on a 2 or 3 day machine, he
>sits doing nothing for that time, which costs you a lot, i.e. it is turn-around
>time that really matters. And while computers get faster, its seems software
>complexity and the need for faster and faster machines is growing even more
>rapidly.

Cray like machines will be here for a long time indeed. They will, however,
be implemented on single or nearly single chip microprocessors. I do not
think that the "architecture" is bad, only the implementation has become
nearly obsolete. It is definitely obsolete for scalar code, and vectorized
code will follow within 5 years.

bro...@maddog.llnl.gov, bro...@maddog.uucp

Horace Dediu

Oct 17, 1989, 3:02:16 PM
In article <20...@princeton.Princeton.EDU>, m...@notecnirp.Princeton.EDU (Michael Golan) writes:
> This came for various people - the references are so confusing I removed them
> so as not to put the wrong words in someone's mouth:
>
> >>>Supercomputers of the future will be scalable multiprocessors made of many
> >>>hundreds to thousands of commodity microprocessors.
> >>
> >This is the stuff of research papers right now, and rapid progress is being
> >made in this area. The key issue is not having the components which establish
> >the interconnect cost much more than the micros, their off chip caches,
> >I currently lean to scalable coherent cache systems which minimize programmer
> >effort. The exact protocols and hardware implementation which work best
> >for real applications is a current research topic.
>
> 1) There is no parallel machine currently the works faster than non-parallel
> machines for the same price. The "fastest" machines are also non-parallel -
> these are vector processors.
>

Consider the 8k processor NCUBE 2--"The World's Fastest Computer."
(yes, one of those). According to their literature:
"8,192 64 bit processors each equivalent to one VAX 780. It delivers
60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
performance of any other currently available or recently announced
supercomputer." It's distributed memory .5MB per processor, runs UNIX,
and is a hypercube.

I don't know the price, but I bet it's less than a Cray. Interesting to
talk about GigaFLOPS. This is fast.
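
(Dividing the quoted aggregates by the processor count gives the per-node
rates; the sketch below just does that division on the figures from the
literature cited above.)

    #include <stdio.h>

    /* Per-node rates implied by the NCUBE 2 literature quoted above:
     * 8,192 processors, 60 billion instructions/s and 27 billion
     * scalar FLOPS aggregate.                                        */
    int main(void)
    {
        double nodes  = 8192.0;
        double gips   = 60.0e9;   /* instructions per second, aggregate */
        double gflops = 27.0e9;   /* scalar FLOPS, aggregate            */

        printf("per node: %.1f MIPS, %.1f MFLOPS\n",
               gips / nodes / 1.0e6, gflops / nodes / 1.0e6);
        return 0;
    }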

> 2) A lot of research is going on - and went on for over 10 years now. As far
> as I know, no *really* scalable parallel architecture with shared memory exists
> that will scale far above 10 processors (i.e. 100). And it does not seems to
> me this will be possible in the near future.

Who cares about shared memory? Distributed is the only way to scale.
Everybody realizes this since it can be proven.
The only reason shared memory machines exist is because we don't yet know
how to make good distributed machines. (Yeah, right! tell that to Ncube)
IMHO shared memory is a hack using available bus technology while waiting for
the real parallel machines to come. (they're already here)

> 3) personally I feel parallel computing has no real future as the single cpu
> gets a 2-4 folds performance boost every few years, and parallel machines
> constructions just can't keep up with that. It seems to me that for at least
> the next 10 years, non-parallel machines will still give the best performance
> and the best performance/cost.

This is very ambiguous. Parallel machines can use off-the shelf CPU's. If a
fast micro is available then you can design a parallel machine around it as
you would any workstation. The other problem: if cpu's increase 2-4
folds every few years and if this can be maintained for 10 years you can
only expect a 32 fold increase. This is nothing. You can't expect problems
to stay that small. If you expect to go beyond that you'll hit a wall with
the fundamental boundary of the speed of light. You can't drive clock rates
to infinity. The only way to speed up is to do it in parallel. Sure it's
hard to program, but it's a new field, the tools are rudimentary and only
hardware people are involved in their development. If enough effort is put
into it parallel machines should not be any harder to program than your
basic workstation.
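
To spell out that arithmetic (a small sketch; a doubling every two years
is only an illustrative assumption consistent with the 2-4 fold figure
above):

    #include <stdio.h>
    #include <math.h>

    /* If single-CPU speed improves by a factor F every P years, the total
       improvement after Y years is F^(Y/P).  F = 2, P = 2, Y = 10 gives 32x. */
    int main(void)
    {
        double F = 2.0, P = 2.0, Y = 10.0;    /* illustrative values only */
        printf("speedup after %.0f years: %.0fx\n", Y, pow(F, Y / P));
        return 0;
    }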

> time that really matters. And while computers get faster, its seems software
> complexity and the need for faster and faster machines is growing even more
> rapidly.

Of course. To solve hard problems you *need* parallel execution.
It's no secret that every big iron maker and every supercomputer shop is
developing parallel machines. These are still modest efforts (<100 cpu's),
but the leading edge is now in the 10k coarse grained, 64k fine grained
processors. This should scale nicely to 1M processors in the next decade.
After that we can expect some kind of new barriers to come up.

> Michael Golan
> m...@princeton.edu

--
Horace Dediu \"That's the nature of research--you don't know |GTE Laboratories
(617) 466-4111\ what in hell you're doing." `Doc' Edgerton |40 Sylvan Road
UUCP: ...!harvard!bunny!hhd0................................|Waltham, MA 02254
Internet: hh...@gte.com or hhd0%gte...@relay.cs.net..........|U. S. A.

Hugh LaMaster

unread,
Oct 17, 1989, 3:28:51 PM10/17/89
to
In article <36...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:

>Cray like machines will be here for a long time indeed. They will, however,
>be implemented on single or nearly single chip microprocessors. I do not
>think that the "architecture" is bad, only the implementation has become
>nearly obsolete. It is definitely obsolete for scalar code and vectorized
>code will follow within 5 years.

I agree with you here. In fact, did anyone notice a recent newspaper article
(In Tuesday's Merc. News - from Knight Ridder:)

"Control Data to use Mips design"

"Control Data Corp. has cast its lot with Mips Computer Systems, inc. to design
the brains of its future computers, choosing a new computer architecture
developed by the Sunnyvale Company."

...


"The joint dev. agreement with Mips means Control Data will use [...] the RISC
architecture developed by that firm..."

Ian L. Kaplan

unread,
Oct 17, 1989, 4:39:54 PM10/17/89
to
In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:
>Last year, I took a graduate level course in parallel computing here at
>Princeton. [...]

>
>1) There is no parallel machine currently the works faster than non-parallel
>machines for the same price. The "fastest" machines are also non-parallel -
>these are vector processors.
>

Clearly this is proof that a little knowledge is a dangerous thing.

Even n-cube machines run applications like Monte Carlo simulation
with _much_ better price performance than supercomputers. Now it
might be claimed that this is a special class of applications.
However parallel processors are not limited to n-cubes. The
Connection Machine has beaten Cray machines on a number of classic
vectorizable codes (e.g., fluid flow). For reference see "Proceedings
of the Conference on Scientific Applications of the Connection
Machine", 1988, Edited by H. D. Simon, World Scientific press. Note
that the Connection Machine is probably less than half the cost of the
Cray. I am sure that even cheaper SIMD processors will appear in the
near future.

>2) A lot of research is going on - and went on for over 10 years now. As far
>as I know, no *really* scalable parallel architecture with shared
>memory exists that will scale far above 10 processors (i.e. 100). And
>it does not seems to me this will be possible in the near future.

By this narrow definition, the statement is more or less correct.
Classic shared memory MIMD systems with snoopy caches saturate
rapidly. However, this is simply the wrong approach to the problem.
SIMD architectures like the Connection Machine are scalable. Perhaps
you did not study these.

>3) personally I feel parallel computing has no real future as the single cpu
>gets a 2-4 folds performance boost every few years, and parallel machines
>constructions just can't keep up with that. It seems to me that for at least
>the next 10 years, non-parallel machines will still give the best performance
>and the best performance/cost.

This statement is not even true for MIMD processors.
Multiprocessors with shared memory, used as file servers, have _much_
better performance than comparable single-processor systems. Then
there are the issues of fault tolerance. When a Multiprocessor fails,
you pull a board out and go on. When a uni-processor fails, you wait
for field service. Sequent is doing a booming business selling shared
memory multiprocessors. Even DEC is selling a multiprocessor (the
VAXstation 3520).

Ian Kaplan
MasPar Computer Corp.
argosy!i...@decwrl.dec.com

The opinions expressed here are not necessarily shared by MasPar
Computer Corp.

Anton Rang

unread,
Oct 17, 1989, 7:15:46 PM10/17/89
to
In article <3...@argosy.UUCP> i...@argosy.UUCP (Ian L. Kaplan) writes:
>In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:
>>2) A lot of research is going on - and went on for over 10 years now. As far
>>as I know, no *really* scalable parallel architecture with shared
>>memory exists that will scale far above 10 processors (i.e. 100). And
>>it does not seems to me this will be possible in the near future.
>
> By this narrow definition, the statement is more or less correct.
>Classic shared memory MIMD systems with snoopy caches saturate
>rapidly. However, this is simply the wrong approach to the problem.
>SIMD architectures like the Connection Machine are scalable. Perhaps
>you did not study these.

The group working on the IEEE SCI (scalable coherent interconnect)
claims that they will be able to handle up to 65,536 processors in
either a message-passing or shared-memory environment. I haven't had
a chance to read all their working papers, but it does seem that their
stuff should scale well to 100-500 processors, at least.
Their goal is 1 GB/sec bandwidth per processor (16 bits every 2 ns),
and the interface chips are supposed to be available next year. (I'm
eagerly waiting to see what this looks like... :-)
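
Spelling out the bandwidth arithmetic (a trivial sketch; the
16-bits-every-2-ns figure is just the SCI target quoted above):

    #include <stdio.h>

    /* SCI target: 16 bits transferred every 2 ns, per processor. */
    int main(void)
    {
        double bytes_per_transfer   = 16.0 / 8.0;
        double seconds_per_transfer = 2e-9;
        printf("%.2e bytes/sec per processor\n",
               bytes_per_transfer / seconds_per_transfer);  /* 1e9 = 1 GB/s */
        return 0;
    }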

The reference number for the standard is IEEE P1596....

Anton

+----------------------------------+------------------+
| Anton Rang (grad student) | ra...@cs.wisc.edu |
| University of Wisconsin--Madison | |
+----------------------------------+------------------+

Gregory G. Woodbury

unread,
Oct 17, 1989, 10:10:03 PM10/17/89
to
>In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene
>Brooks) writes:
>
>>Microprocessor development is not ignoring vectorizable workloads. The
>>latest have fully pipeline floating point and are capable of pipelining
>>several memory accesses.
>>[ ... more stuff deleted ... ]
>
>It is not at all clear to me that the memory bandwidth required for
>running vector codes is going to be developed in commodity parts. To
>be specific, a single 64-bit vector pipe requires a sustained
>bandwidth of 24 bytes per clock cycle. Is an ordinary, garden-variety
>commodity microprocessor going to be able to use 6 32-bit
>words-per-cycle of memory bandwidth on non-vectorized code? If not,
>then there is a strong financial incentive not to include that excess
>bandwidth in commodity products....
>
This is quite a statement. Don't forget - even if the micro
cannot make FULL use of a vector pipeline, including one will enhance
performance significantly. The theoretical folks in this forum are quite
useful in the development of theoretical maxima, but even some partial
vector capabilities in a floating point unit will be greeted with joy.
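
For concreteness, the 24-bytes-per-clock figure above comes from nothing
more exotic than an element-wise loop like the following (a sketch,
assuming 64-bit operands and one result produced per clock, as in the
quoted argument):

    /* Per element: load b[i] (8 bytes), load c[i] (8 bytes), store a[i]
       (8 bytes) = 24 bytes of memory traffic.  A vector pipe producing
       one result per clock therefore needs ~24 bytes/clock sustained. */
    void vadd(double *a, const double *b, const double *c, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

Even a micro that sustains only a fraction of that traffic gets a real
win on loops of this shape, which is the point about partial vector
capability.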

Lots and lots of "commodity" programs out there do things that would
benefit from some primitive vector computations. Just in the past couple of
weeks we have had some discussions here about the price/performance
aspects of these "Killer Micros". ( I do want to acknowledge that my
price figures were a little skewed - another round of configuration work
with various vendors has shown that I can find a decent bus speed and
SCSI disks in the required price range - thanks for some of the pointers!)

>
>In addition, the engineering/cost trade-off between memory bandwidth
>and memory latency will continue to exist for the "KILLER MICROS" as
>it does for the current generation of supercomputers. Some users will
>be willing to sacrifice latency for bandwidth, and others will be
>willing to do the opposite. Economies of scale will not eliminate
>this trade-off, except perhaps by eliminating the companies that take
>the less profitable position (e.g. ETA).

This is a good restatement of the recent "SCSI on steroids"
discussion. The vendor who can first put a "real" supercomputer or
"real" mainframe on (or beside) the desktop for <$50,000 will make a
killing. Calling something a "Personal Mainframe" makes marketing happy,
but not being able to keep that promise makes for unhappy customers ;-)

--
Gregory G. Woodbury
Sysop/owner Wolves Den UNIX BBS, Durham NC
UUCP: ...dukcds!wolves!ggw ...dukeac!wolves!ggw [use the maps!]
Domain: g...@cds.duke.edu g...@ac.duke.edu ggw%wol...@ac.duke.edu
Phone: +1 919 493 1998 (Home) +1 919 684 6126 (Work)
[The line eater is a boojum snark! ] <standard disclaimers apply>

Shahin Kahn

unread,
Oct 18, 1989, 2:45:30 AM10/18/89
to
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>In article <21...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:

>>In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov () writes:
>>>The best of the microprocessors now EXCEED supercomputers for scalar
>>Is this a fair statement? I've played some with the i860 and
>Yes, in the sense that a scalar dominated program has been compiled for
>the i860 with a "green" compiler, no pun intended, and the same program
>was compiled with a mature optimizing compiler on the XMP, and the 40MHZ
>i860 is faster for this code. Better compilers for the i860 will open

The Cray-XMP is considerably slower than the YMP.
The single-processor XMP is no longer a supercomputer.
Take a program requiring more than 128MBytes of memory (or 64 MBytes
for that matter (but I personally prefer more than 256M to exercise the
VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job)
and then compare any micro you want
or any other system you want with the YMP, or something in
that class, and then try it on a multiprocessor YMP. And please
STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER,
thank you.
And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS"
when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM =
with all needed software and a few GBytes of disk with a few controllers)

Shahin Kahn

unread,
Oct 18, 1989, 2:59:40 AM10/18/89
to
In article <1...@csinc.UUCP> rpe...@csinc.UUCP (Rob Peglar x615) writes:
>(pick your number, the most popular one recently seems to be Linpack
>100x100 all Fortran, Dongarra's Table One). The national labs have proven

Throw away ALL your copies of the LINPACK 100x100 benchmark if you
are interested in supercomputers. The 300x300 is barely big enough
and uses a barely good-enough algorithm to qualify for supercomputer
comparison as a low-impact guideline only.
JJD has lots of warning words in the first paragraphs of his list,
but it looks like most people go right to the table and never read the
paper.

If you must use a single-program benchmark, use the lesson taught
by the Sandia people (John Gustafson, et al.): keep the time fixed
and vary the problem size.

Preston Briggs

unread,
Oct 18, 1989, 4:23:46 AM10/18/89
to

>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>are interested in supercomputers. The 300x300 is barely big enough

Danny Sorenson mentioned recently that linpack is sort of intended
to show how *bad* a computer can be. The sizes are kept
deliberately small so that the vector machines barely have a chance
to get rolling.

So, big if you're optimistic; small otherwise.

Preston Briggs

John D. McCalpin

unread,
Oct 18, 1989, 10:39:33 AM10/18/89
to
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene
Brooks) writes:

>Supercomputers of the future will be scalable multiprocessors made of
>many hundreds to thousands of commodity microprocessors. They will be
>commodity parts because these parts will be the fastest around and
>they will be cheap. These scalable machines will have hundreds of
>commodity disk drives ganged up for parallel access. Commodity parts
>will again be used because of the cost advantage leveraged into a
>scalable system using commodity parts. The only custom logic will be
>the interconnect which glues the system together, and error correcting
>logic which glues many disk drives together into a reliable high
>performance system. The CM data vault is a very good model here.

I think that it is interesting that you expect the same users who
can't vectorize their codes on the current vector machines to be able
to figure out how to parallelize them on these scalable MIMD boxes.
It seems to me that the automatic parallelization problem is much
worse than the automatic vectorization problem, so I think a software
fix is unlikely....

In fact, I think I can say it much more strongly than that:
Extrapolating from current experience with MIMD machines, I don't
think that the fraction of users that can use a scalable MIMD
architecture is likely to be big enough to support the economies of
scale required to compete with Cray and their vector machines. (At
least for the next 5 years or so).

What I *do* think is that the romance with vector machines has worn off,
and people are realizing that they are not the answer to everyone's
problems. This is a good thing --- I like it when people migrate their
scalar codes off of the vector machines that I am trying to get time
on!!!

What is driving the flight from traditional supercomputers to
high-performance micros is turnaround time on scalar codes. From my
experience, if the code is really not vectorizable, then it is
probably not parallelizable either, and scalable machines won't scale.
These users are going to want the fastest single-processor micro
available, unless their memory requirements exceed what they can
afford to purchase.

The people who can vectorize their codes are still getting 100:1
improvements going to supercomputers --- my code is over 500 times
faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the
market for traditional supercomputers won't disappear, it will just be
more limited than many optimists have predicted.
--
John D. McCalpin - mcca...@masig1.ocean.fsu.edu
mcca...@scri1.scri.fsu.edu
mcca...@delocn.udel.edu

Ross Alexander

unread,
Oct 18, 1989, 12:47:27 PM10/18/89
to
bro...@vette.llnl.gov (Eugene Brooks) writes:

>In article <33...@ames.arc.nasa.gov> (Hugh LaMaster) writes:
>>[...] a reverse claim: People who want status symbols buy PC's for their
>>office. These PC's, the last time I checked, were only 1/1000th as cost
>>effective at doing scientific computations as supercomputers. Talk about
>>*waste*... :-)
>A "PC" with a MIPS R3000 or an Intel i860 in it is about 70 times more cost
>effective for scalar codes, and we run a lot of those on our supercomputers

C'mon, Eugene, address the claim, not a straw man of your own
invention. Hugh means people who buy intel-hackitecture machines from
Big Blue. Do you really mean LLNL people buy mips-engine boxes as
office status symbols? Not a very nice thing to say about your own
team ;-). And then you contradict yourself by saying these same
mips-or-whatever boxes are 70 times more effective: are they status
symbols, or are they machines to do work? Make up your mind :-) :-).

Ross

Wm E Davidsen Jr

unread,
Oct 18, 1989, 12:55:31 PM10/18/89
to
In article <33...@ames.arc.nasa.gov>, lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:

| I will make a reverse claim: People who want status symbols buy PC's for their
| office. These PC's, the last time I checked, were only 1/1000th as cost
| effective at doing scientific computations as supercomputers. Talk about
| *waste*... :-)

What you say is true, but you seem to draw a strange conclusion from
it... very few people do scientific calculations on a PC. They are used
for spreadsheets, word processing, and even reading news ;-) These are
things which supercomputers do poorly. Benchmark nroff on a Cray...
EGAD! it's slower than an IBM 3081! Secondly *any* computer becomes less
cost effective as it is used less. Unless you have the workload to
heavily use a supercomputer you will find the cost gets really steep.

Think of it this way: a technical worker costs a company about $50000 a
year (or more), counting salary and benefits. The worker works 240 days
a year (2 weeks vacation, 10 days holiday and sick), at a cost per
*working hour* of $26 more or less. For a $1600 PC to be cost effective
in just a year it must save about 16 minutes a day, which is pretty easy
to do. You also get increased productivity.
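
Spelling out the arithmetic (a sketch using the figures above; the
8-hour working day is an added assumption):

    #include <stdio.h>

    /* Worker: $50,000/year, 240 working days, 8-hour days.
       PC: $1600, amortized over one year of working days. */
    int main(void)
    {
        double salary = 50000.0, days = 240.0, hours = 8.0, pc = 1600.0;
        double cost_per_hour = salary / (days * hours);       /* ~$26   */
        double pc_per_day    = pc / days;                     /* ~$6.67 */
        double breakeven_min = pc_per_day / cost_per_hour * 60.0;
        printf("worker: $%.0f/hour; PC break-even: %.0f minutes/day\n",
               cost_per_hour, breakeven_min);                 /* ~15-16 */
        return 0;
    }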

Obviously not every PC is utilized well. Neither are workstations (how
many hours drawing fractals and playing games) or supercomputers, for
that matter. That problem is a management issue, not a factor of
computer size.
--
bill davidsen (davi...@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

Terry Ingoldsby

unread,
Oct 18, 1989, 2:24:39 PM10/18/89
to
In article <MCCALPIN.89...@masig3.ocean.fsu.edu>, mcca...@masig3.ocean.fsu.edu (John D. McCalpin) writes:
> In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene
> Brooks) writes:
>
> >Microprocessor development is not ignoring vectorizable workloads. The
> >latest have fully pipeline floating point and are capable of pipelining
...
> It seems to me that the experience in the industry is that
> general-purpose processors are not usually very effective in
> parallel-processing applications. There is certainly no guarantee
> that the uniprocessors which are successful in the market will be
> well-suited to the parallel supercomputer market -- which is not
> likely to be a big enough market segment to have any control over what
> processors are built....

Agreed. The only general purpose systems that I am aware of that exploit
parallel processing do so through specialized processors to handle certain
functions (eg. matrix multipliers, I/O processors) or have a small (< 16)
number of general purpose processors.


>
> The larger chip vendors are paying more attention to parallelism now,
> but it appears to be in the context of 2-4 processor parallelism. It
> is not likely to be possible to make these chips work together in
> configurations of 1000's with the application of "glue" chips....

It doesn't seem to be just a case of using custom designed chips as
opposed to generic glue. The problem is fundamentally one of designing
a system that allows the problem to be divided across many processors
AND (this is the tricky part) that provides an efficient communication
path between the sub-components of the problem. In the general case
this may not be possible. Note that mother nature hasn't been able to
do it (e.g. the human brain isn't very good at arithmetic, but for other
applications it's stupendous).


>
> This is not to mention the fact that software technology for these
> parallel supercomputers is depressingly immature. I think traditional
> moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
> existing scientific workloads better than 1000-processor parallel
> machines for quite some time....

> --
I don't think we should berate ourselves about the techniques available
for splitting workloads. No one has ever proved that such an activity
is even possible for most problems (at a *large* scale). The activities
that are amenable to parallel processing (eg. image processing, computer
vision) will probably only be feasible on architectures specifically
designed for those functions.

Note that I'm not saying to give up on parallel processing; on the contrary
I believe that it is the only way to do certain activities. I am saying
that the notion of a general purpose massively parallel architecture that
efficiently executes all kinds of algorithms is probably a naive and
simplistic view of the world.

--
Terry Ingoldsby ctycal!ingo...@calgary.UUCP
Land Information Systems or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

Stan Lackey

unread,
Oct 18, 1989, 2:42:17 PM10/18/89
to
In article <MCCALPIN.89...@masig3.ocean.fsu.edu> mcca...@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene
>Brooks) writes:
>
>>Supercomputers of the future will be scalable multiprocessors made of
>>many hundreds to thousands of commodity microprocessors.

>I think that it is interesting that you expect the same users who
>can't vectorize their codes on the current vector machines to be able
>to figure out how to parallelize them on these scalable MIMD boxes.
>It seems to me that the automatic parallelization problem is much

>worse than the automatic vectorization problem, ...

Yes, there seems to be the perception running around that
"parallelization" must be harder than "vectorization". I am not
saying it isn't, because I am not a compiler writer, but I sure can
give some reasons why it might not be.

Vectorization requires the same operation to be repeatedly performed
on the elements of a vector. Parallel processors can perform
different operations, such as conditional branching within a loop that
is being performed in parallel.

Dependencies between loop iterations can be handled in a PP that has
the appropriate communication capabilities, whereas most (all?)
vector machines require that all elements be independent (except for
certain special cases, like summation and dot product.) This can be
done by message passing, or if you have shared memory, with
interlocks.

Parallel processors are not limited to operations for which there are
corresponding vector instructions provided in the hardware.
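
To make the first two points concrete, here is a hedged sketch in C of a
loop that a MIMD machine handles naturally; the split into [lo, hi)
slices per processor and the final combine are generic assumptions, not
any particular machine's primitives:

    /* Each processor runs this over its own slice [lo, hi) of the data.
       The per-element if/else forces masked operations on vector
       hardware, but costs a MIMD node nothing special. */
    double partial_work(const double *x, int lo, int hi)
    {
        double sum = 0.0;
        int i;
        for (i = lo; i < hi; i++) {
            if (x[i] > 0.0)
                sum += x[i] * x[i];          /* one path per element...     */
            else
                sum -= 1.0 / (1.0 - x[i]);   /* ...or a quite different one */
        }
        /* Partial sums are then combined, either under a shared-memory
           interlock or by sending them to one node as messages. */
        return sum;
    }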

Well that's all I can think of right now. Anyone else care to add
anything?
-Stan

Chris Moriondo

unread,
Oct 18, 1989, 6:50:29 PM10/18/89
to
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>This is the stuff of research papers right now, and rapid progress is being
>made in this area. The key issue is not having the components which establish
>the interconnect cost much more than the microprocessors, their off chip
>caches, and their main memory.

The only really scalable interconnect schemes of which I am aware are
multistage interconnects which grow (N log N) as you linearly increase the
numbers of processors and memories. So in the limit the machine is essentially
ALL INTERCONNECT NETWORK, which obviously costs more than the processors and
memories. (Maybe this is what SUN means when they say "The Network IS the
computer"? :-) How do you build a shared-memory multi where the cost of the
interconnect scales linearly? Obviously I am discounting busses, which don't
scale well past very small numbers of processors.
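
To make the (N log N) growth concrete (a sketch; k-by-k crossbar switch
chips are an assumption, and the count ignores wiring and packaging):

    #include <stdio.h>

    /* An Omega/butterfly-style network built from k x k crossbars needs
       log_k(N) stages of N/k switches each, so the switch count grows as
       (N/k) log_k(N) while the processor count grows only as N. */
    int main(void)
    {
        long k = 4;                           /* assumed switch radix */
        long n, p, stages;
        for (n = 16; n <= 65536; n *= 16) {
            for (stages = 0, p = 1; p < n; p *= k)
                stages++;                     /* stages = log_k(n)    */
            printf("N=%6ld  stages=%ld  switch chips=%ld\n",
                   n, stages, (n / k) * stages);
        }
        return 0;
    }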


>We have been through message passing hypercubes and
>the like, which minimize hardware cost while maximizing programmer effort.
>I currently lean to scalable coherent cache systems which minimize programmer
>effort.

While message passing multicomputers maximize programmer effort in the sense
that they don't lend themselves to "dusty deck" programs, they have the
advantage that the interconnect costs scale linearly with the size of the machine.
They also present a clean abstraction that exposes the true cost
of operations to the programmer. I read a paper by (I think) Larry Snyder
wherein he argued that the PRAM abstraction causes programmers to produce
suboptimal parallel algorithms by leading one to think that simple operations
have linear cost when in reality they can't be better than N log N.

chrism

-- Insert usual disclaimers here --

Bruce Albrecht

unread,
Oct 18, 1989, 6:52:41 PM10/18/89
to
In article <35...@lll-winken.LLNL.GOV>, bro...@vette.llnl.gov (Eugene Brooks) writes:
> Unfortunately, no 4 processor system will ever need more than 32 bit
> addresses, so we will have to BEG the micro vendors to put in bigger
> pointer support..

Oh really? CDC has several customers that have databases that exceed 2**32
bytes. Our file organization considers files to be virtual memory segments.
We already need pointers larger than 32 bits. IBM's AS400 has a virtual
address space greater than 32 bits, too. If the micro vendors don't see a
need for it, they're not paying attention to what the mainframes are really
providing for their very large system customers.

Chris Moriondo

unread,
Oct 18, 1989, 7:01:46 PM10/18/89
to
In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:
>3) personally I feel parallel computing has no real future as the single cpu
>gets a 2-4 folds performance boost every few years, and parallel machines
>constructions just can't keep up with that. It seems to me that for at least
>the next 10 years, non-parallel machines will still give the best performance
>and the best performance/cost.

Actually, the rate of improvement in single-cpu performance seems to have
flattened out in recent supercomputers, and they have turned to more
parallelism to continue to deliver more performance. If you project the
historical slope of supercomputer clock rates, you would expect sub-nanosecond
CYCLE times before 1995, and I don't see any technologies in the wings which
promise to allow that trend to continue...

chrism

Stephen Tihor

unread,
Oct 18, 1989, 10:28:00 PM10/18/89
to
Actually, all of DEC's new systems above the desktop are multiprocessors. Gotta
keep moving up to avoid them there killer micros.

victor yodaiken

unread,
Oct 18, 1989, 11:44:59 PM10/18/89
to
In article <76...@bunny.GTE.COM> hh...@GTE.COM (Horace Dediu) writes:
>Consider the 8k processor NCUBE 2--"The World's Fastest Computer."
>(yes, one of those). According to their literature:
>"8,192 64 bit processors each equivalent to one VAX 780. It delivers
>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
>performance of any other currently available or recently announced
>supercomputer." It's distributed memory .5MB per processor, runs UNIX,
>and is a hypercube.
>
>I don't know the price, but I bet it's less than a Cray.

I'd like to see the delivered price of an 8k processor system.

>Interesting to
>talk about GigaFLOPS. This is fast.
>

This sounds like one of those total b.s. measures obtained by
multiplying the number of processors by the max mips/mflops rate
per processor.

>
>Who cares about shared memory? Distributed is the only way to scale.
>Everybody realizes this since it can be proven.

Proof citation? Sketch?

There is a lot of mythologizing about parallelism. Parallel processing
is a standard technique for speed which is used in every carry-lookahead
adder, every bus, etc. It seems reasonable to believe that parallelism
will be an important technique in the future. It seems POSSIBLE that
using multiple cpu's will be a useful technique. On the other hand
there is no reason why this technique must work, and it seems at least
as possible that cpu's should not be the basic unit of parallel
computation.

>It's no secret that every big iron maker and every supercomputer shop is
>developing parallel machines. These are still modest efforts (<100 cpu's),
>but the leading edge is now in the 10k coarse grained, 64k fine grained
>processors. This should scale nicely to 1M processors in the next decade.
>After that we can expect some kind of new barriers to come up.

I admire your confidence, but am unconvinced. Evidence?

victor

Shahin Kahn

unread,
Oct 19, 1989, 1:13:31 AM10/19/89
to
In article <22...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:
>In article <90...@batcomputer.tn.cornell.edu> ka...@tcgould.tn.cornell.edu writes:
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers. The 300x300 is barely big enough
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be. The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.

It certainly is biased towards micros with limited memory and is
absolutely irrelevant as a *supercomputer* application. Yes, it
can show how bad a supercomputer can be. I don't believe, however,
that it was *intended* to do that. My theory is that the size was
set smallish because a 100x100 matrix wasn't considered so small
back then, and the algorithm is suboptimal because that's what
level-1 BLAS and the LINPACK library did back then. There were
people who were implementing equivalents of level-2 BLAS and even
the emerging LAPACK kernels, but those didn't get blessed by the
Linpack benchmark until the 300x300 and finally the 1000x1000 were
included.

DON'T compare supercomputers with the 100x100 Linpack.
If you must use Linpack, at least use the 300x300! And then
submit 37 copies of the program at once and see which machine
does 37 copies the fastest, and which one allows you to use the
keyboard while it's running them!!

Read Jack Worlton's paper in Supercomputer Review of Dec. 88 or Jan. 89,
and then if you still want to use Linpack, at least you'll do it
with better knowledge.

Herman Rubin

unread,
Oct 19, 1989, 8:20:18 AM10/19/89
to
In article <3...@argosy.UUCP>, i...@argosy.UUCP (Ian L. Kaplan) writes:
> In article <20...@princeton.Princeton.EDU> m...@notecnirp.edu (Michael Golan) writes:

.......................

> Even n-cube machines run applications like Monte Carlo simulation
> with _much_ better price performance than supercomputers. Now it
> might be claimed that this is a special class of applications.
> However parallel processors are not limited to n-cubes. The
> Connection Machine has beaten Cray machines on a number of classic
> vectorizable codes (e.g., fluid flow). For reference see "Proceedings
> of the Conference on Scientific Applications of the Connection
> Machine", 1988, Edited by H. D. Simon, World Scientific press. Note
> that the Connection Machine is probably less than half the cost of the
> Cray. I am sure that even cheaper SIMD processors will appear in the
> near future.

......................

All computationally efficient means of generating non-uniform random numbers
involve what are called acceptance-rejection or acceptance-replacement methods.
These are most easily done on stream vector machines, and next on machines
which have at least (vector-register)-memory transfer with non-rigid vectors,
that is, moves in which the order of the items moved is fixed, but very
definitely not the locations. Not all vector machines have this capability,
and replacement is not vectorizable.

The problem is worse with MIMD, although something is salvageable, but
SIMD suffers from intrinsic problems. If the replacement procedure could
be added to hardware, SIMD would only suffer a moderate penalty.
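
For readers who have not met these methods, a sketch of one classic
acceptance-rejection generator (Marsaglia's polar method for normal
deviates; the C library rand() is used purely for illustration):

    #include <stdlib.h>
    #include <math.h>

    /* Draw points uniformly in the square, reject those outside the unit
       circle (about 21% of candidate pairs), transform the survivor. */
    double normal_deviate(void)
    {
        double u, v, s;
        do {
            u = 2.0 * rand() / (double)RAND_MAX - 1.0;
            v = 2.0 * rand() / (double)RAND_MAX - 1.0;
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        return u * sqrt(-2.0 * log(s) / s);
    }

The data-dependent retry loop is the difficulty: lockstep SIMD either
waits for the slowest element or runs under masks, and a vector machine
has to compress the accepted elements back into a dense vector, which is
the non-rigid move capability mentioned above.
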
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)

Wm E Davidsen Jr

unread,
Oct 19, 1989, 8:39:04 AM10/19/89
to
In article <35...@lll-winken.LLNL.GOV>, bro...@vette.llnl.gov (Eugene Brooks) writes:
| The MAJOR problem with current micros
| in a scalable shared memory environment is their 32 bit addressing.

| Unfortunately, no 4 processor system will ever need more than 32 bit
| addresses, so we will have to BEG the micro vendors to put in bigger
| pointer support..

The Intel 80386 has 32 bit segments, but it's still a segmented system,
and the virtual address space is (I believe) 40 bits. The *physical*
space is 32 bits, though. The 586 has been described in the press as a
64 bit machine. Seems about right; the problem which people are seeing
right now is that file sizes are getting over 32 bits, and that makes all
the database stuff seriously ugly, complex, and subject to programming
error.

I think you can assume that no begging will be needed, but if you let
the vendors think that you need it the price will rise ;-)

John D. McCalpin

unread,
Oct 19, 1989, 9:06:41 AM10/19/89
to
In article <90...@batcomputer.tn.cornell.edu> ka...@tcgould.tn.cornell.edu
writes:
>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>are interested in supercomputers. The 300x300 is barely big enough

In article <22...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs)
writes:
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be. The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.

In article <90...@batcomputer.tn.cornell.edu> ka...@batcomputer.tn.cornell.edu
(Shahin Kahn) writes:
>It certainly is biased towards micros with limited memory and is
>absolutely irrelevant as a *supercomputer* application. Yes, it
>can show how bad a supercomputer can be.

Well, I'll throw in my $0.02 of disagreement with this thread. It
has been my experience that the poor performance of the LINPACK
100x100 test on supercomputers is *entirely typical* of what users
actually run on the things. There are plenty of applications burning up
Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
*shorter* than the average of 66 elements for the LINPACK test, and
which are furthermore loaded down with scalar code.

The 100x100 test case is not representative of *everyone's* jobs, but
it is not an unreasonable "average" case, either. I think it is much
more representative of what most users will see than the 1000x1000
case, for example. The 1000x1000 case is a very good indicator of
what the *best* performance will be with *careful optimization* on
codes that are essentially 100% vectorizable. Most of the
supercomputer workload does not fall into that category....

Jeff d'Arcy

unread,
Oct 19, 1989, 9:24:52 AM10/19/89
to
From article <76...@bunny.GTE.COM>, by hh...@GTE.COM (Horace Dediu):

> Who cares about shared memory? Distributed is the only way to scale.

I'd like to see you back this one up with some *real* proof. I could
possibly agree with the statement that distributed is the *best* way
to scale, or that distribution is necessary for *large* (>~30) scale
multiprocessing. I think that shared memory architectures will still
be viable for a long time, perhaps as a component of a distributed
environment. If you disagree please provide reasons.

Jeff d'Arcy jda...@encore.com "Quack!"
Encore has provided the medium, but the message remains my own

Robert Cousins

unread,
Oct 19, 1989, 9:28:03 AM10/19/89
to

In 1947, John Von Neumann anticipated that 4K words of 40 bits each was enough
for contemporary problems, and so the majority of machines then had that much
RAM (or what passed for it in the technology of the day). This is ~2**12
bytes worth of usefulness in today's thinking (though not in bits). Over the
next 40 years we've grown to the point where 2**32 bytes is a common theoretical
limit, with a large number of machines in the 2**30 byte range being
fairly common. This translates into 18-20 bits of address over 40 years.
Or, 1 bit of address every 2 years or so.
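
The trend arithmetic as a sketch (the endpoints are the rough figures
above, taking 2**31 as the middle of the 2**30..2**32 range):

    #include <stdio.h>

    /* ~2**12 useful bytes in the late 1940s to ~2**31 bytes now:
       about 19 bits of address reach gained in roughly 40 years. */
    int main(void)
    {
        double bits_then = 12.0, bits_now = 31.0, years = 40.0;
        printf("%.2f address bits per year (about 1 bit every 2 years)\n",
               (bits_now - bits_then) / years);    /* ~0.5 */
        return 0;
    }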

Given the trend to having micro architectures last 5 to 8 years, this means
that a micro architecture should have at least 4 additional address lines
at its announcement, or 5 additional when its development is started.

In the PC space, 16 megabytes seems to be the common upper limit. Any PC
therefore should have not 2**24 as a limit but 2**26 at the minimum.

IMHO, at least :-)

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

Chris Tomlinson

unread,
Oct 19, 1989, 9:57:32 AM10/19/89
to
From article <20...@princeton.Princeton.EDU>, by m...@notecnirp.Princeton.EDU (Michael Golan):

> In article <76...@bunny.GTE.COM> hh...@GTE.COM (Horace Dediu) writes:
>>
>>Consider the 8k processor NCUBE 2--"The World's Fastest Computer."
>>(yes, one of those). According to their literature:
>>"8,192 64 bit processors each equivalent to one VAX 780. It delivers
>>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
>
> This imply a VAX 780 is a 7 mips machine ?

The architecture of the processor is similar to the VAX ISA, not the performance.

>
>>performance of any other currently available or recently announced
>>supercomputer." It's distributed memory .5MB per processor, runs UNIX,

> ^^^^^^^^^^^^^^^^^^^^^^
>>and is a hypercube.
>
> .5MB ? And this is faster than a Cray? How many problems you can't even

I understand that NCUBE makes provisions for up to 64MB per node on
those systems using the 64 bit processors. They also apparently have
incorporated a through-routing capability in the processors similar to
that found on the Symult mesh-connected machines.

> solve on this? And for how many, a 32Mb single VAX 780 will beat ?!
> One of the well known problems wtih Hypercubes is that if you look at a job
> that uses the whole memory (in this case 4Gb = Big Cray), a single machine
> with the same performance of one processor (and all memory) will be almost
> as good and sometimes even better.

The current trends in distributed memory MIMD machines are towards very
low communication latencies by comparison with the first generation
machines that used software routing and slow communication hardware.
This has a tendency to drive the machines more towards shared-memory-like
access times. Of course, physical limitations simply mean that
DM-MIMD machines are a scalable way of approximating shared memory worse
and worse as the machine gets larger, but at least the machine can get
larger.

>
> My original point was that MIMD, unless it has shared memory, is very hard
> to make use of with typical software/algorithms. Some problems can be solved
> nicely on a Hypercube, but most of them can not! And the state of the art

The state-of-the-art in parallel algorithm development is advancing rapidly
as machines become available to experiment on. It is more of an issue of
algorithm design than paralyzing sequential codes. There are quite a
number of problems that are tackled on Crays because of superior scalar
performance that do not make significant use of the SIMD vector capabilities.
I would point to the development of BLAS-2 and -3 as indications that even
on current supercomputers compiler technology just doesn't carry the day by
itself.

> in compilers, while having some luck with vectorized code, and less luck
> with shared memory code, has almost no luck with message-passing machines.
>
>
> Michael Golan
> m...@princeton.edu
> My opinions are my own. You are welcome not to like them.

Chris Tomlinson
tom...@MCC.COM
--opinions....

John Hascall

unread,
Oct 19, 1989, 10:29:17 AM10/19/89
to
In article <???> b...@odeon.ahse.cdc.com (Bruce Albrecht) writes:

}In article <???>, bro...@vette.llnl.gov (Eugene Brooks) writes:
}> Unfortunately, no 4 processor system will ever need more than 32 bit

}Oh really? CDC has several customers that have databases that exceed 2**32
...

}We already need pointers larger than 32 bits. IBM's AS400 has a virtual
}address space greater than 32 bits, too.

I don't know about CDC, but the AS/400 uses what is called Single Level
Storage, that is, all memory and disk are in one humongous address space.

Many people do require more than 2**32 bytes of disk farm, but very few
people are using 2**32 bytes of memory space--so in a more typical system
the need for pointers of more than 32 bits is rather uncommon %

John Hascall

% although I'm sure we'll hear from a number of them now :-)

David Callahan

unread,
Oct 19, 1989, 10:43:50 AM10/19/89
to
In article <76...@bunny.GTE.COM> hh...@GTE.COM (Horace Dediu) writes:

>Who cares about shared memory? Distributed is the only way to scale.

Perhaps you forgot a smiley? Or perhaps when you say "shared" you mean
"centralized"?

"Shared" memory is part of the virtual machine and clearly can be
implemented on a machine with "distributed" packaging of memory with
processors. The BBN Butterfly and the RP3 are both "distributed"
memory machines in the sense that memory is packaged with processors
and the hardware takes care of building "messages" for every memory
request.

From a programming point of view, machines like the NCUBE have three
(IMHO) serious faults: message passing is done in software and
therefore has orders of magnitude more latency than a "shared" memory
machine; data movement now requires software-controlled cooperation on
both processors; and finally, the programmer must determine the
location of the "most recent" value of every variable and which
processor was the last to write it or next to use it.

I care about shared memory --- it makes parallel machines much easier
to program.

>Everybody realizes this since it can be proven.

>The only reason shared memory machines exist is because we don't yet know
>how to make good distributed machines. (Yeah, right! tell that to Ncube)
>IMHO shared memory is a hack using available bus technology while waiting for
>the real parallel machines to come. (they're already here)

Shared memory has nothing to do with busses --- it has to do with
programming.

>Horace Dediu \"That's the nature of research--you don't know |GTE Laboratories
>(617) 466-4111\ what in hell you're doing." `Doc' Edgerton |40 Sylvan Road
>UUCP: ...!harvard!bunny!hhd0................................|Waltham, MA 02254
>Internet: hh...@gte.com or hhd0%gte...@relay.cs.net..........|U. S. A.

Disclaimer: I work for a company designing a multiprocessor that
supports shared memory programming.

--
David Callahan (da...@tera.com, da...@june.cs.washington.edu,da...@rice.edu)
Tera Computer Co. 400 North 34th Street Seattle WA, 98103

Joel Lessenberry

unread,
Oct 19, 1989, 12:06:18 PM10/19/89
to
In article <16...@atanasoff.cs.iastate.edu> has...@atanasoff.UUCP (John Hascall) writes:
>...
>}We already need pointers larger than 32 bits. IBM's AS400 has a virtual
>}address space greater than 32 bits, too.
>
> I don't know about CDC, but the AS/400 uses what is called Single Level
> Storage, that is, all memory and disk are in one humongous address space.
> John Hascall
>

Is anyone else out there interested in starting an AS/400 thread?

It is IBM's most advanced system..

Single level storage
Object Oriented Arch
Context addressing
Hi level machine Instruction set
64 bit logical addressing
True complete I/D split, no chance for self modifying
code

joel


Joel Lessenberry, Distributed Systems | +1 313 948 3342
jo...@cfctech.UUCP | Chrysler Financial Corp.
joel%cfctec...@mailgw.cc.umich.edu | MIS, Technical Services
{sharkey|mailrus}!cfctech!joel | 2777 Franklin, Sfld, MI

William Tsun-Yuk Hsu

unread,
Oct 19, 1989, 1:20:50 PM10/19/89
to
In article <2...@dg.dg.com> ch...@dg.dg.com (Chris Moriondo) writes:
>
>The only really scalable interconnect schemes of which I am aware are
>multistage interconnects which grow (N log N) as you linearly increase the
>numbers of processors and memories...

>
>While message passing multicomputers maximize programmer effort in the sense
>that they don't lend themselves to "dusty deck" programs, they have the
>advantage that the interconnect costs scale linearly with the size of the machine.

Ummm, message passing does not necessarily mean a single-stage
interconnect. Also, most commercial message passing systems these
days are hypercubes, and it's oversimplifying to claim that the
cost of the hypercube interconnect scales linearly with system size.
Remember that there are O(log N) ports per processor. Check out the
paper by Abraham and Padmanabhan in the '86 International Conference
on Parallel Processing, for another view on interconnect cost
and performance comparisons.

Most point-to-point parallel architectures whose interconnect cost
really does grow only linearly with the system size (constant fan-out
per processor) tend to be things like rings and meshes, which are less
popular for more general purpose parallel computing. Are you referring
to these rather than hypercubes?

Bill Hsu

Eugene Brooks

unread,
Oct 19, 1989, 3:32:45 PM10/19/89
to
>The Cray-XMP is considerably slower than the YMP.
The YMP is 30% faster than the XMP I was referring to. This is
for scalar dominated compiled code and is a rather general result.
Just in case you doubt my sources, I run codes on both a YMP 8/32
and an XMP 4/16 frequently enough to be a good judge of speed.

>The single-processor XMP is no longer a supercomputer.

Only if the difference between supercomputer and not is a 30% speed increase.
I argue that a 30% speed increase is not significant, a frigging factor of
2 is not significant from my point of view. Both the XMP and the YMP
are in the same class. Perhaps later YMPs will have more memory putting
them in a slightly improved class.

>Take a program requiring more than 128MBytes of memory (or 64 MBytes
>for that matter (but I personally prefer more than 256M to exercise the
>VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job)
>and then compare any micro you want
>or any other system you want with the YMP, or something in
>that class, and then try it on a multiprocessor YMP. And please
>STOP USING A SINGLE-PROCESSOR xmp AS THE DEFINITION OF A SUPERCOMPUTER,
>thank you.

I have no interest in single cpu micros with less than 128MB.
I prefer 256 MB. I want enough main memory to hold my problems.


>And it would be nice if people used "LIST PRICE" for "COMPLETE SYSTEMS"
>when comparing prices. (LIST PRICE = PEAK PRICE !!) (COMPLETE SYSTEM =
>with all needed software and a few GBytes of disk with a few controllers)

I am talking list price for the system. A frigging XMP eating micro with
suitable memory, about 64 meg at the minimum, can be had for 60K. The
YMP costs about 3 million a node. The micro matches its performance for
my applications. Which do you think I want to buy time on? Of course,
I prefer a 3 million dollar parallel micro based system which has 50-100
nodes and runs circles around the YMP processor for my application.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

unread,
Oct 19, 1989, 4:02:40 PM10/19/89
to
>I think that it is interesting that you expect the same users who
>can't vectorize their codes on the current vector machines to be able
>to figure out how to parallelize them on these scalable MIMD boxes.
I can only point out specific examples which I have experience with.
For certain Monte Carlo radiation transport codes, vectorization is a
very painful experience which involves much code rewriting to obtain
meager performance increases. I have a direct experience with such
a vectorization effort on a "new" and not dusty deck code. We got
a factor of 2 as the upperbound for performance increases from vectorization
on the XMP. The problem was all the operations performed under masks.
LOTS of wasted cycles. The same problem, however, was easily coded
in an EXPLICITLY PARALLEL language and obtained impressive speedups
of 24 out of 30 processors on a Sequent Symmetry. It ran at 2.8 times
XMP performance on hardware costing much less. We are moving on to
a 126 processor BBN Butterfly-II now which should deliver more than
40 times the performance of the XMP at similar system cost.
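
To illustrate why such codes vectorize so poorly but parallelize so well,
here is a caricature in C; the physics is reduced to a single assumed
absorption probability and a toy random number generator, and none of
this is the actual code referred to above:

    /* Each particle history is a loop whose length and branches depend on
       random events.  Vectorizing across particles means the absorbed and
       scattered cases both execute under masks at every step; a MIMD node
       simply follows its own batch of histories to completion. */
    struct particle { int alive; int collisions; unsigned long seed; };

    static double toy_uniform(unsigned long *seed)    /* illustration only */
    {
        *seed = *seed * 1103515245UL + 12345UL;
        return ((*seed >> 8) & 0xFFFFFFUL) / 16777216.0;    /* in [0,1) */
    }

    void track_batch(struct particle *p, int n)  /* one batch per processor */
    {
        int i;
        for (i = 0; i < n; i++) {
            while (p[i].alive) {
                if (toy_uniform(&p[i].seed) < 0.1)   /* assumed absorption  */
                    p[i].alive = 0;
                else
                    p[i].collisions++;               /* "scatter", continue */
            }
        }
    }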

>It seems to me that the automatic parallelization problem is much
>worse than the automatic vectorization problem, so I think a software
>fix is unlikely....

Automatic vectorization is much easier than automatic parallelization
in a global sense. This is why high quality vectorizing compilers
exist, in addition to the high availability of hardware, and why
automatic GLOBALLY parallelizing compilers don't. The problem with some
codes is that they must be globally parallelized, and right now an
explicitly parallel lingo is the way to get it done.

>In fact, I think I can say it much more strongly than that:
>Extrapolating from current experience with MIMD machines, I don't
>think that the fraction of users that can use a scalable MIMD
>architecture is likely to be big enough to support the economies of
>scale required to compete with Cray and their vector machines. (At
>least for the next 5 years or so).

I do not agree, LLNL (a really big user of traditional supercomputers)
has hatched the Massively Parallel Computing Initiative to achieve
this goal on a broad application scale within 3 years. We will see
what happens...

>What is driving the flight from traditional supercomputers to
>high-performance micros is turnaround time on scalar codes. From my
>experience, if the code is really not vectorizable, then it is
>probably not parallelizable either, and scalable machines won't scale.

Not true, I have several counter examples of highly parallel but scalar codes.

>The people who can vectorize their codes are still getting 100:1
>improvements going to supercomputers --- my code is over 500 times
>faster on an 8-cpu Cray Y/MP than on a 25 MHz R-3000/3010. So the
>market for traditional supercomputers won't disappear, it will just be
>more limited than many optimists have predicted.

Yes, if you are using all 8 cpus on the YMP, and each cpu is spending most of
its time doing 2 vector reads, a multiply and an add, and one vector
write, all chained up, it will run circles around the current killer
micros, which are tuned for scalar performance. This situation will
change in the next few years.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Rodney Doyle Van Meter III

unread,
Oct 19, 1989, 5:33:55 PM10/19/89
to
In article <4...@ctycal.UUCP> ingo...@ctycal.UUCP (Terry Ingoldsby) writes:
>
>Note that I'm not saying to give up on parallel processing; on the contrary
>I believe that it is the only way to do certain activities. I am saying
>that the notion of a general purpose massively parallel architecture that
>efficiently executes all kinds of algorithms is probably a naive and
>simplistic view of the world.

Depends on how you classify "all" algorithms. Nary a machine ever made
is good at every algorithm ever invented.

I suspect fine-grain SIMD machines are the way to go for a broader
class of algorithms than we currently suspect. Cellular automata,
fluid flow, computer vision, certain types of image processing and
computer graphics have all shown themselves to be amenable to running
on a Connection Machine. I'm sure the list will continue to grow. In
fact Dow Jones himself now owns two; anybody know what he's doing with
them?

Peak performance for a CM-2, fully decked out, is on the order of 10
Gflops. This is with 64K 1-bit processors and 2K Weitek FP chips. The
individual processors are actually pretty slow, 10-100Kips, I think.
Imagine what this baby'd be like if they were actually fast! Their
Datavault only has something like 30MB/sec transfer rate, which seems
pretty poor for that many disks with that much potential bandwidth.

Rumors of a CM-3 abound. More memory (1 Mbit/processor?), more
processors (I think the addressing for processors is already in the
neighborhood of 32 bits), more independent actions perhaps going as
far as local loops, etc.

I was told by a guy from Thinking Machines that they get two basic
questions when describing the machine:

1) Why so many processors?

2) Why so few processors?

Answering the second one is easy: It was the most they could manage.
Answering the first one is harder, because the people who ask tend not
to grasp the concept at all.

What do I think? I think the next ten years are going to be very
interesting!

--Rod

Eugene Brooks

unread,
Oct 19, 1989, 5:51:00 PM10/19/89
to
In article <2...@dg.dg.com> ch...@dg.dg.com (Chris Moriondo) writes:
>The only really scalable interconnect schemes of which I am aware are
>multistage interconnects which grow (N log N) as you linearly increase the
>numbers of processors and memories. So in the limit the machine is essentially
>ALL INTERCONNECT NETWORK, which obviously costs more than the processors and
>memories. (Maybe this is what SUN means when they say "The Network IS the
>computer"? :-) How do you build a shared-memory multi where the cost of the
>interconnect scales linearly? Obviously I am discounting busses, which don't
>scale well past very small numbers of processors.
The cost of the interconnect can't be made to scale linearly. You can
only get a log N scaling per processor. The key is the base of the log
and not having N too large, i.e. using a KILLER MICRO and not a pipsqueak.
Eight by eight switch nodes are practical at this point, with four by four
being absolutely easy. Pin count is the main problem, not silicon area.
Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node
system takes 4 stages. Are 4 switch chips cheaper, or equivalent in
cost to a killer micro and 32 meg of memory?
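
The stage counts are easy to check (a sketch, assuming the 8x8 switch
chips above; a system of N nodes needs the smallest s with 8^s >= N):

    #include <stdio.h>

    /* 8^3 = 512 and 8^4 = 4096, so 512 nodes need 3 stages, 4096 need 4. */
    int main(void)
    {
        long sizes[] = { 512, 4096 };
        long k = 8, n, reach, stages;
        int i;
        for (i = 0; i < 2; i++) {
            for (n = sizes[i], stages = 0, reach = 1; reach < n; reach *= k)
                stages++;
            printf("N=%ld: %ld stages of %ldx%ld switches\n", n, stages, k, k);
        }
        return 0;
    }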


SUNS "The network is the computer" is meant for ethernet types of things
but it really does apply to multiprocessors. If you don't have real good
communcation capability between the computing nodes what you can do with
the machine is limited.


Could anyone handle a KILLER MICRO powered system with 4096 nodes?
Just think, 4096 times the power of a YMP for scalar but MIMD parallel
codes. ~400 times the power of a YMP cpu for vectorized and MIMD
parallel codes. It boggles the mind.

bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

unread,
Oct 19, 1989, 10:37:07 PM10/19/89
to
>Assuming 8x8 nodes, a 512 node system takes three stages, a 4096 node
>system takes 4 stages. Are 4 switch chips cheaper, or equivalent in
>cost to a killer micro and 32 meg of memory?
Oops! It should be, are 4 switch chips cheaper than 8 killer micros and
256 Meg of memory. The switch is 4 stages deep, but there are 8 micros
hung on each switch port. The bottom line is that the switch is probably
not more than half the cost of the machine, even given the fact that it
is not a commodity part. Of course, a good design for the switch chip
and node interface might become a commodity part! Depending on the cache
hit rates one might hang more than one micro on each node and further
amortize the cost of the switch.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Donald Lindsay

unread,
Oct 19, 1989, 10:48:01 PM10/19/89
to
Since I care about parallel machines, my two cents worth:

FACT: we have proof-by-existence that

- massively parallel machines can be built, can be reliable, etc.
(defining massive as "more than 1000 processor chips").
- they can have aggregate properties ( GIPS, GFLOPS, GB, GB/s IO)
that are in the supercomputer league. Yes, I have details.
- they allow memory-intensive algorithms, since they can use, in main
memory, the slower/cheaper DRAMs that Cray uses only in backing
memory. Yes, I can back this up.
- for selected large applications, these machines already are the
fastest hardware, and the cheapest hardware. Yes, both.
- for selected applications, these machines aren't that hard to program.

IT SEEMS AGREED THAT:

- MIMD machines can have automatic load balancing, timesharing, etc.
(Actually, the timesharing is called "spacesharing".)
- MIMD machines with virtual memory could conveniently fault pages
around between nodes.
- some applications could use a Connection Machine with millions
of processors.
- programming isn't as easy as we'd like.

RESEARCHERS HAVE FOND HOPES THAT, SOME DAY,

- automatic parallelization onto these machines will become practical
for many applications.
- shared-memory/cache-coherency will be cost-effective on large MIMDs.

MY OPINION:

- most supercomputer applications will wind up on massively
parallel machines.
- there will always be a market for the fastest possible single CPU.
- MIMD machines don't want the fastest possible node, because
(so far) the money is better spent buying several cheaper nodes.
- conventional instruction set architectures are suitable bases for
MIMD nodes.
- "scaling laws" are sophistry when 8K node MIMDs are here now.

Sorry to be so wordy.
--
Don D.C.Lindsay Carnegie Mellon Computer Science

Keith Bierman - SPD Advanced Languages

unread,
Oct 19, 1989, 10:53:28 PM10/19/89
to
In article <33...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:

>I agree with you here. In fact, did anyone notice a recent newspaper article
>(In Tuesday's Merc. News - from Knight Ridder:)
>
>"Control Data to use Mips design"
>
>"Control Data Corp. has cast its lot with Mips Computer Systems, inc. to design
>the brains of its future computers, choosing a new computer architecture
>developed by the Sunnyvale Company."


CDC has been selling the MIPS based SGI workstation under its label
for a while now ... so this is either total non-news ... or CDC has
simply decided to cut SGI out of the picture.

When I had a chance to play with the CDC labeled SGI box I couldn't
find _any_ differences from the SGI equivalent (except that the SGI
had a newer software release and different power up message).

Keith H. Bierman |*My thoughts are my own. !! kbie...@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
Eugene Brooks

Rodger Lea

unread,
Oct 20, 1989, 5:27:17 AM10/20/89
to
From article <17...@cfctech.UUCP>, by jo...@cfctech.UUCP (Joel Lessenberry):

> It is IBM's most advanced system..
>
> Single level storage

^^^^^
At last !!

> Object Oriented Arch

What exactly do you/they mean by object oriented? Are we
talking something along the lines of the Intel approach?

I would be interested in details - anybody in the know ?

Rodge

rod...@chorus.fr

mcdo...@uxe.cso.uiuc.edu

unread,
Oct 20, 1989, 10:57:00 AM10/20/89
to

>Well, I'll throw in my $0.02 of disagreement with this thread. It
>has been my experience that the poor performance of the LINPACK
>100x100 test on supercomputers is *entirely typical* of what users
>actually run on the things. There are plenty of applications burning up
>Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
>*shorter* than the average of 66 elements for the LINPACK test, and
>which are furthermore loaded down with scalar code.

>The 100x100 test case is not representative of *everyone's* jobs, but
>it is not an unreasonable "average" case, either. I think it is much
>more representative of what most users will see than the 1000x1000
>case, for example. The 1000x1000 case is a very good indicator of

I heartily agree with that: I have two projects that I tried on
a big supercomputer: one diagonalized zillions of 20x20 matrices,
the other wants to diagonalize a single 1000000x1000000 matrix.
Neither was suitable for a Cray!

Doug McDonald

Steve Warren

unread,
Oct 20, 1989, 11:59:30 AM10/20/89
to
In article <22...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers. The 300x300 is barely big enough
>
>Danny Sorenson mentioned recently that linpack is sort of intended
>to show how *bad* a computer can be. The sizes are kept
>deliberately small so that the vector machines barely have a chance
>to get rolling.
>
>So, big if you're optimistic; small otherwise.
>
>Preston Briggs


That is sort of like testing a dump-truck on a slalom course. A vector
machine should have balanced performance on scalar code, but you are
buying vector performance primarily.

So, big if you want to see if it can do what you are paying for, small
if you want to see if it can do anything else.

--Steve
-------------------------------------------------------------------------
{uunet,sun}!convex!swarren; swa...@convex.COM

Robert Munck

unread,
Oct 20, 1989, 12:50:37 PM10/20/89
to
In article <12...@crdos1.crd.ge.COM> davi...@crdos1.UUCP (bill davidsen) writes:
>
> The Intel 80386 has 32-bit segments, but it's still a segmented system,
>and the virtual address space is (I believe) 40 bits.

You're both too high and too low. The 386 supports 16,384 segments of up
to 4GB, 14 bits plus 32 bits => 46 bit addresses. HOWEVER, the segments map
into either real memory (page translation disabled), maximum 4GB, or
linear virtual memory (paging enabled), also maximum 4GB. So logical
addresses are 46 bits wide, but the address space they map into is only 4GB.
I think it's cute.
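
A toy sketch of that counting in C (my illustration, not real descriptor-table
semantics; the selector, segment base and offset values are made up):

#include <stdio.h>
#include <stdint.h>

/* A logical address names a segment (14 significant selector bits) plus a
 * 32-bit offset -- 46 bits of names -- but base + offset still lands in a
 * single 32-bit linear space. */
int main(void)
{
    uint64_t sel    = 0x2A17;                 /* hypothetical 14-bit selector index */
    uint64_t off    = 0xDEADBEEFu;            /* 32-bit offset within the segment   */
    uint64_t name   = (sel << 32) | off;      /* the 46-bit logical "name"          */

    uint32_t base   = 0x00400000u;            /* hypothetical base from descriptor  */
    uint32_t linear = base + (uint32_t)off;   /* wraps mod 2^32: only 4GB linear    */

    printf("logical name: 0x%012llx (46 bits)\n", (unsigned long long)name);
    printf("linear addr : 0x%08lx (still 32 bits)\n", (unsigned long)linear);
    return 0;
}
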
-- Bob <Mu...@MITRE.ORG>, linus!munck.UUCP
-- MS Z676, MITRE Corporation, McLean, VA 22120
-- 703/883-6688

John Hascall

unread,
Oct 20, 1989, 12:53:46 PM10/20/89
to
In article <33...@chorus.fr> rod...@chorus.fr (Rodger Lea) writes:
}From article <17...@cfctech.UUCP>, by jo...@cfctech.UUCP (Joel Lessenberry):

}> It is IBM's most advanced system..
}> Object Oriented Arch

} What exactly do you/they mean by object oriented? Are we
}talking something along the lines of the Intel approach?

The AS/400 architecture makes the VAX architecture look
like RISC--it is *so* CISC!!

As I understand it, there are 2 levels of microcode.
Your instruction (I was told one instruction was
"create database") executes the top level of microcode
which in turn executes the bottom level of microcode
which in turn actually causes the hardware to do something.

Most unusual.

John

Mike Butts

unread,
Oct 20, 1989, 12:54:07 PM10/20/89
to
From article <10...@encore.Encore.COM>, by jda...@multimax.UUCP (Jeff d'Arcy):

> From article <76...@bunny.GTE.COM>, by hh...@GTE.COM (Horace Dediu):
>> Who cares about shared memory? Distributed is the only way to scale.
>
> I'd like to see you back this one up with some *real* proof. I could
> possibly agree with the statement that distributed is the *best* way
> to scale, or that distribution is necessary for *large* (>~30) scale
> multiprocessing. I think that shared memory architectures will still
> be viable for a long time, perhaps as a component of a distributed
> environment. If you disagree please provide reasons.

Distributed architectures are obviously most desirable from a hardware point of
view, because they are simple and (nearly) arbitrarily scalable. IMHO there
are two reasons why shared memory architectures will continue to be more
important for many of us for a long time to come.

1) Shared memory systems are *much* easier to program, and software development
costs *much* more than hardware nowadays. I say this based on my experience as
a hardware engineer in a mostly software engineering environment.

2) Many problems have proven inefficient so far on distributed memory
architectures. Distributed machines succeed beautifully on problems where some
real physical space can be mapped onto processor/memory nodes which are
arranged in a regular topology which is similar to the topology of the problem.
Modeling physical media, such as solids, liquids or gases, takes advantage of
the fact that the state of one parcel only directly affects the state of its
nearby neighbors.

Problems with irregular topology, such as electronic circuits, are very much
harder to solve efficiently, because the state of one gate or transistor may
affect the state of another at a great distance. Communications is far more
irregular and expensive, so speedups suffer. Static load balancing among the
processors is also much harder. Shared architectures need not statically
partition the problem, and communication speed depends much less on distance.

I'm aware of new algorithmic technology being developed to attack these
problems, such as distributed time discrete event simulation techniques, but
there's still a lot of work ahead.

I agree that distributed memory is the only way to scale if you can, but there
are important problems which are much more readily solved on shared
architectures, at least so far. A hybrid architecture, with physically
distributed but logically shared memory, of which several examples have been
built, may be the best transition path.
--
Michael Butts, Research Engineer KC7IT 503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton, OR 97005
!{sequent,tessi,apollo}!mntgfx!mbutts mbu...@pdx.MENTOR.COM
Opinions are my own, not necessarily those of Mentor Graphics Corp.

Hugh LaMaster

unread,
Oct 20, 1989, 1:37:40 PM10/20/89
to
In article <MCCALPIN.89...@masig3.masig3.ocean.fsu.edu> mcca...@masig3.masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In article <90...@batcomputer.tn.cornell.edu> ka...@tcgould.tn.cornell.edu
>writes:
>>Throw away ALL your copies of the LINPACK 100x100 benchmark if you
>>are interested in supercomputers. The 300x300 is barely big enough
>
>In article <22...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs)
>writes:
>>Danny Sorenson mentioned recently that linpack is sort of intended
>>to show how *bad* a computer can be. The sizes are kept
>>deliberately small so that the vector machines barely have a chance
>>to get rolling.
>
>In article <90...@batcomputer.tn.cornell.edu> ka...@batcomputer.tn.cornell.edu
>(Shahin Kahn) writes:
>>It certainly is biased towards micros with limited memory and is
>>absolutely irrelevant as a *supercomputer* application. Yes, it
>>can show how bad a supercomputer can be.

I found this particularly amusing. As a longtime defender of Linpack, I have
often been accused of being biased towards big vector machines, because of
the sensitivity of Linpack to memory and FPU bandwidth, and, particularly,
the ability to stream from memory to FPU and back to memory. Now, this
happens to be a very important property of a CPU to effectively run many codes
which I have seen over the years. I never rate machines on the basis of Linpack
in absolute terms, but you can tell a lot about a machine with low Linpack
numbers. I never could understand why people bought 11/780's, for example :-)


>Well, I'll throw in my $0.02 of disagreement with this thread. It
>has been my experience that the poor performance of the LINPACK
>100x100 test on supercomputers is *entirely typical* of what users
>actually run on the things.

I agree that vector startup time is extremely important, and Linpack is a
fairly "nice" program with respect to average vector length, so if vector
startup time is so long as to slow it down significantly, this is significant
to users. On the other hand, the performance is not so poor as it once was.
See below.

> There are plenty of applications burning up
>Cray, Cyber 205, and ETA-10 cycles which have average vector lengths
>*shorter* than the average of 66 elements for the LINPACK test, and
>which are furthermore loaded down with scalar code.

I note, at this point, that the ~7 ns (~142 MHz) ETA10G achieved the fastest
single processor Linpack score of 93 MFLOPS, or, .65 FLOPs/cycle. The
Cyber 205, using earlier compilers, achieved only 17 MFLOPS, on a 20 ns
clock, or, .34 FLOPs/cycle. The Cray Y-MP gets .50 FLOPs/cycle, while
the Cray 1/S (in 1983) got only .15 FLOPs/cycle. The same Cray 1/S today
gets .34 FLOPs/cycle. (It has less memory bandwidth than the Cray X-MP
and Y-MP, so you can see this effect clearly.) The Cray XYs and ETA machines
are capable of achieving around 2 FLOPs/cycle in hardware. My point is
that there has been considerable improvement in both hardware and software
and startup time penalties have been correspondingly reduced.
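
For reference, those FLOPs-per-cycle figures are just Linpack MFLOPS divided
by clock rate in MHz; a minimal check in C using the numbers quoted above:

#include <stdio.h>

int main(void)
{
    /* MFLOPS / MHz = floating-point operations per clock cycle */
    printf("ETA10-G  : %.2f FLOPs/cycle\n", 93.0 / 142.0);  /* ~7 ns clock */
    printf("Cyber 205: %.2f FLOPs/cycle\n", 17.0 /  50.0);  /* 20 ns clock */
    return 0;
}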

What is the relevance of Linpack today? Well, it still has *some* of the
same significance that it always had, but tells less than it used to. When
caches were small, you could extrapolate the 100x100 results to bigger jobs
without worrying. On the big iron, your performance went *up* with larger
problem sizes, so even if 300x300x300 was typical of your problem, you knew
what to expect. Now, with 100x100 fitting in some small caches, you need to
run a bigger job to make sure performance doesn't go *down* dramatically.
(Which it does on some micro based systems, of course.) On the other hand, if
you switch to 300x300, you lose the information contained in the 100x100 case
wrt startup time. So, good numbers tell you even less than they did before,
but bad numbers, in a sense, tell you even more, for the same reason.
I wouldn't buy a machine with a bad Linpack result to do these kinds of
problems, but I would look hard at the set of machines with good results,
and would look further, to see which one was the best for the job at hand.

Sometimes I use a "grep" benchmark just for fun. The Cray Y-MP still greps
faster than any other machine I have tested, but, I agree, it isn't the
world's most cost effective grepper out there :-) As with all benchmarks,
you have to be careful not to fool yourself... I would guess that an amd29000
based system might be the fastest on that particular test.

Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lama...@ames.arc.nasa.gov
Moffett Field, CA 94035
Phone: (415)694-6117

Bruce Albrecht

unread,
Oct 20, 1989, 5:19:36 PM10/20/89
to
In article <126...@sun.Eng.Sun.COM>, khb%ch...@Sun.COM (Keith Bierman - SPD Advanced Languages) writes:
> CDC has been selling the MIPS based SGI workstation under its label
> for a while now ... so this is either total non-news ... or CDC has
> simply decided to cut SGI out of the picture.

As far as I know, CDC will still be selling SGI workstations. CDC will be
working with Mips directly to develop high-performance versions of the Mips
architecture.

Chris Moriondo

unread,
Oct 20, 1989, 5:24:54 PM10/20/89
to
In article <1989Oct19.1...@ux1.cso.uiuc.edu> h...@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) writes:
>In article <2...@dg.dg.com> ch...@dg.dg.com (Chris Moriondo) writes:
>>
>>The only really scalable interconnect schemes of which I am aware are
>>multistage interconnects which grow (N log N) as you linearly increase the
>>numbers of processors and memories...
>>
>>While message passing multicomputers maximize programmer effort in the sense
>>that they don't lend themselves to "dusty deck" programs, they have the
>>advantage that the interconnect costs scale linearly with the size machine.
>
>Ummm, message passing does not necessarily mean a single-stage
>interconnect. Also, most commercial message passing systems these
>days are hypercubes...

Too right. I confess I was thinking more along the lines of the current
crop of fine-grained mesh-connected message-passing multicomputers that are
being worked on at CALTECH (Mosaic) and MIT (the Jelly-bean machine and
the Apiary.) At least with machines of this ilk you only pay message latency
proportional to how far you are communicating, rather than paying on every
(global) memory reference with the shared-memory approach. Some of the
hot-spot contention results indicate that the cost of accessing memory as
seen by a processor might bear little relationship to its own referencing
behavior.

>...and it's oversimplifying to claim that the


>cost of the hypercube interconnect scales linearly with system size.
>Remember that there are O(logN) ports per processor.

With hypercubes, what concerns me more than the scaling of the number of
ports is the scaling of the length of the longest wires, and the scaling of
the number of wires across the midpoint of the machine. (Unless of course
you can figure out a way to wire your hypercube in hyperspace... :-)
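
The usual counting behind that worry, as a small C sketch (standard
binary-hypercube formulas; the node counts are mine, not the poster's):

#include <stdio.h>

/* A binary hypercube of N = 2^d nodes has d ports per node and N/2 links
 * crossing the midpoint, so bisection wiring grows linearly with N while
 * the longest wires grow with the physical span of the machine. */
int main(void)
{
    for (int d = 4; d <= 12; d += 2) {
        long n = 1L << d;
        printf("N = %5ld nodes: %2d ports/node, %5ld links across the midpoint\n",
               n, d, n / 2);
    }
    return 0;
}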

Jim Becker

unread,
Oct 20, 1989, 5:54:03 PM10/20/89
to
khb%ch...@Sun.COM (Keith Bierman - SPD Advanced Languages) writes:

->In article <33...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:

->>In fact, did anyone notice a recent newspaper article
->>(In Tuesday's Merc. News - from Knight Ridder:)
->>
->>"Control Data to use Mips design"
->>
->>"Control Data Corp. has cast its lot with Mips Computer Systems, inc. to design
->>the brains of its future computers, choosing a new computer architecture
->>developed by the Sunnyvale Company."


In a slightly different vein, there was a reprint in the
Mercury from the New York Times about the very subject that is being
discussed in this thread. This was in the Thursday Mercury, Oct 19th,
Business section.

The upshot of the article is that the next versions of the
chips from the likes of Intel and Mips [and Sun] will eclipse the likes
of Cray. They discuss the development of RISC strategy and superscalar
as well as vector technology. Good reading.

I believe the side effect of this article in the New York
Times will be the exposition of this computer evolution to the people
that have the money and the budgets to buy equipment. The techies can
argue and argue with management, but when the bosses read it in the
Times it becomes relevant. Hopefully it will open their minds to
the potential described.

-Jim

--
Jim Becker / jcb%fri...@sun.com / Sun Microsystems
...these are my opinions, and even my id disagrees..

Jan Vorbrueggen

unread,
Oct 20, 1989, 11:14:45 PM10/20/89
to
In article <10...@venera.isi.edu> r...@venera.isi.edu.UUCP (Rodney Doyle Van Meter III) writes:
>In article <4...@ctycal.UUCP> ingo...@ctycal.UUCP (Terry Ingoldsby) writes:

>> ... I am saying


>>that the notion of a general purpose massively parallel architecture that
>>efficiently executes all kinds of algorithms is probably a naive and
>>simplistic view of the world.

>Depends on how you classify "all" algorithms. Nary a machine ever made
>is good at every algorithm ever invented.

I learned in school that it is hard to write a good numerical algorithm
(e.g., to solve differential equations), but fairly easy to find an example
which makes it stand in the rain. Maybe the same applies to building
computers :-)

Rolf

Sean Fagan

unread,
Oct 21, 1989, 2:41:07 AM10/21/89
to
>The Cray-XMP is considerably slower than the YMP.
>The single-processor XMP is no-longer a supercomputer.
>Take a program requiring more than 128MBytes of memory (or 64 MBytes
>for that matter (but I personally prefer more than 256M to exercise the
>VM system a little!)) (i.e. a relatively BIG job, a *supercomputer* job)

What?! Uhm, exercising the Cray's VM system is definitely going to be an
interesting job -- Seymour doesn't *believe* in VM! (Well, anecdote has it
that he doesn't *understand* it 8-).)

I have mixed feelings about VM (as anybody who's seen more than three of my
postings probably realizes 8-)): on one hand, yes, getting page faults will
tend to slow things down. However, the system can be designed, from a
software point of view, in such a way that page faults will be kept to a
minimum. Also, having about 4 Gbytes of real memory tends to help. And,
face it, swapping programs in and out of memory can be a time-consuming
process, even on a Cray -- if you're dealing with 100+ Mword programs!

Other supercomputers have VM, of course. However, I have never gotten the
chance to play on, say, an ETA-10 to compare it to a Cray (I asked someone,
once, at FSU for an account, and I was turned down 8-)). My personal
opinion is that the machine is not as fast for quite a number of
applications, but having the VM might help it beat a Cray in Real-World(tm)
situations. Anybody got any data on that?

And, remember: memory is like an orgasm: it's better when it's real
(paraphrasing Seymour). 8-)

--
Sean Eric Fagan | "Time has little to do with infinity and jelly donuts."
se...@sco.COM | -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422 | Any opinions expressed are my own, not my employers'.

Rick Stein

unread,
Oct 21, 1989, 1:01:38 PM10/21/89
to

In article <2...@dg.dg.com> ch...@dg.dg.com (Chris Moriondo) writes:

>In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>While message passing multicomputers maximize programmer effort in the sense
>that they don't lend themselves to "dusty deck" programs, they have the
>advantage that the interconnect costs scale linearly with the size machine.

Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized
to exploit the linear scalable potential of the multicomputer. To my
knowledge, no university in the U.S. teaches how to create linear scalable
software, the cornerstone of multicomputers. Until the shared-memory
s/w engineering styles are abandoned, no real progress in multicomputing
can begin (at least in this country). Europe and Japan are pressing on
without (despite) us.
>chrism
--
Richard M. Stein (aka, Rick 'Transputer' Stein)
Sole proprietor of Rick's Software Toxic Waste Dump and Kitty Litter Co.
"You build 'em, we bury 'em." uucp: ...{spsd, zardoz, felix}!dhw68k!stein

Barry Shein

unread,
Oct 21, 1989, 7:49:30 PM10/21/89
to

Why oh why are people searching around for a philosopher's stone of
computing architectures? This isn't science, this is aberrant
psychology.

There exist problems which are best run on (pick one or more) MIMD,
SIMD, scalar, vector-processors, and/or hybrids. There also exist
problems which don't care what they're run on, at least not much.

For example, time-sharing and typical database environments seem to
run very well on MIMD systems for very little re-programming effort
(in the case of time-sharing, none.) This is because of the large
granularity of the applications and their measuring of "runs well" as
mere response time.

There are other examples for MIMD and examples for SIMD and other
types of processors.

To say (MIMD/SIMD) processors are a bad idea because there exists some
large set of problems which are either impossible or very hard to
optimize for those architectures is so goddamn stupid it boggles the
mind.

MIMD processors are relatively easy and cheap to build out of mostly
commodity parts.

SIMD processors appear to be very, very good at certain classes of
problems which are very important to some people. Important enough
that they'll buy a SIMD box just to run those problems and tell people
with other problems that don't fit so well to go fly a kite (which is
the mathematically correct answer to them.)

We've had multi-processing and hardware optimizations almost since
computing began. What do you think makes a mainframe a mainframe?
Multi-processing, particularly in the I/O channels because mainframes
are bought for high I/O throughput. Most DP shops aren't CPU bound,
they're I/O bound, so they buy their CPUs in the channels.

It continues to astound me how, particularly in the academic computer
science community, some dodo will stand in front of an audience, show
that there exists a class of problems which don't run well on parallel
architectures, and conclude that therefore parallel architectures are
bad.

What *frightens* me is that N people will sit in the audience and nod
their heads in agreement and go out to spread this gospel (as we've
seen on this list) instead of riding the dodo out on the first rail.

LOOK, there are folks out there using all these architectures and
winning big. Consider that before you attempt to prove on paper again
that it's impossible for a honeybee to fly.

What would be *useful* would be a taxonomy of algorithms classified by
how well (and with how much adaptation effort) they map onto various architectures.

But it's so much easier to throw peanut shells from the bleachers.
--
-Barry Shein

Software Tool & Die, Purveyors to the Trade | b...@world.std.com
1330 Beacon St, Brookline, MA 02146, (617) 739-0202 | {xylogics,uunet}world!bzs

John D. McCalpin

unread,
Oct 22, 1989, 9:50:56 AM10/22/89
to
In article <1989Oct21....@world.std.com> b...@world.std.com

(Barry Shein) writes:
>Why oh why are people searching around for a philosopher's stone of
>computing architectures? This isn't science, this is aberrant
>psychology.

While there may be some small component of "holy grail"-itis in this
thread, I think that most people are discussing a different problem
here. This isn't science --- it is technology plus market forces....

>There exist problems which are best run on (pick one or more) MIMD,
>SIMD, scalar, vector-processors, and/or hybrids. There also exist
>problems which don't care what they're run on, at least not much.

I don't think that anyone disputes this. The question is whether or
not each of these types of architectures can acquire a sufficient
market to be competitive with whatever architecture is selling best,
and which therefore has the most R&D money and the best economies of
scale.

>To say (MIMD/SIMD) processors are a bad idea because there exists some
>large set of problems which are either impossible or very hard to
>optimize for those architectures is so goddamn stupid it boggles the
>mind.

On the other hand, it is a perfectly reasonable thing to decide that
it is not worth my time to learn how to work with/program on/optimize
on a particular architecture because it is not likely to be
commercially successful. It is easy enough to be wrong about the
commercial success aspects, but it is an unavoidable question.

>It continues to astound me how, particularly in the academic computer
>science community, some dodo will stand in front of an audience, show
>that there exists a class of problems which don't run well on parallel
>architectures, and conclude that therefore parallel architectures are
>bad.
>What *frightens* me is that N people will sit in the audience and nod
>their heads in agreement and go out to spread this gospel (as we've
>seen on this list) instead of riding the dodo out on the first rail.

I believe that it is far more common for people to conclude that
parallel architectures are not an effective approach for the class of
problems being discussed. Since many of us out here are _users_,
rather than designers, it is hardly surprising that we would downplay
the potential usefulness of architectures that are believed to be
unhelpful in our chosen work. This is quite a reasonable response --
it reminds me of democracy and enlightened self-interest and all of
that stuff.... :-)

>LOOK, there are folks out there using all these architectures and
>winning big. Consider that before you attempt to prove on paper again
>that it's impossible for a honeybee to fly.

It is a good point to note here that lots of the people who think that
parallel architectures are not useful in their field are wrong.

>But it's so much easier to throw peanut shells from the bleachers.

And so much more fun! ;-)
--
John D. McCalpin - mcca...@masig1.ocean.fsu.edu
mcca...@scri1.scri.fsu.edu
mcca...@delocn.udel.edu

Robert Claeson

unread,
Oct 22, 1989, 11:51:09 AM10/22/89
to
In article <1278...@acf4.NYU.EDU> ti...@acf4.NYU.EDU (Stephen Tihor) writes:

>Actually all of DEC's new systems above the desktop are multiprocessors. Gotta
>keep moving up to avoid them there killer micro's.

A multiprocessor system is *not* the same thing as a parallel system. DEC's
systems are symmetric multiprocessing systems (if only Ultrix could now do
symmetric multiprocessing) but *not* parallel systems. Don't confuse these
two concepts, folks. Almost all parallel systems can function as symmetric
multiprocessors as well. The inverse is not always true.

--
Robert Claeson E-mail: rcla...@erbe.se
ERBE DATA AB

Piercarlo Grandi

unread,
Oct 22, 1989, 2:35:40 PM10/22/89
to

In article <17...@cfctech.UUCP> jo...@cfctech.UUCP (Joel Lessenberry) writes:
is anyone else out there interested in starting an AS/400 thread?

It is IBM's most advanced system..

Single level storage
Object Oriented Arch
Context addressing
Hi level machine Instruction set
64 bit logical addressing
True complete I/D split, no chance for self modifying
code


Rumours exist that the AS/400 (nee S/38) is the result of putting
Peter Bishop's dissertation (a landmark work) "Very large address
spaces and garbage collection", MIT TR 107, in the hands of the
same team that had designed the System/3 (arrgghh!). IMNHO the
S/38 is a poor implementation of a great design. That it still is
good is more a tribute to the great design than to the
implementation skills of the System/3 "architects".
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.abe...@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: p...@cs.aber.ac.uk

Shahin Kahn

unread,
Oct 22, 1989, 7:07:29 PM10/22/89
to
In article <4650...@uxe.cso.uiuc.edu> mcdo...@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!

Perhaps you could use a different algorithm to diagonalize
many 20x20 matrices. It sounds like the kind of thing that
could be re-written with significant improvement.
Like I was going to say in response to Eugene, you can't get
behind the wheel of a Ferrari, go to the school zone, and
complain that you can't go faster than 25 MPH!
If you have lots of scalar code (or code that doesn't run very fast
on a super), and if it can't be rewritten, all you are doing
is announcing *your* departure from supercomputers. There will
always be applications that need the combination of speed, memory size,
memory bandwidth, IO bandwidth, disk size, the number
of users supported, etc. that only supers provide. Yes, as advances
are made, a super will be just a smart way of integrating off-the-shelf
components rather than designing the whole machine from scratch
(although, as someone else pointed out, there is something to be said
about total re-designs).
And don't assume for a second that those who make supers have not
noticed this!

But that's not really the point of this posting!! My real question is:

1) Is your 1000000x1000000 matrix dense? If not, how sparse is it?
2) How did you solve it?
3) On what machine did you solve it?
4) What is the application? (Its not a complex matrix, is it?!)

Finally, I don't see why *any* micro would be better at this one
than a super?!

mcdo...@uxe.cso.uiuc.edu

unread,
Oct 22, 1989, 9:36:00 PM10/22/89
to

In article <4650...@uxe.cso.uiuc.edu> mcdo...@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!
Somebody replies:

>Perhaps you could use a different algorithm to diagonalize
>many 20x20 matrices. It sounds like the kind of thing that
>could be re-written with significant improvement.

The Cray does this just fine. It is just that cheaper computers are more
cost effective (even with the Cray doing the vectorizable part
efficiently) - but, at the present time, the more important reason
not to use the Cray for things like this is accessibility,
which is better for lesser machines.

>But that's not really the point of this posting!! My real question is:

>1) Is your 1000000x1000000 matrix dense? If not, how sparse is it?

It is dense. Totally dense. No element is (in principle) zero.

>2) How did you solve it?

Are you kidding? All I get is uproarious laughter. But, now
and then, it begins to look like hope in the next few years!

On the other hand, we have a theoretician here who works on
equally large, but very sparse (I don't know how "very") matrices. These
he does just fine. I don't really know how, as he is rather secretive,
but he does say that the methods are well known ones for sparse
matrices.

>3) On what machine did you solve it?

The guy with the sparse one does fine on Vaxes, Crays, and Mips machines.

>4) What is the application? (Its not a complex matrix, is it?!)

Mine is quantum mechanics of vibrations. No, it is real symmetric.
But I need at least some of the eigenvectors, though the eigenvalues
would be good enough to start with. Actually, generating the elements
is a worse problem than the diagonalization, as it requires
doing as many multidimensional integrals as there are independent
matrix elements. I have done it up to 1000x1000, but the results
are too coarse grained to be comparable to experiments.

I used to do the same problem in classical mechanics. Now THERE
is a good problem for parallel computers: doing many systems
of differential equations in parallel - you get as many sets of data
as one wants, all executing the same instructions at the same time.
In other words, one would have say 100 or more different initial
conditions, all running the same set of 20-50 simultaneous
first order linear equations. This was done way back when, 1973-1975,
on the Illiac IV, which suited it perfectly. Only problem
was, the results only proved that classical mechanics won't
give correct results.
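
That structure is about as data-parallel as it gets; a minimal sketch in C of
the lockstep pattern (sizes, matrix, and step count are made up for
illustration, not taken from the Illiac IV work):

#include <stdio.h>

#define NEQ  20     /* equations per system            */
#define NSYS 100    /* independent initial conditions  */

int main(void)
{
    static double A[NEQ][NEQ];    /* one system matrix, shared by everyone */
    static double x[NSYS][NEQ];   /* one state vector per ensemble member  */
    double dt = 1e-3;

    for (int i = 0; i < NEQ; i++)
        for (int j = 0; j < NEQ; j++)
            A[i][j] = (i == j) ? -1.0 : 0.01;      /* toy coupling          */
    for (int s = 0; s < NSYS; s++)
        for (int i = 0; i < NEQ; i++)
            x[s][i] = 1.0 + 0.01 * s;              /* different start point */

    for (int step = 0; step < 1000; step++)
        for (int s = 0; s < NSYS; s++) {           /* every member executes */
            double dx[NEQ] = { 0 };                /* the same instructions */
            for (int i = 0; i < NEQ; i++)
                for (int j = 0; j < NEQ; j++)
                    dx[i] += A[i][j] * x[s][j];
            for (int i = 0; i < NEQ; i++)
                x[s][i] += dt * dx[i];             /* ...on its own data    */
        }

    printf("member 0, x[0] = %g\n", x[0][0]);
    return 0;
}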

Doug McDonald

Eugene Brooks

unread,
Oct 22, 1989, 10:46:41 PM10/22/89
to
In article <27...@dhw68k.cts.com> st...@dhw68k.cts.com (Rick Stein) writes
a followup to something attributed to me, but 180 degrees out of phase
with my opinion on the great shared memory vs message passing debate:

>Indeed, the "dusty deck" (aka toxic waste dump) is generally not organized
>to exploit the linear scalable potential of the multicomputer. To my
>knowledge, no university in the U.S. teaches how to create linear scalable
>software, the cornerstone of multicomputers. Until the shared-memory
>s/w engineering styles are abandoned, no real progress in multicomputing
>can begin (at least in this country). Europe and Japan are pressing on
>without (despite) us.


The posting he quoted here was incorrectly attributed to me. It was in fact
someone's retort to something I wrote. Scalable shared memory machines, which
provide coherent caches (local memory where shared memory is used as such), are
buildable, usable, and cost effective. Some students and professors at
Caltech, which included someone by the name of Brooks before his rebirth into
the "real" world of computational physics, were so desperate for computer
cycles that they sidetracked the parallel computer industry by hooking up
a bunch of Intel 8086-8087 powered boxes together in a system with miserable
communication performance. Industry, in its infinite wisdom, followed their
lead by providing machines with even poorer communication performance.

When you quote, please be sure to get the right author when it is from a
message with several levels of quoting. I had something to do with the
message passing hypermania, but it is not my party line these days....

bro...@maddog.llnl.gov, bro...@maddog.uucp

Eugene Brooks

unread,
Oct 22, 1989, 11:28:13 PM10/22/89
to
In article <91...@batcomputer.tn.cornell.edu> ka...@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>Like I was going to say in response to Eugene, you can't get
>behind the wheel of a Ferrari, go to the school zone, and
>complain that you can't go faster than 25 MPH!
I have never complained about not being able to go faster than 25MPH in a
school zone when driving to work in my 71 Vette. Last time I checked, my
Vette dusted them thar stinking Italian cars from a stop light.

>If you have lots of scalar code (or code that doesn't run very fast
>on a super), and if it can't be rewritten, all you are doing
>is announcing *your* departure from supercomputers.

The series of supercomputers made by Cray was for some time the fastest you
could get for either scalar or vector coding. LLNL in fact has preferred them
for their superior scalar and short vector performance. (These days, of course,
we are not so pure in our decision making. We prefer Cray machines for their
software compatibility with the Cray-1. Although we are starting to work on
the problem, we have not fully embraced the notion of portable operating
systems or even standard high level language models. We use carbon dating
to keep track of some of our dusty decks.) The Cyber 205 could provide more
performance on long vectors, so I guess by your reasoning that you would call
it a supercomputer and accuse anyone buying a Cray machine of departing from
the real supercomputers. I guess the world has decided to not use enough
REAL SUPERCOMPUTERS, because CDC could not sell enough of them to keep the 205
and its children on the market. Japanese machines blow the doors off the Cray
machines these days (scalar or vector), but CRI is not worried about this.

Their real nightmares have KILLER MICROS in them.

JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE FLOATING
POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER COMPUTER, OR THAT
IT IS THE "REAL SUPERCOMPUTER". THE BOTTOM LINE IS HOW FAST DOES A MACHINE
RUN YOUR APPLICATION. MY APPLICATION HAPPENS TO FARE POORLY ON SIMD MACHINES
AND HAPPENS TO BE VERY EFFICIENT AND HIGHLY PARALLEL ON MIMD MACHINES WITH
PROCESSORS OPTIMIZED FOR SCALAR PERFORMANCE. BECAUSE OF THIS I AM ACCUSED
OF DEPARTING FROM SUPERCOMPUTING. THIS IS GIBBERISH. IF I FIND A MIMD MACHINE
WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS MY APPLICATION
100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF SUPERCOMPUTER.

FOR MY APPLICATION


bro...@maddog.llnl.gov, bro...@maddog.uucp

George R Carr Jr

unread,
Oct 23, 1989, 12:53:50 AM10/23/89
to
In article <MCCALPIN.89...@masig3.ocean.fsu.edu> mcca...@masig3.ocean.fsu.edu (John D. McCalpin) writes:

> .... [software for]
>parallel supercomputers is depressingly immature. I think traditional
>moderately parallel machines (e.g. Cray Y/MP-8) will be able to handle
>existing scientific workloads better than 1000-processor parallel
>machines for quite some time....

I know of several problem domains where I strongly disagree. More than one
aerospace company is currently looking at 1000+ node parallel machines because
no Cray, ETA, NEC, or other 'conventional' machine can give them the time to
solution required. The major area of excitement with parallel machines is to
find the problems for which algorithms exist that are now computable but
are not otherwise computable.

George R Carr Jr internet: ca...@multiflow.com
Multiflow Computer, Inc. uucp: uunet!mfci!mfci-la!carr
16360 Roscoe Blvd., Suite 215 fax: (818)891-0395
Van Nuys, CA 91406 voice: (818)892-7172

Shahin Kahn

unread,
Oct 23, 1989, 1:26:56 AM10/23/89
to
In article <36...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>The YMP is 30% faster than the XMP I was referring to. This is
>for scalar dominated compiled code and is a rather general result.

If you have scalar dominated code that fits in a workstation's memory
and you don't want to run more than one job at a time, then you are right.
I am sure other users of the YMP will be happy to keep the machine busy
and get good 64-bit megaflops.

>>The single-processor XMP is no-longer a supercomputer.

>Only if the difference between supercomputer and not is a 30% speed increase.

I have little desire to defend or promote a YMP, but you can't run a scalar
code on a vector machine and complain, too! On the NASA benchmarks, which
I am sure some of this audience has seen, the YMP sustained over 1 GFlops.
THAT, is significantly faster than a single processor XMP.

REWRITE the code!! Or have someone do it for you (there was a company
that would get your code to run at least twice as fast or your money back;
I forget the name and don't know them or anyone who does).

If they don't perform,
Throw away all the dusty decks. Refuse to use dusty-deck oriented code.
But if that's all the algorithm can do for now, then yes, use whatever
gives you the desired performance at the least lifetime cost (not price!).

>I have no interest in single cpu micros with less than 128MB.
>I prefer 256 MB. I want enough main memory to hold my problems.

A 256 MB micro can cost you some. And not so little. And all that
for just one user. I am not sure the numbers come out. And how about IO
bandwidth and file size? Maybe your application doesn't need any.
Talk to a chemist.
By the time micros become killers, they won't be micros anymore!

>I am talking list price for the system. A frigging XMP eating micro with

Yes. My comment about list-price was not directed at Eugene. Sorry.
I meant to emphasize the importance of using peak-price to go with
peak-performance (I have seen cases where the reported performance is on
a high-end machine, but the reported price is not!).

Grunwald Betr. Tichy

unread,
Oct 23, 1989, 10:47:44 AM10/23/89
to
I followed the articles for some time and want to mention some points.

1. Hardware costs are only a fraction of the cost. To do really big problems you
need lots of support software and you rely on it. So if you use a PC
you will have to write more code (or buy specialized code at a high price)
and trust your version. This is hard because numerical mathematics is not as
easy as it seems, and if your aircraft comes down or your bridge cracks, it's
too late to blame yourself.

2. Parallel computers will need a Pascal-like (C, Modula, Ada, ...) language which
can be compiled and run on a scalable architecture. Nobody wants to rewrite
all their programs when they get more processors. It would be even better to have
it scale at runtime, so the process runs faster if no other users want
the processors.
I know of only the Connection Machine doing that, and this machine is not as
general purpose as a workstation. (What OS does the CM have? What languages?
Can you compile a CM program to work on other computers, not simulated?)

3. Some problems are just too big for a PC. Even if you have a more
sophisticated system than the normal primitive computer, there are a lot
of problems which have already been scaled down to run on supercomputers. So further
downscaling is not possible without a substantial loss of accuracy.
(Accuracy is not only the length of a floating-point number. It's how many points
your grids can have. What differential equations are possible? What about
error control? It's useless getting wrong results faster; you have to know
the error range.)

My opinion is that supercomputers will exist for a long time to come, and MICROS still
have a long way to go to match their performance.
Most people comparing the power don't think of what stands behind the
number crunchers: lots of software packages and big disks to record
the results, which are a big part of the machine's cost.

Don't get me wrong: I'm a micro user (OS9-680x0) and I like it, but I know that
things are not as easy in the supercomputing area as some people might think.

Knut Grunwald, Raiffeisenstr. 8, 7555 Elchesheim-Illingen, West-Germany

Lawrence Crowl

unread,
Oct 23, 1989, 11:21:20 AM10/23/89
to
In article <76...@bunny.GTE.COM> hh...@GTE.COM (Horace Dediu) writes:
>Who cares about shared memory? Distributed is the only way to scale.
>Everybody realizes this since it can be proven. The only reason shared
>memory machines exist is because we don't yet know how to make good
>distributed machines. IMHO shared memory is a hack using available bus
>technology while waiting for the real parallel machines to come.

You are mixing two concepts --- memory architecture (as the processors see it)
and communication interconnect. Commonly available shared memory systems tend
to use a bus interconnect, so people assume that this is the only interconnect
for shared memory. This assumption is wrong. SHARED MEMORY DOES NOT IMPLY A
BUS INTERCONNECT. The BBN Butterfly, IBM RP3, and NYU Ultracomputer all
supported shared memory implemented over a multistage butterfly (FFT-style)
interconnect, without busses.
The Butterfly is commercially available with up to 512 processors. The RP3
is a research machine designed for as many as 512 processors, though I don't
know if IBM has configured one that large. I don't recall the Ultracomputer
size.

SHARED MEMORY IS SCALABLE. If the system supports a scalable interconnection,
and processors have local memory, then a shared memory system is scalable.
With local memory, only information that is truly shared need be communicated
between processors. This is exactly the information that must be communicated
on a distributed memory system via message passing.

SHARED MEMORY IS DESIRABLE. The latency of a remote memory access is typically
two orders of magnitude lower than that of message passing on distributed memory.
Applications with small messages performed on an infrequent basis will see
significant performance improvements. For instance, a shared memory system
can increment a shared counter far faster than any distributed memory system.
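
A minimal sketch, in present-day C rather than anything available in 1989, of
why the shared counter is the easy case: one fetch-and-add on a location every
processor can map, versus a send/receive round trip to whichever node owns the
counter on a message-passing machine.

#include <stdatomic.h>
#include <stdio.h>

static atomic_long hits;     /* lives in memory that every processor maps */

long bump(void)
{
    /* one (possibly remote) memory reference: no kernel call, no message */
    return atomic_fetch_add(&hits, 1);
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        bump();
    printf("counter = %ld\n", atomic_load(&hits));
    return 0;
}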

SHARED MEMORY HAS A COST. Implementing shared memory over a scalable
interconnect may require a larger aggregate bandwidth than that of distributed
memory systems. I don't think there has been enough research here to know
the real tradeoff, but such a result would not surprise me.
--
Lawrence Crowl 716-275-9499 University of Rochester
cr...@cs.rochester.edu Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627

Eugene Brooks

unread,
Oct 23, 1989, 12:24:02 PM10/23/89
to
>If you have scalar dominated code that fits in a workstation's memory
One should not attempt to infer that a workstation's memory is small.
A YMP 8/32 has 4 megawords (32 MB) available per processor. If all you
want is 32 MB per processor you can buy this with a killer micro for
about 40K, simply throw it away in a year when its performance has been
eclipsed by the next killer micro, and still have your computer time work
out to be about 5 dollars an hour. They have the gall to charge $250 an
hour for Cray YMP time, for low priority time at that.
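
The $5-an-hour figure is just amortization arithmetic; a quick check in C (the
$40K price and one-year lifetime are from the post, round-the-clock use is my
assumption):

#include <stdio.h>

int main(void)
{
    double price = 40000.0;        /* killer micro plus 32 MB, per the post */
    double hours = 365.0 * 24.0;   /* one year of round-the-clock use       */
    printf("$%.2f per hour, versus the $250/hr quoted for Y-MP time\n",
           price / hours);
    return 0;
}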

>THAT, is significantly faster than a single processor XMP.
>
>REWRITE the code!! Or have someone do it for you (there was a company
>that would get your code to run at least twice as fast or your money back;
>I forget the name and don't know them or anyone who does).
We did! And we showed that you could asymptotically get the factor of 2
you suggest with infinite work. Why suggest doing such a thing when
one can get a factor of 100 with little work on 100 killer micros?

>If they don't perform,
>Throw away all the dusty decks. Refuse to use dusty-deck oriented code.

This was not a dusty deck. This code was written in the last couple of
years with modern tooling, for both vectorized and MIMD parallel machines.
It is not the code which is scalar, it is the algorithm. One could say
toss out the algorithm, but it is one of the most robust ones available for the
application in question.


>A 256 MB micro can cost you some. And not so little.

But it is much cheaper than a SUPERCOMPUTER for my application,
and it is FASTER. To bring back the car analogy, the accelerator
is still pressed to the metal for speed improvements in killer micros.


bro...@maddog.llnl.gov, bro...@maddog.uucp

Henry Spencer

unread,
Oct 23, 1989, 12:35:58 PM10/23/89
to
In article <74...@linus.UUCP> mu...@chance.UUCP (Robert Munck) writes:
>... The 386 supports 16,384 segments of up
>to 4GB, 14 bits plus 32 bits => 46 bit addresses...

Except that it's not a 46-bit address space, it's a bunch of 32-bit ones.
There is a difference. As witness the horrors that are perpetrated on
8086/88/186/286 machines to try to cover up their lack of a unified
address space. "Near" and "far" pointers, anyone?
--
A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology
megabyte of flaming. | uunet!attcan!utzoo!henry he...@zoo.toronto.edu

Anton Rang

unread,
Oct 23, 1989, 12:51:59 PM10/23/89
to
In article <36...@lll-winken.LLNL.GOV> bro...@vette.llnl.gov (Eugene Brooks) writes:
> They have the gall to charge $250 an
>hour for Cray YMP time, for low priority time at that.

You haven't got much reason to complain...out here I have the
privilege of spending $300/hour for VAX-11/785 time... :-)

Schools can be SO much fun....

Anton

+----------------------------------+------------------+
| Anton Rang (grad student) | ra...@cs.wisc.edu |
| University of Wisconsin--Madison | |
+----------------------------------+------------------+

Henry Spencer

unread,
Oct 23, 1989, 12:59:11 PM10/23/89
to
In article <27...@dhw68k.cts.com> st...@dhw68k.cts.com (Rick Stein) writes:
>...no university in the U.S. teaches how to create linear scalable

>software, the cornerstone of multicomputers. Until the shared-memory
>s/w engineering styles are abandoned, no real progress in multicomputing
>can begin (at least in this country). Europe and Japan are pressing on
>without (despite) us.

What remains to be seen is whether they are pressing on up a blind alley.
Remember where this discussion thread started out: the mainstream of
high-volume development has vast resources compared to the more obscure
byways. Results from those byways have to be awfully damned good if they
are going to be competitive except in ultra-specialized niches. As I've
mentioned in another context, "gonna have to change our whole way of thinking
to go parallel real soon, because serial's about to run out of steam" has
been gospel for quite a while now... but the difficulty of that conversion
has justified an awful lot of highly successful work on speeding up
non-parallel computing. Work which is still going and still succeeding.

I'm neutral on the nationalism -- you're all foreigners to me :-) -- but
highly skeptical on the parallelism.

Eugene Brooks

unread,
Oct 23, 1989, 1:01:04 PM10/23/89
to
In article <1989Oct23.1...@cs.rochester.edu> cr...@snipe.cs.rochester.edu (Lawrence Crowl) writes:
>SHARED MEMORY IS DESIRABLE. The latency of a remote memory access is typically
>two orders of magnitude lower than that of message passing on distributed memory.
Given equivalent performance interconnect, which rarely occurs because the
message passing machines tend to get short changed on the communication hardware,
I have found the "shared memory" systems to have much better communication
performance. This is because the communication between processors is
directly supported in the memory management hardware. In the message passing
machines sending a message invokes a "kernel call" on both the sending and
receiving ends. This system call overhead is much greater than the hardware
latency itself, amounting to a factor of 5 or more. One could try for complex
hardware support of messaging, but a better solution is to just memory map it...

Please note: I am not talking about the really horrible interrupt handling
of message forwarding here. This only compounds a bad situation for kernel
overhead.

bro...@maddog.llnl.gov, bro...@maddog.uucp

Terry Ingoldsby

unread,
Oct 23, 1989, 2:40:03 PM10/23/89
to
In article <76...@bunny.GTE.COM>, hh...@GTE.COM (Horace Dediu) writes:
> The only reason shared memory machines exist is because we don't yet know
> how to make good distributed machines. (Yeah, right! tell that to Ncube)
> IMHO shared memory is a hack using available bus technology while waiting for
> the real parallel machines to come. (they're already here)
....
>
> Of course. To solve hard problems you *need* parallel execution.
> It's no secret that every big iron maker and every supercomputer shop is
> developing parallel machines. These are still modest efforts (<100 cpu's),
> but the leading egde is now in the 10k coarse grained, 64k fine grained
> processors. This should scale nicely to 1M processors in the next decade.
> After that we can expect some kind of new barriers to come up.

It is true that the only hope for the kind of performance improvement that will
be required for the next generation of software is parallel processing. It is
also true that many different parallel processing machines exist. Here is where
I start to be less optimistic about how quickly we can adopt parallel processing.

The basic problem is *NOT* the software. Although awkward to write, it can be
done and it is reasonable to assume that techniques will be developed as the
years go by. The (IMHO) fundamental problem is that different problems partition
differently. By this I mean that to be executed across many P.E.s (processing
elements) a problem must be partitioned. One criterion for partitioning the
problem is to subdivide it in such a way that the different P.E.s do as little
interP.E. communication as is possible. Generally speaking, the more you divide
the problem, the more P.E.s need to talk to each other. This, of itself, is
not necessarily bad (at least, it is tolerable). The problem is that different
classes of problems subdivide in different ways. Thus it is difficult to
set up any *efficient* communication strategy that will let P.E.s talk to
other P.E.s (without going through lots of intermediary P.E.s) for general
problems. Until someone addresses this issue, I am hard pressed to believe
that a *general purpose* parallel machine will be developed. I can easily
foresee a time when there will be a custom (parallel) vision processor in
a computer, a custom speech recognition processor, a custom database processor,
... and so on. I cannot see a given parallel architecture doing all of the
above.
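
The standard surface-to-volume counting behind "the more you divide the
problem, the more P.E.s need to talk", sketched in C (a cubic grid split
evenly over P processors; the grid size is made up):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 512.0;                    /* grid points per edge (made up)  */
    int procs[] = { 8, 64, 512, 4096 };
    for (int i = 0; i < 4; i++) {
        double side = n / cbrt((double)procs[i]);  /* subdomain edge        */
        double work = side * side * side;          /* cells computed per PE */
        double comm = 6.0 * side * side;           /* halo cells exchanged  */
        printf("P = %4d: communication per unit work = %.3f\n",
               procs[i], comm / work);
    }
    return 0;
}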

My 2 cents!

--
Terry Ingoldsby ctycal!ingo...@calgary.UUCP
Land Information Systems or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

King Su

unread,
Oct 23, 1989, 2:42:53 PM10/23/89
to
In article <36...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>...................................... Some students and professors at

<Caltech, which included someone by the name of Brooks before his rebirth into
>the "real" world of computational physics, were so desperate for computer
<cycles that they sidetracked the parallel computer industry by hooking up
>a bunch of Intel 8086-8087 powered boxes together in a system with miserable
<communication performance. Industry, in its infinite wisdom, followed their
>lead by providing machines with even poorer communication performance.

Huh? As far as I know, every commercially available multicomputer that
was built after our original multicomputer has better communication
performance. We did not lead anybody into anything, as Caltech CS has
never been a strong influence on the industry. Nor have we advocated
low communication performance. Today's multicomputers are as much as
three orders of magnitude better in message latency and throughput,
thanks to worm-hole routing hardware. There will be further
improvements when low-dimensional networks are in use.

Perhaps we could have provided more positive influences to the
industry, but we are operating under the guideline that university
research groups should not be turned into joint-ventures. The tax-
payers did not give us money for us to make more money for ourselves.
--
/*------------------------------------------------------------------------*\
| Wen-King Su wen-...@vlsi.caltech.edu Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

Shahin Kahn

unread,
Oct 23, 1989, 3:11:45 PM10/23/89
to
In article <36...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
>In article <91...@batcomputer.tn.cornell.edu> ka...@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>>If you have lots of scalar code (or code that doesn't run very fast
>>on a super), and if it can't be rewritten, all you are doing
>>is announcing *your* departure from supercomputers.
>The series of supercomputers made by Cray was for some time the fastest you
>could get for either scalar or vector coding. LLNL in fact has preferred them
>for their superior scalar and short vector performance. (These days, of course

Well, yes, these days you don't always buy a vector machine for its scalar
speed. The problem is really a lack of modularity in some
vector machines: you have to get the vector whether or not you like
it. There is a definite trend towards more modularity, where the vector
is an option.

In fact, I see no reason why a single scalar can't support more than one vector!

>we are not so pure in our decision making. We prefer Cray machines for their
>software compatibility with the Cray-1. Although we are starting to work on
>the problem, we have not fully embraced the notion of portable operating

Like I said, it's a good idea to
REWRITE the code.
The LLNL problem is not a problem. If you are on the cutting edge of
things, sometimes you get cuts!

>to keep track of some of our dusty decks.) The Cyber 205 could provide more
>performance on long vectors, so I guess by your reasoning that you would call
>it a supercomputer and accuse anyone buying a Cray machine of departing from
>the real supercomputers. I guess that world has decided to not use enough

I will give you the benefit of the doubt and not even respond to this.
Surely you don't really think that's what I meant.
I only used the YMP because you had used the XMP as an example.

>REAL SUPERCOMUTERS, because CDC could not sell enough of them to keep the 205
>and its children on the market. Japanese machines blow the doors off the Cray
>machines these days (scalar or vector), but CRI is not worried about this.

The ETA machine was/is a great machine and absolutely first-rate technology.
The problem of ETA was not the machine or the architecture. It boiled down
to management, even though they had a lot of good people at all levels.
It was too little too late, and a problem of not having deep enough
pockets and a lack of belief, in the business community, in the long term....

> Their real nightmares have KILLER MICROS in them.

YOUR nightmares, maybe.

>
>JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE FLOATING
>POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER COMPUTER, OR THAT
>IT IS THE "REAL SUPERCOMPUTER". THE BOTTOM LINE IS HOW FAST DOES A MACHINE

All I asked was to request everyone to stop using the
single processor XMP as the definition. Indeed, a supercomputer has
many definitions and all of them are relative to some base characteristic.
As you well know.

>PROCESSORS OPTIMIZED FOR SCALAR PERFORMANCE. BECAUSE OF THIS I AM ACCUSED
>OF DEPARTING FROM SUPERCOMPUTING. THIS IS GIBBERISH. IF I FIND A MIMD MACHINE
>WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS MY APPLICATION
>100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF SUPERCOMPUTER.
> FOR MY APPLICATION

BINGO! For Y O U R application! That's the whole point.
Funny that we should start from a general red alert about the attack of
killer micros and end up where we all knew we would: "It depends on
the application".

Yes, if your application is PageMaker, you redefine supercomputing
by going to a Mac IIci!!
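
To put toy numbers on "it depends on the application", here is a
back-of-the-envelope C calculation.  Every constant in it is made up
for illustration (none is a measurement of any real machine); it just
does simple Amdahl-style arithmetic for the same code on a vector
super and on a box of killer micros.

/* Toy comparison, invented numbers only: time-to-solution for one
 * code on a vector super versus a MIMD box of micros, given the
 * fraction of the work that vectorizes or parallelizes.
 */
#include <stdio.h>

int main(void)
{
    double t_micro = 100.0;  /* hours on one micro, all scalar            */
    double f_vec   = 0.90;   /* fraction that vectorizes                  */
    double f_par   = 0.90;   /* fraction that parallelizes across micros  */
    double s_scal  = 0.8;    /* super's scalar speed relative to the micro*/
    double s_vec   = 20.0;   /* super's vector speed relative to the micro*/
    int    nprocs  = 100;    /* killer micros in the MIMD box             */

    /* vector super: scalar part plus vector part */
    double t_super = t_micro * ((1.0 - f_vec) / s_scal + f_vec / s_vec);

    /* MIMD box: serial part plus parallel part */
    double t_mimd  = t_micro * ((1.0 - f_par) + f_par / nprocs);

    printf("vector super: %5.1f h   %d-micro MIMD: %5.1f h\n",
           t_super, nprocs, t_mimd);
    return 0;
}

Push the vector fraction up to 0.99 and the super wins; leave a 10%
serial fraction and 100 micros buy you less than a factor of ten.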

Now, in this case, there is something to be said for highly parallel
machines. Indeed, there is a lot of that happening right here.
My own experience in supercomputing started with loosely coupled systems
and scalar-looking codes.
But it's too early for the supers to pack up and go home. WAY too early.
Universities are usually several years ahead of the market, and they
haven't got it quite right yet.
