Sun Ultra HPC Memory Performance

Hugh LaMaster

Mar 11, 1997

Sun has announced its new Ultra HPC systems today (11 Mar 97),
and according to what I saw, it is a shared-memory/UMA/SMP
design with up to 64 processors in each system, with an
SCI-based interconnect for connecting systems into clusters.

Naturally, the question to ask for a "large-scale" shared-memory
design (and has been since the 60's - see the old architecture
books) is - (drum roll) - memory bandwidth, latency, cost of
synchronization, etc. etc. etc.

Anybody care to characterize the Ultra HPC systems
(bandwidth, latency L1/L2 cache, semaphore overhead, etc)
and compare it to the older Sun Ultra Enterprise systems
as well as other interesting new systems, such as the
(NUMA) SGI Origin 2000 series?

John McCalpin

Mar 12, 1997

In article <3325DB...@nas.nasa.gov>,

Hugh LaMaster <lama...@nas.nasa.gov> wrote:
>Sun has announced its new Ultra HPC systems today (11 Mar 97),

> [....]

>Anybody care to characterize the Ultra HPC systems
>(bandwidth, latency L1/L2 cache, semaphore overhead, etc)
>and compare it to the older Sun Ultra Enterprise systems

This is a very easy comparison to make, since the boxes
are identical except for the labels.

--
John D. McCalpin, Ph.D. Supercomputing Performance Analyst
Scalable Systems Group http://reality.sgi.com/employees/mccalpin
Silicon Graphics, Inc. mcca...@sgi.com 415-933-7407

Nick Maclaren

Mar 12, 1997

In article <3325DB...@nas.nasa.gov>, Hugh LaMaster <lama...@nas.nasa.gov> writes:
|> Sun has announced its new Ultra HPC systems today (11 Mar 97),

|> and according to what I saw, it is a shared-memory/UMA/SMP
|> design with up to 64 processors in each system, with an
|> SCI-based interconnect for connecting systems into clusters.

64? Boggle.

|> Naturally, the question to ask for a "large-scale" shared-memory
|> design (and has been since the 60's - see the old architecture
|> books) is - (drum roll) - memory bandwidth, latency, cost of
|> synchronization, etc. etc. etc.

Quite.

|> Anybody care to characterize the Ultra HPC systems
|> (bandwidth, latency L1/L2 cache, semaphore overhead, etc)
|> and compare it to the older Sun Ultra Enterprise systems

|> as well as other interesting new systems, such as the
|> (NUMA) SGI Origin 2000 series?

Um. Before doing that, do we know what purposes it is targeted
at? I can imagine four obvious possibilities:

1) It is intended for purposes that are 'well-behaved' in an
SMP environment. Whether there are enough to make a market for
such an extreme SMP system is anyone's guess.

2) It is actually a vector machine as the user sees it! I.e.
the compilers will convert vector code to the SMP equivalent, in
so far as this is possible.

3) Sun have made a major breakthrough, of a nature that has
defeated all previous manufacturers and which I cannot easily
imagine.

4) Sun have lost their marbles.

For example, if they have taken the second path, the bandwidth is
critical but the latency and semaphore overhead are less so. And
we know that the technology is feasible, though still in need of
development.

Nick Maclaren,
University of Cambridge Computer Laboratory,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

David Collier-Brown

Mar 12, 1997

Nick Maclaren wrote:
>
> In article <3325DB...@nas.nasa.gov>, Hugh LaMaster <lama...@nas.nasa.gov> writes:
> |> Sun has announced its new Ultra HPC systems today (11 Mar 97),
> |> and according to what I saw, it is a shared-memory/UMA/SMP
> |> design with up to 64 processors in each system, with an
> |> SCI-based interconnect for connecting systems into clusters.
>
> 64? Boggle.

[snip]

> 3) Sun have made a major breakthrough, of a nature that has
> defeated all previous manufacturers and which I cannot easily
> imagine.

In a previous life, I worked for Siemens, who had
a 48-cpu Sequent/Mips box out that turned in respectable
TPC times with all cpus active.

I say respectable as I don't remember the actual numbers,
but the shape of the curve as they went up to 48 cpus was
relatively good.

So it's not a breakthrough, just an evolutionary advance...

--dave (contracting at opcom in toronto) c-b
--
David Collier-Brown, | Always do right. This will gratify some people
185 Ellerslie Ave., | and astonish the rest. -- Mark Twain
Willowdale, Ontario | dav...@hobbes.ss.org, canada.sun.com
N2M 1Y3. 416-223-8968 | http://java.science.yorku.ca/~davecb

Bryan O'Sullivan

Mar 12, 1997

n> 1) It is intended for purposes that are 'well-behaved' in an
n> SMP environment. Whether there are enough to make a market for
n> such an extreme SMP system is anyone's guess.

The answer to this is "yes". There is a significant number of
applications that can scale to a larger number of CPUs, if given the
opportunity.

n> 2) It is actually a vector machine as the user sees it! I.e.
n> the compilers will convert vector code to the SMP equivalent, in
n> so far that this is possible.

Sun's compilers will indeed automatically parallelise loops in the
usual ways.

n> 3) Sun have made a major breakthrough, of a nature that has
n> defeated all previous manufacturers and which I cannot easily
n> imagine.

Well, not really. The Ultra Enterprise 10000 has the key combination
of high bandwidth and low latency, but I think the state of the art
may simply have advanced rather quickly while you weren't looking.

<b

--
Let us pray:
What a Great System. b...@eng.sun.com
Please Do Not Crash. b...@serpentine.com
^G^IP@P6 http://www.serpentine.com/~bos

Greg Lindahl

Mar 12, 1997

In article <3325DB...@nas.nasa.gov>,
Hugh LaMaster <lama...@nas.nasa.gov> wrote:

>Sun has announced its new Ultra HPC systems today (11 Mar 97),
>and according to what I saw, it is a shared-memory/UMA/SMP
>design with up to 64 processors in each system, with an
>SCI-based interconnect for connecting systems into clusters.

I saw a talk about this at CCC97 yesterday afternoon. They do not use
SCI's cache coherency features, so you can treat SCI as being just a
fast network. The rest of the building blocks are just their existing
SMP systems. I'm all for fast networks; does SCI support switched
topologies? Myrinet has 8-port crossbars shipping today. Both
technologies are at similar speeds today, and SCI offers cache
coherency features, which Sun isn't using.

-- g

Nick Maclaren

Mar 12, 1997

In article <87pvx4x...@serpentine.com>,

Bryan O'Sullivan <b...@serpentine.com> wrote:
>
>n> 3) Sun have made a major breakthrough, of a nature that has
>n> defeated all previous manufacturers and which I cannot easily
>n> imagine.
>
>Well, not really. The Ultra Enterprise 10000 has the key combination
>of high bandwidth and low latency, but I think the state of the art
>may simply have advanced rather quickly while you weren't looking.

That is certainly possible, though I have reason to believe that it
isn't the case. The general problem with large SMP systems is that
they degrade catastrophically under load or application size. E.g.,
while certain things fit into the caches, everything goes swimmingly
but, make them just that fraction larger, and the whole system dies.

I am currently posting from a SPARCserver-1000 (with a mere 8 processors)
that does precisely that. When the number of X users exceeds the level
2 cache, the machine dies (and I don't mean just runs slowly). I have
also seen very serious performance problems with other Sun SMP systems
running a single application, when the application's working set reached
a critical size.

The new Sun hardware has VASTLY better memory management, but I don't
think that its performance has increased pro rata with the CPU power
(i.e. 8 times as many much-faster CPUs). But 100 X users is a hard test
of any system :-(

Bart Smaalders

Mar 12, 1997

Nick Maclaren wrote:
>
> In article <87pvx4x...@serpentine.com>,
> Bryan O'Sullivan <b...@serpentine.com> wrote:
> >
> >n> 3) Sun have made a major breakthrough, of a nature that has
> >n> defeated all previous manufacturers and which I cannot easily
> >n> imagine.
> >
> >Well, not really. The Ultra Enterprise 10000 has the key combination
> >of high bandwidth and low latency, but I think the state of the art
> >may simply have advanced rather quickly while you weren't looking.

>

> The new Sun hardware has VASTLY better memory management, but I don't
> think that its performance has increased pro rata with the CPU power
> (i.e. 8 times as many much-faster CPUs).

On an SS10-712 (dual 75 MHz Viking CPUs, 1M Ecache),
memcpy of 1 M runs at 20 MB/sec.
On an Ultra-2x167 (dual 167 MHz Ultra CPUs, 512K Ecache),
the same test runs at 170 MB/sec.

Take a look at the STREAM numbers for an even bigger discrepancy.

- Bart
--
Bart Smaalders Solaris Clustering SunSoft
ba...@cyber.eng.sun.com (415) 786-5335 MS UMPK17-301
http://playground.sun.com/~barts 2550 Garcia Ave
Mt View, CA 94043-1100

Bryan O'Sullivan

Mar 12, 1997

n> That is certainly possible, though I have reason to believe that it
n> isn't the case. The general problem with large SMP systems is that
n> they degrade catastrophically under load or application size.

And this is mostly because of memory bandwidth and latency
limitations. The Ultra Enterprise 10000 has roughly 10 GB/s of memory
bandwidth, which is a whole lot more than the old SS1000 system you're
posting from.

Kai Harrekilde-Petersen

Mar 13, 1997

lin...@rt.com (Greg Lindahl) writes:
> In article <3325DB...@nas.nasa.gov>,
> Hugh LaMaster <lama...@nas.nasa.gov> wrote:
>
> >Sun has announced its new Ultra HPC systems today (11 Mar 97),
> >and according to what I saw, it is a shared-memory/UMA/SMP
> >design with up to 64 processors in each system, with an
> >SCI-based interconnect for connecting systems into clusters.
>
> I saw a talk about this at CCC97 yesterday afternoon. They do not use
> SCI's cache coherancy features, so you can treat SCI as being just a
> fast network. The rest of the building blocks are just their existing
> SMP systems. I'm all for fast networks; does SCI support switched
> topologies?

You can do all sorts of networks with SCI: rings, counter-rotating
rings, meshes, tori, hierarchical networks, whatnot. You just need an
SCI link with the "right" routing stuff in the chip to keep the
latency low.

Right now, the Ultra systems use our 4-port switch, with one Ultra
server on each port.

> Myrinet has 8-port crossbars shipping today. Both technologies are
> at similar speeds today, and SCI offers cache coherancy features,
> which Sun isn't using.

Data General is using SCI's cache coherence in their new NumaLiine
boxes, based on the Intel SHV boards.

Kai
--
Kai Harrekilde-Petersen <k...@dolphinics.no> #include <std/disclaimer.h>
http://www.dolphinics.no/~khp/ Linux: the choice of a GNU generation
"Argue for your limitations, and sure enough - they're yours" --Richard Bach.

Petter Gustad

Mar 13, 1997

lin...@rt.com (Greg Lindahl) writes:

> In article <3325DB...@nas.nasa.gov>,
> Hugh LaMaster <lama...@nas.nasa.gov> wrote:
>
> >Sun has announced its new Ultra HPC systems today (11 Mar 97),
> >and according to what I saw, it is a shared-memory/UMA/SMP
> >design with up to 64 processors in each system, with an
> >SCI-based interconnect for connecting systems into clusters.
>
> I saw a talk about this at CCC97 yesterday afternoon. They do not use
> SCI's cache coherancy features, so you can treat SCI as being just a
> fast network. The rest of the building blocks are just their existing
> SMP systems. I'm all for fast networks; does SCI support switched

> topologies? Myrinet has 8-port crossbars shipping today. Both

Yes, SCI does support switched topologies. See

http://www.dolphinICS.no/Products/Switches.html

This switch has been shipping for almost years now. Newer SCI switches
are currently under development by Dolphin and other vendors.

> technologies are at similar speeds today, and SCI offers cache
> coherancy features, which Sun isn't using.

Petter
--
________________________________________________________________________
Petter Gustad http://www.dolphinICS.no/~pegu

Nick Maclaren

Mar 13, 1997

In article <332734...@eng.sun.com>,

Bart Smaalders <ba...@eng.sun.com> wrote:
>Nick Maclaren wrote:
>>
>> In article <87pvx4x...@serpentine.com>,
>> Bryan O'Sullivan <b...@serpentine.com> wrote:
>> >
>> >n> 3) Sun have made a major breakthrough, of a nature that has
>> >n> defeated all previous manufacturers and which I cannot easily
>> >n> imagine.
>> >
>> >Well, not really. The Ultra Enterprise 10000 has the key combination
>> >of high bandwidth and low latency, but I think the state of the art
>> >may simply have advanced rather quickly while you weren't looking.
>
>> The new Sun hardware has VASTLY better memory management, but I don't
>> think that its performance has increased pro rata with the CPU power
>> (i.e. 8 times as many much-faster CPUs).
>
>On a SS10-712 (dual 75Mhz Viking cpus, 1M Ecache),
> memcpy of 1 M runs at 20 MB/sec
>On a Ultra-2x167, (dual 167 Mhz Ultra cpus, 512K Ecache),
> the same test runs at 170 MB/sec.
>
>Take a look at the streams numbers for an even bigger discrepancy.

Thanks for the figures. That confirms my suspicions. The latter is
only 8.5 times faster than the former, but 64 processors of the latter
are 17.8 times faster than 8 processors of the former. Therefore
store-limited software has to have its cache miss ratio reduced by a factor
of 2 to achieve the promised performance increase. And that is HARD.

However, the real point is that bandwidth is the key only for vectorised
programs and ones that communicate using large buffers. Typical general
applications are more limited by latency, and I will bet that hasn't
improved by 8 to 1!
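
The ratios in this post can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 17.8x figure means aggregate clock rate (64 CPUs at 167 MHz against 8 CPUs at 75 MHz, with the clock rates taken from Bart's figures). The post does not say so explicitly, so treat that reading as an interpretation.

```python
# Bart's memcpy figures (MB/sec) and CPU clock rates (MHz), as quoted above.
OLD_MEMCPY, NEW_MEMCPY = 20.0, 170.0
OLD_MHZ, NEW_MHZ = 75.0, 167.0
OLD_CPUS, NEW_CPUS = 8, 64


def speedup_gap():
    """Return (bandwidth ratio, aggregate-clock ratio, shortfall factor)."""
    bandwidth_ratio = NEW_MEMCPY / OLD_MEMCPY
    # Assumption: "17.8 times faster" is CPU count multiplied by clock rate.
    cpu_ratio = (NEW_CPUS * NEW_MHZ) / (OLD_CPUS * OLD_MHZ)
    return bandwidth_ratio, cpu_ratio, cpu_ratio / bandwidth_ratio


bw, cpu, gap = speedup_gap()
print(f"memcpy bandwidth ratio : {bw:.1f}x")
print(f"aggregate clock ratio  : {cpu:.1f}x")
print(f"shortfall to make up   : {gap:.1f}x")
```

The last number is the "factor of 2" by which the cache miss ratio would have to drop under this reading.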

David Collier-Brown

Mar 13, 1997

David Collier-Brown wrote:
> In a previous life, I worked for Siemens, who had
> a 48-cpu Sequent/Mips box out that turned in respectable
> TPC times with all cpus active.

ERROR!! ERRROR!!! RRRRRRRR!!!!
It was a Pyramid box, not a Sequent. (Siemens
owned/owns a big hunk of Pyramid)

The rest of the comment was true, though.

--dave c-b

Del Cecchi

Mar 13, 1997

So the "SCI" interconnect being used on the new Sun systems really operates at 1
GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
release on Sun's web page. 1Gb/second doesn't seem very fast.
--

Del Cecchi
Personal Opinions Only

Petter Gustad

Mar 13, 1997

Petter Gustad <pe...@dolphinics.no> writes:

> lin...@rt.com (Greg Lindahl) writes:
>
> > In article <3325DB...@nas.nasa.gov>,
> > Hugh LaMaster <lama...@nas.nasa.gov> wrote:
> >
> > >Sun has announced its new Ultra HPC systems today (11 Mar 97),
> > >and according to what I saw, it is a shared-memory/UMA/SMP
> > >design with up to 64 processors in each system, with an
> > >SCI-based interconnect for connecting systems into clusters.
> >
> > I saw a talk about this at CCC97 yesterday afternoon. They do not use
> > SCI's cache coherancy features, so you can treat SCI as being just a
> > fast network. The rest of the building blocks are just their existing
> > SMP systems. I'm all for fast networks; does SCI support switched
> > topologies? Myrinet has 8-port crossbars shipping today. Both
>
> Yes, SCI does support switched topologies. See
>
> http://www.dolphinICS.no/Products/Switches.html
>
> This switch has been shipping for almost years now. Newer SCI switches
                                          ^
                                         two

Bryan O'Sullivan

Mar 13, 1997

n> Thanks for the figures. That confirms my suspicions.

Actually, Bart's numbers don't confirm them; they merely offer a hint
of what you might expect. Both sets of numbers are for workstations, each
of which has a substantially different memory subsystem from the
servers that use the same kind of CPU.

For a slightly more accurate appraisal, consider STREAM benchmark
results. A 4-way SS1000 gets numbers of 118.8, 121.7, 127.6, and
132.0 for copy, scale, add and triad, respectively; a 4-way UE6000
gets 674.2, 681.0, 680.3, and 711.3. Note that neither system comes
close to maxing out its memory subsystem in these configurations, but
they were the most accessible numbers to hand.
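
For readers who have not seen the benchmark, here is a minimal sketch of the four kernels and of how a MB/s figure is derived from bytes moved and elapsed time. It is not the official STREAM code (which is C and Fortran). The array size and the function name `stream_kernels` are made up for illustration, and in pure Python the rates mostly measure interpreter overhead rather than memory bandwidth.

```python
import time

WORD_BYTES = 8  # one double-precision element


def stream_kernels(n=200_000, q=3.0):
    """Run copy, scale, add and triad once each; return (MB/s rates, arrays)."""
    a, b, c = [1.0] * n, [2.0] * n, [0.0] * n
    rates = {}

    def timed(name, words_per_element, kernel):
        start = time.perf_counter()
        kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
        rates[name] = words_per_element * WORD_BYTES * n / elapsed / 1e6

    def copy():
        c[:] = a                                   # c = a

    def scale():
        b[:] = [q * x for x in c]                  # b = q * c

    def add():
        c[:] = [x + y for x, y in zip(a, b)]       # c = a + b

    def triad():
        a[:] = [y + q * z for y, z in zip(b, c)]   # a = b + q * c

    # STREAM counts 2 words moved per element for copy/scale, 3 for add/triad.
    timed("copy", 2, copy)
    timed("scale", 2, scale)
    timed("add", 3, add)
    timed("triad", 3, triad)
    return rates, (a, b, c)


rates, _ = stream_kernels()
for name, mbps in rates.items():
    print(f"{name:5s} {mbps:10.1f} MB/s")
```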

Note also that you *still* can't make statements about the UE10000
based on these numbers, because the 10000 has a substantially
different memory subsystem from the 6000, and we have not yet
published STREAM results for the 10000 (and no, I don't know why).

<b

--
Let us pray:
What a Great System.

Please Do Not Crash.
^G^IP@P6

Alan Charlesworth

Mar 13, 1997

In article <5g8g0c$6...@lyra.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
Maclaren) wrote:

The multiprocessor latency (the time per memory access when all processors
in the system are cache missing doing a Stream-type loop) has improved
significantly.

SS1000
processors   time per memory access
     1            1,350 ns
     8            2,500 ns

Starfire (UE/HPC 10000)
processors   time per memory access
     1              250 ns
     8              260 ns
    64              400 ns

The SS1000 has a single 40 MHz 8-byte wide shared address and data XDBus.
It is rather easily saturated, so the latency rises quickly as processors
are added.

The Starfire has four 41.6 MHz coherency paths, and sixteen 83.3 MHz
16-byte wide data paths. The latency curve is rather flat as processors are
added.
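
One way to connect these figures to the peak of 10,667 MBps quoted elsewhere in this thread is to assume that each coherency path handles one 64-byte cache block per cycle. That assumption is not stated in this post, and the exact clock value of 125/3 MHz behind "41.6 MHz" is also a guess, so read this as a plausibility check only.

```python
CACHE_BLOCK_BYTES = 64       # block size mentioned with the VIS block loads
COHERENCY_PATHS = 4          # "four 41.6 MHz coherency paths"
COHERENCY_MHZ = 125.0 / 3.0  # assumed exact value behind "41.6 MHz"


def snoop_limited_peak_mbps(paths=COHERENCY_PATHS,
                            mhz=COHERENCY_MHZ,
                            block_bytes=CACHE_BLOCK_BYTES):
    """Peak MB/s if every coherency cycle on every path moves one block."""
    return paths * mhz * block_bytes


print(f"snoop-limited peak: {snoop_limited_peak_mbps():,.0f} MBps")
```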

The Starfire has about 5.4x better time per memory operation for a single
processor, about 9.6x better for 8 processors in each system, and about
6.2x for fully loaded configurations (8 versus 64 procs).
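
The improvement factors follow directly from the two tables above; a short check, using only the numbers given in this post:

```python
# Time per memory access in ns, keyed by processor count (from the tables).
SS1000_NS = {1: 1350, 8: 2500}
STARFIRE_NS = {1: 250, 8: 260, 64: 400}

improvement = {
    "1 vs 1 processor": SS1000_NS[1] / STARFIRE_NS[1],
    "8 vs 8 processors": SS1000_NS[8] / STARFIRE_NS[8],
    "fully loaded (8 vs 64)": SS1000_NS[8] / STARFIRE_NS[64],
}

# How much latency grows from one processor to a full system.
growth = {
    "SS1000, 1 -> 8": SS1000_NS[8] / SS1000_NS[1],
    "Starfire, 1 -> 64": STARFIRE_NS[64] / STARFIRE_NS[1],
}

for label, ratio in {**improvement, **growth}.items():
    print(f"{label:24s} {ratio:.2f}x")
```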

Fred True

Mar 13, 1997

In article <87pvx4x...@serpentine.com>,
Bryan O'Sullivan <b...@serpentine.com> wrote:

>n> 1) It is intended for purposes that are 'well-behaved' in an
>n> SMP environment. Whether there are enough to make a market for
>n> such an extreme SMP system is anyone's guess.
>
>The answer to this is "yes". There is a significant number of
>applications that can scale to a larger number of CPUs, if given the
>opportunity.

I posted this a while ago, but we've had a gazillion problems with our
outbound news feed, so I don't think it ever made it. I'm reposting it since
it seems relevant to what sort of market there is for the Starfire system
(E10k).

Fred
--

In article <5e7dii$h...@gremlin.backfire.mn.org>,
Charles <cha...@backfire.mn.org> wrote:
>Brad Carlile (brad.c...@Sun.COM) wrote:
>: SPEC is coming soon. TPC-C as you know takes a LOT of equipment to
>: drive the benchmark at a very high rate. You'll see world records
>: coming out on this box once a month or so. We are working
>: feverishly since we just announced the system and demand is high for
>: the box. You'll see more soon.
>
>Could you give some numbers as to how many firm orders you have? I'm curious
>to know who's buying million dollar boxes these days.

My group in AT&T recently purchased two Starfire systems, both fairly well
"loaded" (64 processors, 32GB memory, etc.). They are being installed over
the next few weeks. The telco market is obviously a natural fit for large
data handling systems. It's still predominantly an MVS niche, but
there is a steady focus on open systems migration.

Large commercial applications have always been heavy consumers of large
("million dollar") systems (including telco, financial, and retail). E.g.
American Express owns a CM-5, MCI owns a large SP2 and several CS6400s,
Wal-Mart owns a large NCR 3600 teradata, Sumitomo Bank owns several CS6400s,
etc.

I know of two other groups in AT&T who are considering Starfire systems for
large business data applications. In addition, I know of at least 3 other
companies who have purchased these systems.

We ran extensive benchmarks of our custom-developed applications and are
extremely pleased with the results on the Starfire. We also performed a
detailed architecture taxonomy during the course of our development to
decide which computing paradigm best fit our application needs (MPP, SMP,
CC-NUMA, etc.). Our application and projected growth were easily large
enough to force us out of the "single SMP" system arena, so some sort of
distributed system was inevitable.

The outcome is that we prefer to handle some data locality at a high level
ourselves (e.g. data partitioning), and exploit the proven benefits of SMP
at a lower level (intra-node parallelism). It is deterministic enough to
manage SMP tuning inside each cluster node (cache/processor data locality via
application architecture, judicious compiler optimizations, and data
structure management). To manage the partitioning at a high level there is
an obvious tradeoff of complexity and manageability versus performance. A
"low" node count in a clustered arrangement allows us to keep the
partitioning algorithms statistically favorable, while mitigating the
sensitivity of the cluster to data or resource skew (problems which steered
us clear of systems like the NCR 5100M or IBM SP2).

In order to keep the node count low, we obviously had to provide a solid
bisection bandwidth between nodes so that data redistribution and key
inversion operations (remember, this is a database) could meet our
performance goals. Fibre channel interfaces from Ancor, as well as SCI
products from Dolphin have allowed us to do this.

Using heuristics of "low node count," "high capacity nodes," "demonstrated
intra-node linear scalability," and "programming environment," among others,
the Starfire was a clear winner. The Origin2000 was certainly a contender,
but we were too leery of data locality issues, and there were other factors
(price, for example).

I certainly would not categorize our application as "typical" - I think that
a lot of people will consider the Starfire because their application will
fit inside a single SMP node, where they can take advantage of time-proven
optimization techniques, and rely on the Starfire's impressive per-processor
scalability (which Brad's latency numbers indicate). Whether larger
applications decide to trust a DSM implementation's locality
characteristics, or stick to explicit (e.g. MPP) parallelism will depend a
lot on specific application needs, as well as people's perception of these
architectures.

Don't forget, also, that characteristics of price/performance, reliability,
and industry knowledge base can also significantly influence purchase
decisions of machines of this magnitude. The Starfire presented a very good
mix of these characteristics.

After seeing all the scalability, latency, locality, and benchmark
discussions primarily between vendors and speculators, I thought some
sincere "customer" viewpoint might be warranted.

Regards,
Fred

--
Fred True "My name is Ozymandias, King of Kings:
AT&T World Class Database Development Look on my works, ye Mighty,
f...@angels.att.com and despair!"
-P. B. Shelley

John McCalpin

Mar 13, 1997

In article <dpzwwrb...@eng.sun.com>,

Bryan O'Sullivan <b...@eng.sun.com> wrote:
>
>Note also that you *still* can't make statements about the UE10000
>based on these numbers, because the 10000 has a substantially
>different memory subsystem from the 6000, and we have not yet
>published STREAM results for the 10000 (and no, I don't know why).

I have some numbers for the UE 10000 (64-cpu only), but my
understanding is that they were preliminary, so I was waiting
for the rest of the numbers before putting them in the table.

I guess I should follow up on this and find out if I misunderstood
the intent of the message I received from Sun. I was certainly
hoping to get some numbers from smaller processor counts on the
UE10000 as well.

Since the numbers were posted to USENET, I will repeat them here:

Excerpt from Alan Charlesworth <al...@West.Sun.COM>

>Standard results
>----------------
> Sunpro C 4.0, pragmas used to force auto-loop parallelization, 63
>processors, 63 million elements = 1.5 GB of memory.
>
>                     Total
>                   including
>        Official     write
>         STREAM    allocates   % of
>          MB/s       MB/s      peak
>Copy :    6307       9461       89%
>Scale:    6391       9586       90%
>Vadd :    7203       9604       90%
>Triad:    7197       9596       90%
>
>Parallel C doesn't quite saturate the interconnect.

VIS-coded kernels were able to drive the network to saturation
(95-97% of peak, which is certainly close enough) and eliminate
the write-allocate traffic at the same time.

Rob Young

Mar 13, 1997

In article <5g96c0$g...@newsb.netnews.att.com>, f...@angels.att.com (Fred True) writes:

> structure management). To manage the partitioning at a high level there is
> an obvious tradeoff of complexity and manageability versus performance. A
> "low" node count in a clustered arrangement allows us to keep the
> partitioning algorithms statistically favorable, while mitigating the
> sensitivity of the cluster to data or resource skew (problems which steered
> us clear of systems like the NCR 5100M or IBM SP2).
>
> In order to keep the node count low, we obviously had to provide a solid
> bisection bandwidth between nodes so that data redistribution and key
> inversion operations (remember, this is a database) could meet our
> performance goals. Fibre channel interfaces from Ancor, as well as SCI
> products from Dolphin have allowed us to do this.
>
> Using heuristics of "low node count," "high capacity nodes," "demonstrated
> intra-node linear scalability," and "programming environment," among others,
> the Starfire was a clear winner. The Origin2000 was certainly a contender,
> but we were too leary of data locality issues, and there were other factors
> (price, for example).
>

But why not do an end around on the problem of "node locality"
and have the nodes share the same system bus/backplane?

They then communicate at SMP speed!! Now I/O bottlenecks over
the VERY HIGH bandwidth bus are a thing of the past!!!

What am I describing? Digital's Galaxy Software Architecture.
Multiple VMS nodes inside the same box. Digital hinted at tpcs
in the *7 figures* at last December's DECUS.

And yes, telcos are VERY VERY interested. ;-)
When? Next Year.

Apparently, an SMP bus makes the BEST node interconnect
there is.

Rob

Steve Kappel

Mar 13, 1997

In article <5g97nl$17...@news.rchland.ibm.com>,

cec...@signa.rchland.ibm.com (Del Cecchi) writes:
>So the "SCI" interconnect being used on the new sun systems really operates at 1
>GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
>release on Sun's web page. 1Gb/second doesn't seem very fast.

Doesn't seem fast compared to what?

I don't recall seeing 1 GB/sec referenced anywhere.
It is 1 Gb/sec - the speed in each direction so there is 200 MB/sec
bandwidth on one channel. The interconnect page says this
is an order of magnitude faster than any currently-available
competing technology.

You can stack multiple channels to get aggregate bandwidths.

Also, remember that this is a cluster channel, not processor
to memory bandwidth. An E10k has a system bus bandwidth
of 12.8 GB(ytes)/sec for its 64 processors.

See http://www.sun.com/hpc/tech/interconnect.html
http://www.sun.com/hpc/products/comp_chart.html
http://www.sun.com/hpc/products/index.html

--
______________________________________________________________________
Steve Kappel http://www.winternet.com/~skappel
steve....@iname.com

Alan Charlesworth

Mar 13, 1997

In article <5g9mqi$k...@murrow.corp.sgi.com>, mcca...@asd.sgi.com wrote:

>
> I have some numbers for the UE 10000 (64-cpu only), but my
> understanding is that they were preliminary, so I was waiting
> for the rest of the numbers before putting them in the table.
>
> I guess I should follow up on this and find out if I misunderstood
> the intent of the message I received from Sun. I was certainly
> hoping to get some numbers from smaller processor counts on the
> UE10000 as well.
>
> Since the numbers were posted to USENET, I will repeat them here:
>

> omitted

> --
> John D. McCalpin, Ph.D. Supercomputing Performance Analyst
> Scalable Systems Group http://reality.sgi.com/employees/mccalpin
> Silicon Graphics, Inc. mcca...@sgi.com 415-933-7407

Sorry John, for my not getting these out to the public sooner. Here are the
Starfire Stream results that I ran at the end of January.

1. Auto-parallel C Stream bandwidth
        Copy   Scale    Vadd   Triad
Cpus    MBps    MBps    MBps    MBps
   1     164     164     202     202
   8   1,271   1,270   1,544   1,546
  16   2,371   2,414   2,942   2,905
  24   3,568   3,577   4,292   4,305
  32   4,397   4,408   5,166   5,188
  40   5,317   5,374   6,162   6,222
  48   5,961   6,056   6,861   6,914
  56   6,183   6,304   7,131   7,128
  63   6,307   6,391   7,203   7,197

2. Auto-parallel C total interconnect bandwidth

These are the Table 1 numbers, multiplied by 3/2 for copy
and scale, and 4/3 for vadd and triad -- to account for
write-allocate traffic on the interconnect. They are
useful to compare against the peak bandwidth of 10,667 MBps.

        Copy   Scale    Vadd   Triad
Cpus    MBps    MBps    MBps    MBps
   1     246     246     269     269
   8   1,907   1,905   2,059   2,062
  16   3,557   3,620   3,922   3,873
  24   5,353   5,366   5,722   5,740
  32   6,595   6,612   6,888   6,917
  40   7,976   8,062   8,215   8,296
  48   8,942   9,083   9,148   9,219
  56   9,274   9,456   9,508   9,505
  63   9,461   9,586   9,604   9,596
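
The 3/2 and 4/3 factors come from counting one extra block read for every store under write-allocate. A small sketch of that bookkeeping, applied to the 63-CPU row; the load/store counts per kernel are the usual STREAM ones, and the helper names are made up for illustration:

```python
PEAK_MBPS = 10_667

# (loads, stores) per loop iteration, in words, for each kernel.
KERNELS = {"copy": (1, 1), "scale": (1, 1), "vadd": (2, 1), "triad": (2, 1)}

# Official STREAM results for 63 CPUs, from Table 1.
OFFICIAL_63 = {"copy": 6307, "scale": 6391, "vadd": 7203, "triad": 7197}


def write_allocate_factor(loads, stores):
    """Ratio of actual interconnect traffic to the traffic STREAM counts."""
    counted = loads + stores
    actual = loads + 2 * stores  # each store also reads the block in first
    return actual / counted


def total_traffic(kernel):
    """Total interconnect MB/s for a kernel at 63 CPUs."""
    return OFFICIAL_63[kernel] * write_allocate_factor(*KERNELS[kernel])


for name in KERNELS:
    total = total_traffic(name)
    print(f"{name:5s} {total:8.0f} MBps  {100 * total / PEAK_MBPS:4.0f}% of peak")
```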

3. VIS assembler "experimental" Stream bandwidth

The SPARC Visual Instruction Set (VIS) includes
block load and store instructions which move between a
64-byte aligned block of memory and eight floating-point
registers. Because an entire cache-block is accessed, no
extra write-allocate traffic is necessary on the interconnect.

Compared to Table 2, my VIS assembler code loops
get a bit more total interconnect traffic
outstanding than the stock C code did.

        Copy   Scale    Vadd   Triad
Cpus    MBps    MBps    MBps    MBps
   1     325     322     288     263
   8   2,499   2,491   2,252   2,099
  16   4,527   4,669   4,243   3,944
  24   6,720   6,759   6,156   5,860
  32   7,872   7,987   7,377   7,092
  40   9,277   9,355   8,877   8,594
  48   9,938   9,917   9,618   9,373
  56  10,250  10,175  10,030   9,910
  63  10,307  10,180  10,181  10,107

Mark Hahn

Mar 14, 1997

: 1. Auto-parallel C Stream bandwidth

:         Copy   Scale    Vadd   Triad
: Cpus    MBps    MBps    MBps    MBps
:    1     164     164     202     202
:    8   1,271   1,270   1,544   1,546
:   16   2,371   2,414   2,942   2,905
:   24   3,568   3,577   4,292   4,305
:   32   4,397   4,408   5,166   5,188
:   40   5,317   5,374   6,162   6,222
:   48   5,961   6,056   6,861   6,914
:   56   6,183   6,304   7,131   7,128
:   63   6,307   6,391   7,203   7,197

these are impressively big numbers. but am I correct in believing
that those 64 processors were working on completely independent
chunks of ungodly-long vectors? and that the vector lengths are
fixed at compile-time? if so, how does it run on a more general
(and realistic) variable-length vector? what's the startup cost
for unleashing this horde of processors? what are the alignment/length
restrictions on the data that determine which processors touch
which memory?

it would be nice if stream required some basic information in a
disclosure, like Spec does. like flags and vector length.

further, how well does it do on data that's not boringly regular,
like, say a big FFT? or better yet, something where accesses are
even less patterned.

finally, just how useful are these numbers? with sufficient
motivation, I expect I could configure a gym full of cheap PCs
to deliver this kind of bandwidth - 112 MB/s per processor is nothing
to write home about...
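
The per-processor figure can be checked against the quoted triad column. The 112 MB/s appears to be the 63-CPU result divided by 64; dividing by the 63 CPUs actually used gives about 114. The sketch below also shows per-CPU bandwidth relative to the single-CPU run, which is one rough way to read the scaling. The function names are made up for illustration.

```python
# Triad MBps by CPU count, from the quoted auto-parallel C table.
TRIAD_MBPS = {1: 202, 8: 1546, 16: 2905, 24: 4305, 32: 5188,
              40: 6222, 48: 6914, 56: 7128, 63: 7197}


def per_cpu(cpus):
    """Average triad bandwidth per CPU, in MB/s."""
    return TRIAD_MBPS[cpus] / cpus


def efficiency(cpus):
    """Per-CPU bandwidth as a fraction of the single-CPU result."""
    return per_cpu(cpus) / TRIAD_MBPS[1]


for cpus in TRIAD_MBPS:
    print(f"{cpus:3d} CPUs: {per_cpu(cpus):6.1f} MB/s per CPU, "
          f"{100 * efficiency(cpus):5.1f}% of single-CPU rate")
```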

thanks, mark hahn.
--
operator may differ from spokesperson. ha...@neurocog.lrdc.pitt.edu
http://neurocog.lrdc.pitt.edu/~hahn/

Del Cecchi

Mar 14, 1997

In article <1997Mar13.160620.1@eisner>, you...@eisner.decus.org (Rob Young) writes:

|>
|> But why not do an end around on the problem of "node locality"
|> and have the nodes share the same system bus/backplane?
|>
|> They then communicate at SMP speed!! Now I/O bottlenecks over
|> the VERY HIGH bandwidth bus are a thing of the past!!!
|>
|> What am I describing? Digital's Galaxy Software Architecture.
|> Multiple VMS nodes inside the same box. Digital hinted at tpcs
|> in the *7 figures* at last Decembers DECUS.
|>
|> And yes, telcos are VERY VERY interested. ;-)
|> When? Next Year.
|>
|> Apparently, an SMP bus makes the BEST node interconnect
|> there is.
|>
|> Rob
|>

A hierarchical bus system can work OK, so long as the nodes are only a few inches
apart, and there aren't too many of them. And of course the best SMP busses are
split response and often have a crossbar data switch. And for optimum signal
integrity and maximum bandwidth point to point wiring rather than multidrop nets
are preferred.

Sort of hard to power and cool all those 100 watt chips without chilled water and
plumbing, and I know a few people who found out how much customers like that.
They used to live near the Hudson River.

Can't wait to see the new machine.

Alan Charlesworth

Mar 14, 1997, 3:00:00 AM

In article <5gauot$g...@usenet.srv.cis.pitt.edu>,
ha...@neurocog.lrdc.pitt.edu (Mark Hahn) wrote:

> : 1. Auto-parallel C Stream bandwidth
> : Copy Scale Vadd Triad
> : Cpus MBps MBps MBps MBps
> : 1 164 164 202 202
> : 8 1,271 1,270 1,544 1,546
> : 16 2,371 2,414 2,942 2,905
> : 24 3,568 3,577 4,292 4,305
> : 32 4,397 4,408 5,166 5,188
> : 40 5,317 5,374 6,162 6,222
> : 48 5,961 6,056 6,861 6,914
> : 56 6,183 6,304 7,131 7,128
> : 63 6,307 6,391 7,203 7,197
>
> these are impressively big numbers. but am I correct in believing
> that those 64 processors were working on completely independent
> chunks of ungodly-long vectors? and that the vector lengths are
> fixed at compile-time? if so, how does it run on a more general
> (and realistic) variable-length vector? what's the startup cost
> for unleashing this horde of processors? what are the alignment/length
> restrictions on the data that determine which processors touch
> which memory?

Yes, Stream is an embarrassingly parallel benchmark!

The vectors were 2048K elements long per processor up to 32 processors, and
1024K long per processor after that. The arrays were aligned on a
doubleword (8-byte) granularity, with no gaps between the arrays. I used
malloc to create the arrays, so I could avoid recompiling for each
processor count.
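
To give a feel for the sizes involved, here is a rough sketch of the total array footprint at a few processor counts. It assumes 8-byte elements (suggested by the doubleword alignment), three arrays as in the usual Stream source, "K" meaning 1024, and "up to 32 processors" meaning 32 inclusive. None of these are spelled out above, and the function names are made up for illustration.

```python
ELEM_BYTES = 8    # assumed: 64-bit floating-point elements
NUM_ARRAYS = 3    # assumed: the three Stream arrays
K = 1024          # assumed: "K" means 1024 elements

def elems_per_cpu(cpus):
    """Per-processor vector length: 2048K up to 32 cpus, 1024K beyond."""
    return (2048 if cpus <= 32 else 1024) * K

def footprint_mib(cpus):
    """Total size of all arrays, in MiB, for a given processor count."""
    total_bytes = cpus * elems_per_cpu(cpus) * ELEM_BYTES * NUM_ARRAYS
    return total_bytes / 2**20

for cpus in (1, 32, 40, 63):
    print(f"{cpus:2d} cpus: {footprint_mib(cpus):6.0f} MiB total")
```

Under those assumptions each processor works on 24 to 48 MiB of data, so the loops presumably run out of main memory rather than cache.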

I used SunPro C 4.0, with the following switches:
cc -xCC -fast -dalign -xO5 -xchip=ultra
-xrestrict=%all -xparallel -xreduction -xloopinfo

I used pragmas to force the C compiler to parallelize the loops. Remember,
Stream loops are all memory bound, so compiler optimization makes little
difference. Stream is by intent _not_ a compiler test.

I agree that it would be nice to see the vector length recorded in the
Stream result database.

>
> it would be nice if stream required some basic information in a
> disclosure, like Spec does. like flags and vector length.
>
> further, how well does it do on data that's not boringly regular,
> like, say a big FFT? or better yet, something where accesses are
> even less patterned.
>
> finally, just how useful are these numbers? with sufficient
> motivation, I expect I could configure a gym full of cheap PCs
> to deliver this kind of bandwidth - 112 MB/s per processor is nothing
> to write home about...
>
> thanks, mark hahn.
> --

The Stream benchmark may not be perfect, but it is the only public
memory/interconnect micro-benchmark that we have. Let's give John McCalpin
lots of credit for that. One feature of Stream is that it is a simple
benchmark, and hence easy to run without spending an eternity optimizing.
John has collected 378 results at http://www.cs.virginia.edu/stream/.

It _would_ be interesting to add a couple of patterns other than
embarrassingly parallel stride-1 vectors to Stream. John has posted the
following on the Stream web site:

"Extensions of the STREAM benchmark for the future are currently being
considered. The main issues that need to be addressed are:

1. Memory Hierarchies: STREAM needs to be extended to measure
bandwidths at each level of the memory hierarchy.

2. Latency: Bandwidth and Latency are a powerful pair of descriptors for
memory systems -- Latency measurements should be added.

3. Access Patterns: Currently STREAM measures only unit-stride
performance. This is easy and sensible, but non-unit stride and
irregular/indirect performance are an important piece of the memory
system performance picture.

4. Locality: Many new machines are being developed with physically
distributed main memory. STREAM may be enhanced to measure
bandwidth/latency between "nodes" of distributed shared memory
systems."

Helping to make Stream a better memory benchmark would make a useful
discussion on comp.benchmarks.

As for aggregates of PCs, I personally believe that Stream results _should_
only be for compiled code running in parallel, rather than for aggregate
runs of many single-thread processes -- but John is the arbiter of the
rules.

Since Starfire is a uniform-memory-access (UMA) system, it would do
relatively better than NUMA/cluster systems on memory benchmarks that had
some data sharing. Stream represents a best case for NUMA machines, since
the vector lengths can be sized to fit in local memory.

John McCalpin

Mar 14, 1997, 3:00:00 AM

In article <alanc-ya02408000...@engnews2.eng.sun.com>,

Alan Charlesworth <al...@west.sun.com> wrote:
>
>As for aggregates of PCs, I personally believe that Stream results _should_
>only be for compiled code running in parallel, rather than for aggregate
>runs of many single-thread processes -- but John is the arbiter of the
>rules.

I can be persuaded to change the rules if you come up with a
convincing argument.

The split of the result into three categories
standard
experimental and/or hand-coded
32-bit
was a change of the original rules, based on other people's input.

The new Sun results and some new SGI results are in the master
tables for STREAM (which I keep on my workstation), and should migrate
to the U.Va. mirror site in the wee hours of tomorrow morning.

Be sure to look at both the "Standard" and "Experimental" tables,
since I have new results in each.

In particular, the VIS-coded Sun 10000 results are in the Experimental
tables, as are the 64, 96, and 120 cpu SGI/Cray Origin 2000 results.

These SGI/Cray results are not assembly language coded, but they are in
the "experimental" tables because we have not yet formally released the
64 cpu or larger machines to the field. As soon as we do, I will shift
the results to the standard table.

>Since Starfire is a uniform-memory-access (UMA) system, it would do
>relatively better than NUMA/cluster systems on memory benchmarks that had
>some data sharing.

That certainly remains to be seen, now, doesn't it? Since the only
other benchmark result that Sun has released is LINPACK NxN (which is less
useful in the HPC space than STREAM, IMHO), we still don't know about
sharing issues.

Memory system microbenchmarks should look pretty good on the Sun,
but my personal guess is that the SGI/Cray Origin 2000 will have about a
2:1 price/performance advantage over the UltraHPC 10000 for more
realistic HPC codes like the NAS parallel benchmarks.

Can we expect any real HPC benchmarks out of Sun?
Even SPECfp_rate numbers for the 10000 would be nice....

>Stream represents a best case for NUMA machines, since
>the vector lengths can be sized to fit in local memory.

That is sort of a funny way to look at it. Since STREAM is often set
up to use all the cpus, there is no other memory in the machine to use
if you overflow the local memories.

The DSM approach is nice because any node can use any/all memory in the
machine if that is what you want to do. So we get the scalable (local)
bandwidth of an MPP without the usage restrictions of the MPPs.

But you are correct that performance will degrade slightly if the
data is off-node. STREAM Triad drops about 12% when all the data
is being read from and written to a node that is a couple of
router hops away.
--

Fred True

Mar 14, 1997, 3:00:00 AM

In article <1997Mar13.160620.1@eisner>,

Rob Young <you...@eisner.decus.org> wrote:
>
> But why not do an end around on the problem of "node locality"
> and have the nodes share the same system bus/backplane?

To quote Richard Feynman: "For a successful technology, reality must take
precedence over public relations, for nature cannot be fooled" (although in
this case I think "marketing" can be substituted for "public relations").

My point is that the latest SMP backplanes and centerplanes are being pushed
to their physical limits as far as signal timings are concerned. In order
to keep latency low and flat, there are tangible limits on the trace sizes
and properties on the bus/interconnect/crossbar signal paths. The Sun BSD
folks did a great job of this on the starfire, keeping within calculated
latency (and price) design goals. If you want to extend this to more
processors, tradeoffs must be made: tolerance for higher latencies being the
primary of these.

Sure, Sun could glue together a few fast SMP centerplanes (like the gigaplane
or the starfire's crossbar), but then they'd have to introduce tolerance for
longer latencies when data is not local to the "home node" and while this
would allow more processors in the game, it does noticeable things to the
worst-case latencies, and this of course can cascade to cause processor
pipeline stalls, etc. You could also employ clever packaging and cooling
technology to try to minimize signal path delays, as Cray was famous for.

Extend the concept further and you have the basis of something quite like
the Origin, S3.mp, T3E, etc. The Origin is a very good example of this
other approach (of allowing and working around longer latencies), and as
SGI's latency and STREAM numbers have shown, it is quite successful at it.

> They then communicate at SMP speed!! Now I/O bottlenecks over
> the VERY HIGH bandwidth bus are a thing of the past!!!

Would that this were true. The only thing that will make interconnect
latency/bottlenecks a thing of the past for cache un-friendly applications
would be to invert the processor speed curve over the last few years. Stick
a bunch of Z80 processors in an Origin or Starfire design and I'm sure the
interconnect would not be anyone's problem.

> What am I describing? Digital's Galaxy Software Architecture.
> Multiple VMS nodes inside the same box.

Oh goodie, and it's VMS too. (sorry, I do like VMS, I grew up on it, but it
is not pervasive in the, or at least THIS, corporate world).

Alan Charlesworth

Mar 14, 1997, 3:00:00 AM

In article <5gc9rl$2...@murrow.corp.sgi.com>, mcca...@asd.sgi.com wrote:

> In article <alanc-ya02408000...@engnews2.eng.sun.com>,
> Alan Charlesworth <al...@west.sun.com> wrote:
> >
> >As for aggregates of PCs, I personally believe that Stream results _should_
> >only be for compiled code running in parallel, rather than for aggregate
> >runs of many single-thread processes -- but John is the arbiter of the
> >rules.
>
> I can be persuaded to change the rules if you come up with a
> convincing argument.
>
> The split of the result into three categories
> standard
> experimental and/or hand-coded
> 32-bit
> was a change of the original rules, based on other people's input.

Well OK then, how about adding two columns in your database of results at
ftp://ftp.cs.virginia.edu/pub/stream/Tables/stream.db

1. a yes/no column for aggregate versus parallel execution
2. the vector length of the run

That way a user of the results could tell whether a result was aggregate or
parallel execution, and what the vector length of the run was.

Mike Greaves

Mar 14, 1997, 3:00:00 AM

Del Cecchi wrote:

> So the "SCI" interconnect being used on the new sun systems really operates at 1
> GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
> release on Sun's web page. 1Gb/second doesn't seem very fast.

I quickly read a description of the SCI a month or two back. If I
remember correctly, the copper version has 18 twisted pairs for data - 2
bytes plus parity (or maybe more sophisticated error handling) signalled
differentially. Clock speed is 250 MHz, but it's "double-clocked" -
data transfers occur on both rising and falling edges of the clock
waveform. Doing our math, that's (2 bytes per transfer) x (2 transfers
per clock cycle) x (250 MHz clock speed) = 1 GByte per second. Sounds
plenty fast to me! I'm just typing this from memory and I'm definitely
not a hardware engineer so I hope this picture is sufficiently accurate.
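
Spelled out as a small calculation, using only the figures recalled above (variable names are illustrative, and decimal units are assumed, i.e. 1 GByte = 10^9 bytes):

```python
bytes_per_transfer = 2     # 16 data bits per transfer
transfers_per_clock = 2    # data moves on both clock edges
clock_hz = 250e6           # 250 MHz

bytes_per_sec = bytes_per_transfer * transfers_per_clock * clock_hz
gbytes_per_sec = bytes_per_sec / 1e9
gbits_per_sec = bytes_per_sec * 8 / 1e9

print(f"{gbytes_per_sec} GByte/s = {gbits_per_sec} Gbit/s")
```

Note that 1 GByte/s is 8 Gbit/s, so a byte figure and a bit figure with the same leading number differ by a factor of eight.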

Mike Greaves
mgre...@pathcom.com

Greg Lindahl

Mar 14, 1997, 3:00:00 AM

In article <5gauot$g...@usenet.srv.cis.pitt.edu>,
Mark Hahn <ha...@neurocog.lrdc.pitt.edu> wrote:

>finally, just how useful are these numbers? with sufficient
>motivation, I expect I could configure a gym full of cheap PCs
>to deliver this kind of bandwidth - 112 MB/s per processor is nothing
>to write home about...

Two words: shared memory.

You can configure a distributed memory to do this easily. But this is
the fastest shared memory machine outside of the vector machines (a
Cray C90 beats this result with only 1 cpu... I suppose I should be
disappointed that this Cray design could not get as much bandwidth
per processor maxed out as an Ultra I.)

SGI Origin 2000 machines have this kind of bandwidth to local memory,
but not with uniform access. Ditto for HP/Convex Exemplar (even less
uniform than SGI) and DEC's "clusters" (in which the interconnect has
great latency for locking purposes, but no bandwidth to speak of.)

If you have a problem that requires this kind of bandwidth and uniform
access, then you'll find this number useful. If you don't understand
what the benchmark is doing, then you probably don't have a problem
which requires this kind of bandwidth.

-- g

Casper H.S. Dik - Network Security Engineer

Mar 14, 1997, 3:00:00 AM

nm...@cus.cam.ac.uk (Nick Maclaren) writes:

}>On a SS10-712 (dual 75Mhz Viking cpus, 1M Ecache),
}> memcpy of 1 M runs at 20 MB/sec
}>On a Ultra-2x167, (dual 167 Mhz Ultra cpus, 512K Ecache),
}> the same test runs at 170 MB/sec.
}>
}>Take a look at the streams numbers for an even bigger discrepancy.

}Thanks for the figures. That confirms my suspicions. The latter is
}only 8.5 times faster than the former, but 64 processors of the latter
}are 17.8 times faster than 8 processors of the former. Therefore store
}limited software has to have its cache miss ratio reduced by a factor
}of 2 to achieve the promised performance increase. And that is HARD.

It doesn't really confirm your suspicions as he quoted Ultra-2 numbers,
not UE 10000 numbers. The UE 10000 has a bandwidth of 10GB/s which
is roughly 40 times the bandwidth of a SS1000 (which runs at 250MB/s).

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Kai Harrekilde-Petersen

Mar 17, 1997, 3:00:00 AM

Mike Greaves <mgre...@pathcom.com> writes:
> Del Cecchi wrote:
> > So the "SCI" interconnect being used on the new sun systems really
> > operates at 1 GBIT/second, not 1 GBYTE/second? I wondered if this
> > was maybe a typo in the release on Sun's web page. 1Gb/second
> > doesn't seem very fast.
>
> I quickly read a description of the SCI a month or two back. If I
> remember correctly, the copper version has 18 twisted pairs for data - 2
> bytes plus parity (or maybe more sophisticated error handling) signalled
> differentially.

It's not parity. It's a flag that is used to delimit packets, for
easier decoding. Each packet is protected by a 16bit CRC (the
CCITT-CRC polynomial), so that gives plenty of error detection.

Kai
--
Kai Harrekilde-Petersen <k...@dolphinics.no> http://www.dolphinics.no/~khp/
"Argue for your limitations, and sure enough - they're yours". Richard Bach

Larry Kilgallen

Mar 17, 1997, 3:00:00 AM

In article <5gcnt8$3...@newsb.netnews.att.com>, f...@angels.att.com (Fred True) writes:

> Oh goodie, and it's VMS too. (sorry, I do like VMS, I grew up on it, but it
> is not pervasive in the, or at least THIS, corporate world).

It would seem that those who seek the highest possible speed machine
must allow themselves flexibility in some other areas. Certainly
those who purchased the first Crays did not have the luxury of
specifying their (various) previous favorite operating systems
to run on the Cray.

Cray, on the other hand, had to provide Fortran support to lure
those with existing code. In the intervening years we have come
a long way in language standardization, and anyone who is still
programming exclusively in non-portable assembly language will
not even be able to move between two vendors who each call their
operating system "Unix".

Larry Kilgallen

Del Cecchi

Mar 17, 1997, 3:00:00 AM

In article <332A14...@pathcom.com>, Mike Greaves <mgre...@pathcom.com> writes:
|> Del Cecchi wrote:
|>
|> > So the "SCI" interconnect being used on the new sun systems really operates at 1
|> > GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
|> > release on Sun's web page. 1Gb/second doesn't seem very fast.
|>
|> I quickly read a description of the SCI a month or two back. If I
|> remember correctly, the copper version has 18 twisted pairs for data - 2
|> bytes plus parity (or maybe more sophisticated error handling) signalled

|> differentially. Clock speed is 250 MHz, but it's "double-clocked" -
|> data transfers occur on both rising and falling edges of the clock
|> waveform. Doing our math, that's (2 bytes per transfer) x (2 transfers
|> per clock cycle) x (250 MHz clock speed) = 1 GByte per second. Sounds
|> plenty fast to me! I'm just typing this from memory and I'm definitely
|> not a hardware engineer so I hope this picture is sufficiently accurate.
|>
|> Mike Greaves
|> mgre...@pathcom.com

The 1GByte is what "normal" parallel SCI is supposed to be. There is also a
serial SCI and other variations. SUN's web site says 1Gbit not 1GByte/sec. I
guess I will have to wander off to the Dolphin web site, as it has been implied
that they are using the Dolphin switch. Anybody willing to say for sure?
--

Del Cecchi
cecchi@rchland

Larry Kilgallen

Mar 17, 1997, 3:00:00 AM

In article <5gd7ur$q...@rt.com>, lin...@rt.com (Greg Lindahl) writes:

> Two words: shared memory.
>
> You can configure a distributed memory to do this easily. But this is
> the fastest shared memory machine outside of the vector machines (a
> Cray C90 beats this result with only 1 cpu... I suppose I should be
> disappointed that this Cray design could not get as much bandwidth
> per processor maxed out as an Ultra I.)
>
> SGI Origin 2000 machines have this kind of bandwidth to local memory,
> but not with uniform access. Ditto for HP/Convex Exemplar (even less
> uniform than SGI) and DEC's "clusters" (in which the interconnect has
> great latency for locking purposes, but no bandwidth to speak of.)

I presume by "great" you mean "quite acceptable" rather than "large".

VMSclusters existed for 8 years before reflective memory was offered.
The DEC advice is to use the reflective memory for locking and
some other interconnect for bandwidth. Those other interconnects
(CI, DSSI, Ethernet, FDDI, ATM) were already in use at existing
cluster sites. The incremental effort to run a second cable between
machines seems no greater than the difference between existing
interconnects which use a single cable (Ethernet) and those which
use multiple cables (CI).

I think the message for comp.arch is that the best performance
will not always be achieved by using a single base technology
for all problems which at first glance resemble each other,
and that certainly seems to be the spirit of NUMA as well.

Larry Kilgallen

Kai Harrekilde-Petersen

Mar 18, 1997, 3:00:00 AM

Sun are using our switch and our SBus-SCI boards. The Link speed is
200MByte/sec, using PECL, with the LincController-1 (LC1). So, it is
really 1.6Gbit/sec. Our 2nd generation LincController (LC2) uses LVDS
signalling and runs at 500MByte/sec.

The surfers may try http://www.dolphinics.no/ and http://www.dolphinics.com/

Ling Wang

Mar 18, 1997, 3:00:00 AM

Mike Greaves wrote:
>
> Del Cecchi wrote:
>
> > So the "SCI" interconnect being used on the new sun systems really operates at 1
> > GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
> > release on Sun's web page. 1Gb/second doesn't seem very fast.
>
> I quickly read a description of the SCI a month or two back. If I
> remember correctly, the copper version has 18 twisted pairs for data - 2
> bytes plus parity (or maybe more sophisticated error handling) signalled
> differentially. Clock speed is 250 MHz, but it's "double-clocked" -
> data transfers occur on both rising and falling edges of the clock
> waveform. Doing our math, that's (2 bytes per transfer) x (2 transfers
> per clock cycle) x (250 MHz clock speed) = 1 GByte per second. Sounds
> plenty fast to me! I'm just typing this from memory and I'm definitely
> not a hardware engineer so I hope this picture is sufficiently accurate.

But how are they driving this off the existing hardware?
Are they interfacing it thru the UPA interface async?
If it is going thru the SBus, then they would be nowhere
near that speed.

And on Sun's marketing hoopla:

http://www.sun.com/products-n-solutions/hw/servers/hpc/tech/interconnect.html

It still specifies 1GBIT/Sec

Kai Harrekilde-Petersen

Mar 18, 1997, 3:00:00 AM

Ling Wang <wang...@jpmorgan.com> writes:
> Mike Greaves wrote:
> > Del Cecchi wrote:
> > > So the "SCI" interconnect being used on the new sun systems really operates at 1
> > > GBIT/second, not 1 GBYTE/second? I wondered if this was maybe a typo in the
> > > release on Sun's web page. 1Gb/second doesn't seem very fast.
> >
> > I quickly read a description of the SCI a month or two back. If I
> > remember correctly, the copper version has 18 twisted pairs for data - 2
> > bytes plus parity (or maybe more sophisticated error handling) signalled
> > differentially. Clock speed is 250 MHz, but it's "double-clocked" -
> > data transfers occur on both rising and falling edges of the clock
> > waveform. Doing our math, that's (2 bytes per transfer) x (2 transfers
> > per clock cycle) x (250 MHz clock speed) = 1 GByte per second. Sounds
> > plenty fast to me! I'm just typing this from memory and I'm definitely
> > not a hardware engineer so I hope this picture is sufficiently accurate.
>
> But how are they driving this off the existing hardware?
> Are they interfacing it thru the UPA interface async?
> If it is going thru the SBus, then they would be nowhere
> near that speed.

They go through SBus. The SCI interconnect is running at
200MByte/sec, and that's the speed they are quoting.

Hugh LaMaster

Mar 20, 1997, 3:00:00 AM

Thanks to Sun/Alan Charlesworth for publishing these:

> 1. Auto-parallel C Stream bandwidth
> Copy Scale Vadd Triad
> Cpus MBps MBps MBps MBps
> 1 164 164 202 202
> 8 1,271 1,270 1,544 1,546
> 16 2,371 2,414 2,942 2,905
> 24 3,568 3,577 4,292 4,305
> 32 4,397 4,408 5,166 5,188
> 40 5,317 5,374 6,162 6,222
> 48 5,961 6,056 6,861 6,914
> 56 6,183 6,304 7,131 7,128
> 63 6,307 6,391 7,203 7,197

Thanks also for posting these:

> The multiprocessor latency (the time per memory access when all processors
> in the system are cache missing doing a Stream-type loop) has improved
> significantly.
>
> SS1000
> processors time per memory access
> 1 1,350 ns
> 8 2,500 ns
>
> Starfire (UE/HPC 10000)
> processors time per memory access
> 1 250 ns
> 8 260 ns
> 64 400 ns
>

These latency numbers were very useful, although they
could be further explicated with an lmbench comparison.

---

As for the results, I think the numbers from several vendors
for recent systems, including Sun, reflect the fact that
workstation/server vendors now realize that memory performance
is the key to real, sustained performance. [In particular, there
was a trend, a couple of years back, to allow the L2 cache
to significantly degrade the performance to main memory, even
though server benchmarks showed very clearly that this strategy
was counterproductive. Depending on the workload, on some of
those systems performance was better *without* the expensive
L2 cache.] The last year or so has been gratifying, with several
vendors putting out decent-memory-performance systems. I don't
know how R-A-S features now compare with big iron, but,
performance-wise, these new systems should do very well
in what were previously "mainframe" applications. I wonder if
anyone will try to run an airline reservation system on the
new servers?

Hugh LaMaster

Mar 21, 1997, 3:00:00 AM

Fred True wrote:
>
> In article <1997Mar13.160620.1@eisner>,
> Rob Young <you...@eisner.decus.org> wrote:
> >
> > But why not do an end around on the problem of "node locality"
> > and have the nodes share the same system bus/backplane?
>
> To quote Richard Feynman: "For a successful technology, reality must take
> precedence over public relations, for nature cannot be fooled" (although in
> this case I think "marketing" can be substituted for "public relations").
>
> My point is that the latest SMP backplanes and centerplanes are being pushed
> to their physical limits as far as signal timings are concerned.

At some point, you go beyond "backplane" and on to "network".
It is interesting that so far SCI has found several takers
as far as a clustering technology, but, the high-visibility
customers seem to be shying away from using the
shared-memory/cache/coherence SCI features. [It probably
has to do with the fact that even in a single system,
single cache-coherent buses are giving way to "switches"
as a way to connect processors and memory. Maybe the
current SCI chipsets make [bus] assumptions that are
difficult to fulfill or work around. Pure 100% guess.]

So, anyway, quite a few postings in this thread seem
to be discussing the best technology for *clustering*.

Is it, indeed, a hierarchy of cache-coherent buses?
[I seem to recall a company in Cambridge, Mass. building
machines based on this idea.]

Or, is it SCI, with or without cache/coherent features?
Or, if you decide you just want a "network" - that is,
give up on shared-memory - how about HiPPI-6400 or FCS?
There are zillions of web pages on these latter three,
with products in various stages of development and production.

Alan Charlesworth

Mar 23, 1997, 3:00:00 AM

In article <333191...@nas.nasa.gov>, Hugh LaMaster
<lama...@nas.nasa.gov> wrote:

> These latency numbers were very useful, although they
> could be further explicated with an lmbench comparison.
>

Lmbench measures _single_ processor back-to-back load latency -- the time
per load, assuming that the instructions before and after are also
cache-missing loads. The outer loop is the stride size, and the inner loop
is the array size. For each array size, the benchmark creates a ring of
pointers that points forward by the stride size. Traversing the array is
done by:
p = (char **)*p;
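
The ring-of-pointers idea can be sketched as follows. This is an illustration of the data structure only, not lmbench's actual code: list indices stand in for pointers, the function names are made up, and timing this in Python would mostly measure interpreter overhead rather than memory latency.

```python
def build_ring(array_bytes, stride_bytes):
    """Return a list where ring[i] is the index of the next slot.

    Each slot stands in for one pointer. Slot i points forward by one
    stride to slot i+1, and the last slot wraps around to the first.
    """
    slots = array_bytes // stride_bytes
    return [(i + 1) % slots for i in range(slots)]

def chase(ring, loads):
    """Follow the chain for `loads` dependent steps, like p = *p in C."""
    p = 0
    for _ in range(loads):
        p = ring[p]   # the next load cannot start until this one returns
    return p

ring = build_ring(16 * 1024, 64)   # 16 KB array, 64-byte stride
print(len(ring), chase(ring, len(ring)))
```

Because every load's address comes from the previous load, the loads cannot overlap, which is what makes the measured time per step a latency rather than a bandwidth.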

I've collected profiles for three members of the Sun Ultra family, a
workstation, a midrange server (UE6000), and a high-end server
(Starfire/UE10000). As you would expect, the smaller machines have the
shortest latency, since they have the fewest layers of interconnect to
traverse.

Servers are designed with lots of bandwidth to keep latency reasonable when
many processors are all cache-missing at the same time. So usually there
will be a cross-over point where the smaller system's interconnect gets
saturated, and its _multiple-processor_ latency rises above that of the
larger machine.

I would be pleased to post lmbench results. To be specific, the
lat_mem_rd.c printout for a stride equal to the outermost-level cache block
size. Alas, I don't want to be the _only_ result posted. It would
certainly be interesting to see profiles for all the current architectures:
SGI, DEC, HP, Intel, and Sun.

There are NO current lmbench results posted on
http://reality.sgi.com/employees/lm/lmbench/lmbench.html. The latest are
from more than a year ago: January 1996. After all, what vendor is going to
post results to another vendor's web site!

What I modestly :-) propose is that someone with a neutral web site offer
to collect and post lat_mem_rd.c profiles. One possibility would be for John
McCalpin to offer to collect them on the Stream web site:
http://www.cs.virginia.edu/stream/. Yes, I know that John now works for
SGI, but a university posting place would provide a fig leaf of neutrality.

As an example of a lat_mem_rd trace, here is one for an Ultra 1/170
UltraSPARC-I workstation -- for the outermost cache block size of 64 bytes.
I used Larry McVoy's most recent version of the mhz.c routine (which
measures cpu clock) which he posted a while ago on this newsgroup. In the
results below, the first number is the data size in KB, and the second is
the reported latency in ns.

What we see is a 1 processor-clock latency to internal cache (16 KB direct
mapped), a 7 processor-clock latency to external cache (512 KB direct
mapped), and a 44 processor-clock latency to memory. In terms of
interconnect clocks, the memory latency is 22 clocks.

len=8388608
clk=6.00 ns 167 MHz
stride=64
Ultra 1-170 workstation
KB ns
0.5 5 <--+
1 5 |
2 5 |
3 5 |
4 5 |
6 5 Internal cache latency plateau
8 5 |
10 5 |
12 5 |
14 5 |
16 5 <--+
18 13
20 20
22 25
24 29
26 33
28 36
30 39
32 42 <--+
36 41 |
40 41 |
44 41 |
48 41 |
52 41 |
56 41 |
60 41 |
64 41 |
72 41 |
80 42 External cache latency plateau
88 42 |
96 41 |
104 42 |
112 41 |
120 42 |
128 42 |
144 42 |
160 42 |
176 42 |
192 42 |
208 48 |
224 48 |
240 47 |
256 47 |
280 47 |
320 46 |
362 46 |
384 45 |
416 45 |
448 45 |
480 45 <--+
512 71
1024 255 <--+
1536 262 |
2048 264 |
2560 264 |
3072 264 |
3584 265 |
4096 265 |
4608 265 Memory latency plateau
5120 264 |
5632 263 |
6144 264 |
6656 265 |
7168 265 |
7680 264 |
8192 265 <--+
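
The clock counts quoted before the trace can be reproduced from the plateau values. In this sketch the three latencies are representative values picked by eye from each plateau, and the interconnect clock period is inferred from the stated 22 interconnect clocks rather than given directly.

```python
CPU_CLOCK_NS = 6.0   # clk=6.00 ns, 167 MHz (from the trace header)

# Representative plateau latencies in ns, read off the trace above.
plateaus_ns = {"internal cache": 5, "external cache": 42, "memory": 264}

def to_clocks(latency_ns, clock_ns=CPU_CLOCK_NS):
    """Convert a latency in ns to the nearest whole number of clocks."""
    return round(latency_ns / clock_ns)

cpu_clocks = {level: to_clocks(ns) for level, ns in plateaus_ns.items()}
print(cpu_clocks)

# 22 interconnect clocks for a memory access implies an interconnect
# period of about 264 / 22 = 12 ns (inferred, not stated in the post).
interconnect_ns = plateaus_ns["memory"] / 22
print(interconnect_ns)
```

This gives 1, 7, and 44 processor clocks, and an implied interconnect period of twice the processor clock.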

Petter Gustad

Mar 24, 1997, 3:00:00 AM

Hugh LaMaster <lama...@nas.nasa.gov> writes:

I think you're right in that some people get scared away by cache
coherency protocols. However, you don't have to use the cache
coherency feature of SCI to utilize the high bandwidth and low
latency SCI gives you. E.g. the Dolphin Link Controller family of chips
*supports* cache coherency, but does not *implement* cache coherency.
The cache coherency protocol is implemented in a separate chip. Hence,
it's possible to use SCI as a fast non-coherent interconnect.

Building a switched topology is quite simple using SCI. The Dolphin
Link Controller family is used in switch products as well as in
non-coherent (loosely coupled I/O attached) and coherent (tightly
coupled processor attached) products.

See http://www.dolphinics.no and http://www.SCIzzL.com/ for more info
on SCI.

> So, anyway, quite a few postings in this thread seem
> to be discussing the best technology for *clustering*.
>
> Is it, indeed, a hierarchy of cache-coherent buses?
> [I seem to recall a company in Cambridge, Mass. building
> machines based on this idea.]
>
> Or, is it SCI, with or without cache/coherent features?
> Or, if you decide you just want a "network" - that is,
> give up on shared-memory - how about HiPPI-6400 or FCS?
> There are zillions of web pages on these latter three,
> with products in various stages of development and production.

See http://www.SCIzzL.com/SCIvsEtc.html for a comparison of
SCI/HiPPI,FCS,ATM,etc.

Petter
--
________________________________________________________________________
Petter Gustad http://www.dolphinICS.no/~pegu

Nick Maclaren

Mar 24, 1997

I have just been to a talk on the new Sun architecture, which clarifies
and confirms most of the points made in this discussion. Sun are most
definitely back as a serious supplier of scientific computing engines,
though I remain unconvinced about the scalability of the Starfire. Here
is what I estimate that it will do on various forms of load, assuming
64 processors:

Entirely local number-twiddling    10-20 GFlops
Very long matrix operations        0.5-1 GFlops
Gruesomely serialised code         down to 5 MFlops

Both the first and last are pretty unrealistic, but they indicate that
the system will degrade pretty badly as the application becomes more
memory-limited and less suitable for SMP hardware. So, will there be
enough applications to make the 64 processor version useful? It will
be interesting to see!

Note that this is separate from the question of whether it will make
money, which is not closely related to its technical merits.

Nick Maclaren,
University of Cambridge Computer Laboratory,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Alan Charlesworth

Mar 24, 1997

In article <5h6ppe$8...@lyra.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
Maclaren) wrote:

We have so far posted a couple of simple MFLOP results:

Linpack-N (March 9, 1997)
Num_Procs  Rmax(GFlops)  Nmax(order)  N1/2(order)  Rpeak(GFlops)
   64         21.37         15000        4200          32.0

This is highly parallel matrix work that can be blocked well into caches.
See:
http://performance.netlib.org/performance/html/linpack-parallel.data.col0.html

STREAM MFlops
ncpus  SCALE   ADD     TRIAD
  63   532.6   600.2   599.8   standard result (parallel C code)
  63   848.3   848.4   842.2   experimental result (parallel VIS assembler)

This is embarrassingly parallel vector work that is entirely cache-missing,
i.e. memory-to-memory.
See http://www.cs.virginia.edu/stream/standard/MFLOPS.html

Nick Maclaren

Mar 25, 1997

In article <alanc-ya02408000...@engnews2.eng.sun.com>, al...@west.sun.com (Alan Charlesworth) writes:
|> In article <5h6ppe$8...@lyra.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
|> Maclaren) wrote:
|>
|> > I have just been to a talk on the new Sun architecture, which clarifies
|> > and confirms most of the points made in this discussion. Sun are most
|> > definitely back as a serious supplier of scientific computing engines,
|> > though I remain unconvinced about the scalability of the Starfire. Here
|> > are what I estimate that it will do on various forms of load, assuming
|> > 64 processors:
|> >
|> > Entirely local number-twiddling 10-20 GFlops
|> > Very long matrix operations 0.5-1 Gflops
|> > Gruesomely serialised code down to 5 MFlops
|>

Ah. Thanks. I missed that. So the CPU does quadruple issue FP, too.
Change that 10-20 to 20-40 :-)

Alan Charlesworth

Mar 25, 1997

In article <5h8djn$3...@lyra.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
Maclaren) wrote:

Us old floppers would wish it were so, but alas, while the UltraSPARC-I and
II can issue four instructions per clock, there can be only 1 load or
store, 1 FP multiply, and 1 FP add. There is no combined multiply-add in
the SPARC architecture -- so SPARC gives up a factor of two in peak
performance to those micros that can do a multiply-add as one FP op: HP,
MIPS, and PPC (if my memory is correct).

--
Alan Charlesworth
Business Systems Division Engineering
Sun Microsystems

John McCalpin

Mar 25, 1997

In article <alanc-ya02408000...@engnews2.eng.sun.com>,
Alan Charlesworth <al...@west.sun.com> wrote:

>In article <5h8djn$3...@lyra.csx.cam.ac.uk>, nm...@cus.cam.ac.uk (Nick
>Maclaren) wrote:
>

>> [...] So the CPU does quadruple issue FP, too.
>>
>> Nick Maclaren,

>
>Us old floppers would wish it were so, but alas, while the UltraSPARC-I and
>II can issue four instructions per clock, there can be only 1 load or
>store, 1 FP multiply, and 1 FP add. There is no combined multiply-add in
>the SPARC architecture -- so SPARC gives up a factor of two in peak
>performance to those micros that can do a multiply-add as one FP op: HP,
>MIPS, and PPC (if my memory is correct).

Here is how I understand the field:

                                                  max FP
cpu            adders   multipliers   fused     ops/clock
---------------------------------------------------------
UltraSparc2       1          1          0           2

MIPS R8000        0          0          2           4
MIPS R10000       1          1          0           2

PA-7200           1          1          0           2
PA-8000           0          0          2           4

PowerPC           0          0          1           2
Power2/P2SC       0          0          2           4

Alpha 21164       1          1          0           2
---------------------------------------------------------

Corrections are welcome.....
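
As a rough illustration of how the "max FP ops/clock" column feeds into a peak rating, here is a minimal Python sketch. It assumes peak = processors x clock x FP ops per clock. The 250 MHz clock is not stated in this thread; it is an assumption inferred from the posted Rpeak of 32.0 GFlops on 64 processors at 2 ops/clock. The function names are hypothetical.

```python
# Sketch only: peak FP rate as processors x clock x FP ops per clock.
# ASSUMPTION: 250 MHz clock, inferred from Rpeak = 32.0 GFlops / 64 CPUs / 2 ops.

def peak_gflops(num_procs, clock_mhz, fp_ops_per_clock):
    """Theoretical peak in GFlops for a machine of identical processors."""
    return num_procs * clock_mhz * fp_ops_per_clock / 1000.0


def efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    rpeak = peak_gflops(64, 250, 2)           # one adder + one multiplier
    print(rpeak)                              # 32.0
    print(round(efficiency(21.37, rpeak), 3)) # Linpack-N Rmax vs Rpeak
    print(peak_gflops(64, 250, 4))            # hypothetical: two fused units
```

Under these assumptions the posted Linpack-N figure works out to about two thirds of peak, and a hypothetical part with two fused multiply-add units at the same clock would have twice the peak rating.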

Preston Briggs

Mar 27, 1997

Alan Charlesworth <al...@west.sun.com> wrote:
>We have so far posted a couple of simple MFLOP results:
>Linpack-N (March 9, 1997)
>Num_Procs Rmax(GFlops) Nmax(order) N1/2(order) Rpeak(GFlops)
> 64 21.37 15000 4200 32.0

>STREAM MFlops

>ncpus SCALE ADD TRIAD
> 63 532.6 600.2 599.8 standard result (parallel C code)

>This is embarassingly parallel vector work that is entirely cache-missing,
>i.e. memory-to-memory.

Why only 63 processors?

Incidentally, STREAM is not entirely out of cache. It accesses memory
with unit stride, so there's quite a lot of benefit from long cache
lines.

Preston Briggs

Alan Charlesworth

Mar 30, 1997

In article <5hei7l$c4v$1...@joe.rice.edu>, pre...@cs.rice.edu (Preston
Briggs) wrote:

> Alan Charlesworth <al...@west.sun.com> wrote:
> >We have so far posted a couple of simple MFLOP results:
> >Linpack-N (March 9, 1997)
> >Num_Procs Rmax(GFlops) Nmax(order) N1/2(order) Rpeak(GFlops)
> > 64 21.37 15000 4200 32.0
>
> >STREAM MFlops
> >ncpus SCALE ADD TRIAD
> > 63 532.6 600.2 599.8 standard result (parallel C code)
>
> >This is embarassingly parallel vector work that is entirely cache-missing,
> >i.e. memory-to-memory.
>
> Why only 63 processors?
>

I didn't want any other activity in the system to interfere with my
timings. To avoid having one processor get distracted in the middle of a
parallel loop, I just set the NCPUS to 63, leaving a processor free in case
someone else logged in, etc.

> Incidentally, STREAM is not entirely out of cache. It accesses memory
> with unit stride, so there's quite a lot of benefit from long cache
> lines.

Longer cache lines decrease the coherency "snooping" rate that is required
to support a given data bandwidth. SPARC systems use a 64-byte cache block,
so the Starfire's peak data rate of 10,667 MBps requires a coherency
bandwidth of 10,667/64 = 167 million snoops per second.
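
The arithmetic above can be sketched in a few lines of Python. This assumes that every cache block moved costs exactly one snoop and that MBps means 10^6 bytes per second; the function name is made up for illustration.

```python
# Sketch only: snoop rate needed to sustain a given data rate.
# ASSUMPTIONS: one snoop per cache block transferred; 1 MB = 10**6 bytes.

def snoops_per_second_millions(data_rate_mbps, block_bytes):
    """MB/s divided by bytes per block gives millions of blocks per second."""
    return data_rate_mbps / block_bytes


if __name__ == "__main__":
    for block in (32, 64, 128):
        print(block, round(snoops_per_second_millions(10667, block)))
```

For the 10,667 MBps figure this gives roughly 333, 167, and 83 million snoops per second for 32-, 64-, and 128-byte blocks respectively.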

If my memory serves me correctly, these are the current cache block sizes
for other systems:

128 bytes   SGI R10000
 64 bytes   Sun UltraSPARC (UPA), DEC/Alpha
 32 bytes   HP 7200/8000, Intel/Pentium

A study at http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html
suggests that a 64 byte block size may be an optimum block size for SPEC95.

Increasing the cache block size by 2x -- as SGI has done in their recent
generation -- saves a factor of two in coherency traffic for a given data
rate. I suspect all these block sizes will slowly creep up over time.

Implementing a high _coherency_ bandwidth while still keeping a reasonable
latency is the hard part of implementing a cache-coherent SMP. This is
especially true for sharing of modified data -- i.e. where _my_ cache miss
has to be satisfied from _your_ cache, because your cache has the most
recently modified data. Data bandwidth is easy in comparison.

Mark W Brehob

Mar 31, 1997

Alan Charlesworth (al...@west.sun.com) wrote:
: In article <5hei7l$c4v$1...@joe.rice.edu>, pre...@cs.rice.edu (Preston
: Briggs) wrote:

: Longer cache lines decrease the coherency "snooping" rate that is required
: to support a given data bandwidth. SPARC systems use a 64-byte cache block,
: so the Starfire's peak data rate of 10,667 MBps requires a coherency
: bandwidth of 10,667/64 = 167 million snoops per second.

: If my memory serves me correctly, these are the current cache block sizes
: for other systems:

: 128 bytes SGI R10000
: 64 bytes Sun UltraSPARC (UPA), DEC/Alpha
: 32 bytes HP 7200/8000, Intel/Pentuium

: A study at http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html
: suggests that a 64 byte block size may be an optimum block size for SPEC95.

: Increasing the cache block size by 2x -- as SGI has done in their recent
: generation -- saves a factor of two in coherency traffic for a given data
: rate. I suspect all these block sizes will slowly creep up over time.

Erb?
False sharing could keep your coherence traffic at the same
level or even higher! This is going to be very
application dependent. In addition, raw data bandwidth
will go up (up to 2x!)

: Implementing a high _coherency_ bandwidth while still keeping a reasonable
: latency is the hard part of implementing a cache-coherent SMP. This is
: especially true for sharing of modified data -- i.e. where _my_ cache miss
: has to be satisfied from _your_ cache, because your cache has the most
: recently modified data. Data bandwidth is easy in comparison.

True, but longer cache lines will increase the number of
"dirty cache hits" (false sharing). Not to mention more than doubling the
bandwidth of those cases.

My point is that the issues are not simple, not even
for coherency traffic.
(I'm fairly sure you were aware of that but I thought
I'd clarify)
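
To make the false-sharing concern concrete, here is a small Python sketch. It counts cache lines written by more than one CPU for a made-up layout in which each CPU updates its own 64-byte record, packed back to back. The layout, function name, and line sizes are assumptions for illustration, not measurements of any real system.

```python
# Sketch only: lines written by more than one CPU under different line sizes.
# ASSUMPTION: each CPU writes only its own record, so any shared line here
# is false sharing rather than true sharing.
from collections import defaultdict


def falsely_shared_lines(writes, line_bytes):
    """writes: iterable of (cpu_id, byte_address).
    Returns the set of line indices written by more than one CPU."""
    writers = defaultdict(set)
    for cpu, addr in writes:
        writers[addr // line_bytes].add(cpu)
    return {line for line, cpus in writers.items() if len(cpus) > 1}


if __name__ == "__main__":
    # Hypothetical layout: CPU n owns bytes [64*n, 64*n + 63].
    writes = [(cpu, cpu * 64) for cpu in range(4)]
    print(len(falsely_shared_lines(writes, 64)))   # no line has two writers
    print(len(falsely_shared_lines(writes, 128)))  # neighbours now collide
```

With 64-byte lines each record has a line to itself; with 128-byte lines every pair of neighbouring records lands in one line, so writes by one CPU invalidate the other's copy even though no data is actually shared.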

: --

: Alan Charlesworth
: Business Systems Diivision Engineering
: Sun Microsystems

Mark Brehob
Speaking only for myself.

Hugh LaMaster

Apr 4, 1997

Mark W Brehob wrote:
>
> Alan Charlesworth (al...@west.sun.com) wrote:
> : In article <5hei7l$c4v$1...@joe.rice.edu>, pre...@cs.rice.edu (Preston
> : Briggs) wrote:
>
> : Longer cache lines decrease the coherency "snooping" rate that is
> : required to support a given data bandwidth. SPARC systems use a
> : 64-byte cache block, so the Starfire's peak data rate of 10,667 MBps
> : requires a coherency bandwidth of 10,667/64 = 167 million snoops per
> : second.
>
> : If my memory serves me correctly, these are the current cache block
> : sizes for other systems:
>
> : 128 bytes SGI R10000
> : 64 bytes Sun UltraSPARC (UPA), DEC/Alpha
> : 32 bytes HP 7200/8000, Intel/Pentium
>
> : A study at http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html
> : suggests that a 64 byte block size may be an optimum block size for
> : SPEC95.

http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html

I think this study confirmed that smallish L1 caches
are generally optimum at 32B and somewhat larger L2 caches
are optimum at 64B. Today's 1-4 MB L2/L3 cache sizes
were not studied. YMMV depending on direct-mapped vs.
set-associative caches with 1, 2, or 3 levels of cache
of various sizes with various access times. The study
used (32-bit) SPARC binaries as the sample code.
It would be interesting to redo the study using a 64-bit
machine in 64-bit mode.

One of the interesting things about this study is that
it suggested that a small, 16KB, 2-way set-associative
cache may be "sufficient", attention then to be given
to the L2 cache (and main memory). [On a 64-bit machine,
the "sufficient" point might move to 32KB, I would guess.]
[Some previous studies suggested a direct-mapped L1 cache
twice as big would be equivalent, but this study found
in favor of a smaller 2-way set-associativity as the optimum.]

> : Increasing the cache block size by 2x -- as SGI has done in their
> : recent generation -- saves a factor of two in coherency traffic for
> : a given data rate. I suspect all these block sizes will slowly creep
> : up over time.
>
> Erb?
> False sharing could keep your coherence traffic at the same
> level or even higher! This is going to be very
> application dependent. In addition raw data bandwidth
> will go up (upto 2x!)

Applications sometimes show sharp knees as a
function of data cache size. When you average
a bunch of them together, you get some kind of
smooth curve.

> : Implementing a high _coherency_ bandwidth while still keeping a
> : reasonable latency is the hard part of implementing a cache-coherent
> : SMP. This is especially true for sharing of modified data -- i.e.
> : where _my_ cache miss has to be satisfied from _your_ cache, because
> : your cache has the most recently modified data. Data bandwidth is
> : easy in comparison.

Data bandwidth is rarely "easy", but, the point about
coherency bus traffic stands.

As for data, there is a tradeoff between block size
and wasted bandwidth. For a test case, one might
consider a parallelized 3D FFT job. On a system
where there is only a small penalty for non-unit
strides, one could simply parallelize each step and
proceed as-is. On a machine which for cache or
vector-related reasons performs much better with
unit strides, one might transpose the arrays when
the direction changes. Then, the performance may
be limited by how fast the system can transpose,
and, what effect that has on, for example, the
available system-bus capabilities. The difference
between a "good" design, where the transpose adds
a few percent to the overall run time, and a bad design,
with unmerciful cache and TLB thrashing and bus
coherency traffic, may be huge for some jobs.
There are many other similar examples. In many
numerical simulation applications, if there is
a big penalty for non-unit stride, then transposes
must be done periodically. Either way, there will
definitely be wasted data bandwidth, and possibly
lots of coherency traffic, TLB reloads, etc.
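
A back-of-the-envelope Python sketch of the wasted data bandwidth for strided access follows. It assumes 8-byte elements, a sweep over an array far larger than the cache, and no reuse of a line before it is evicted; the function name and defaults are hypothetical.

```python
# Sketch only: fraction of each fetched cache line that a strided sweep uses.
# ASSUMPTIONS: array much larger than cache, no reuse before eviction,
# elements aligned and smaller than a line.

def useful_fraction(stride_elems, elem_bytes=8, line_bytes=64):
    """Average fraction of fetched bytes that the program actually reads."""
    stride_bytes = stride_elems * elem_bytes
    if stride_bytes >= line_bytes:
        # Only one element is used out of every line fetched.
        return elem_bytes / line_bytes
    # Several elements land in each line; the rest of the line is wasted.
    return elem_bytes / stride_bytes


if __name__ == "__main__":
    for stride in (1, 2, 8, 1024):
        print(stride, useful_fraction(stride), useful_fraction(stride, line_bytes=128))
```

Under these assumptions a large-stride sweep uses one eighth of the data it fetches with 64-byte lines and one sixteenth with 128-byte lines, which is the cost a transpose tries to avoid.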

> True, but longer cachelines will increase the number of
> "dirty cache hits"(false sharing). Not to mention over doubling the
> bandwidth of those cases.

Not to mention.

> My point is that the issues are not simple, not even
> for coherancy traffic.
> (I'm fairly sure you were aware of that but I thought
> I'd clarify)

Some of the comments appear to be more oriented
towards, e.g., database applications, which seem to be
well-behaved with respect to large caches. Not all
applications are, and then things get very complex.

--
Hugh LaMaster, M/S 258-5,     ASCII Email: hlam...@mail.arc.nasa.gov
NASA Ames Research Center     Or: lama...@nas.nasa.gov
Moffett Field, CA 94035-1000  No Junkmail: USC 18 section 2701
Phone: 415/604-1056           Disclaimer: Unofficial, personal *opinion*.

Alan Charlesworth

Apr 5, 1997

In article <334552...@nas.nasa.gov>, Hugh LaMaster
<lama...@nas.nasa.gov> wrote:
>
> As for data, there is a tradeoff between block size
> and wasted bandwidth. For a test case, one might
> consider a parallelized 3D FFT job. On a system
> where there is only a small penalty for non-unit
> strides, one could simply parallelize each step and
> proceed as-is. On a machine which for cache or
> vector-related reasons performs much better with
> unit strides, one might transpose the arrays when
> the direction changes. Then, the performance may
> be limited by how fast the system can transpose,
> and, what effect that has on, for example, the
> available system-bus capabilities. The difference
> between a "good" design where the transpose adds
> a few percent to the overall design, and a bad design,
> with unmerciful cache and TLB thrashing, and bus
> coherency traffic, may be huge for some jobs.
> There are many other similar examples. In many
> numerical simulation applications, if there is
> a big penalty for non-unit stride, then transposes
> must be done periodically. Either way, there will
> definitely be wasted data bandwidth, and possibly
> lots of coherency traffic, TLB reloads, etc.
>
> --
> Hugh LaMaster, M/S 258-5,     ASCII Email: hlam...@mail.arc.nasa.gov
> NASA Ames Research Center     Or: lama...@nas.nasa.gov
> Moffett Field, CA 94035-1000  No Junkmail: USC 18 section 2701
> Phone: 415/604-1056           Disclaimer: Unofficial, personal *opinion*.

Alas, one person's wasted _data_ bandwidth is another person's saving in
_coherency_ bandwidth. If cache-coherent systems checked coherency on a
doubleword (8-byte) basis, then non-unit strides wouldn't waste data
bandwidth. But this would require 8 times the coherency bandwidth of an
implementation that used 64-byte coherency blocks.

Today, given large pin-count ASICs from which to make routers, it is much
easier to increase the data bandwidth in a system by 8x in trade for
reducing the coherency traffic by 8x. Packets on the data interconnect are
independent of each other, while the whole point of coherency traffic is to
find out where in the system the current copy of a cache line is.

--
Alan Charlesworth
Business Systems Division Engineering
Sun Microsystems

Hugh LaMaster

Apr 7, 1997

Alan Charlesworth wrote:

> Alas, one person's wasted _data_ bandwidth is another person's
> saving in _coherency_ bandwidth.

"Well, yes, that is the trick isn't it? ... It's going
to cost you extra ..."

> If cache-coherent systems checked coherency on a doubleword (8-byte)
> basis, then non-unit strides wouldn't waste data bandwidth. But, this
> would require 8-times the coherency bandwidth over an implementation
> that used 64-byte coherency blocks.

But for this particular type of "code fragment" -
that is - for doing array transposes - there would be
little difference in coherency traffic, because with
large non-unit strides, there are going to be minimal
cache hits. That is the problem.

Yes, once you get the array transposed, and all your
array accesses are unit-stride, then there are substantial
savings.
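
Both sides of this exchange can be checked with a small Python sketch that counts the distinct coherency blocks touched by one sweep. It assumes 8-byte elements, a cold cache, and one coherency transaction per block touched; the function name and the sweep sizes are made up for illustration.

```python
# Sketch only: coherency transactions for one strided sweep.
# ASSUMPTIONS: cold cache, one transaction per distinct block touched,
# 8-byte elements starting at address 0.

def coherency_transactions(num_elems, stride_elems, block_bytes, elem_bytes=8):
    """Number of distinct coherency blocks touched by the sweep."""
    blocks = {
        (i * stride_elems * elem_bytes) // block_bytes
        for i in range(num_elems)
    }
    return len(blocks)


if __name__ == "__main__":
    for stride in (1, 1024):
        fine = coherency_transactions(4096, stride, 8)    # doubleword blocks
        coarse = coherency_transactions(4096, stride, 64) # 64-byte blocks
        print(stride, fine, coarse, fine / coarse)
```

Under these assumptions the unit-stride sweep needs 8 times as many transactions with doubleword blocks as with 64-byte blocks, while the large-stride sweep needs the same number either way.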

So, for this type of application (one which does transposes
every so often), what is the optimum? Then, there are the
applications where doing transposes doesn't buy you anything,
and you may have lots of gathers/scatters.

My point is this: if you want a processor to be generally
suitable for (medium-to-large-scale) engineering and scientific
problems, the processor has to be able to handle a certain
percentage of non-unit strides and gathers/scatters. If the
processor slows to a crawl in such cases, then it isn't
suitable.

A pure guess on my part would be that the typical range
for (modern, well-structured) applications these days
would be that 5-30% of memory accesses would be via
non-unit strides on large arrays. IMHO, this range
should be handled somewhat gracefully by the TLB/cache/bus.
Anybody else out there care to venture a guess on what
these numbers look like these days?

Glen Clark

Apr 12, 1997

to Hugh LaMaster

Hugh LaMaster wrote:

> Applications sometimes show sharp knees as a
> function of data cache size. When you average
> a bunch of them together, you get some kind of
> smooth curve.

I would have called it a step-function rather than a knee,
but I won't quibble over terminology.

Does anyone know of a cache probe tool with which you can tell
where you are with reference to the next discontinuity? If you
have a program that runs in, say, 252K of cache and it has 256K to
work in, all (or most) access is local and execution will fly.
But if the program requires 257K to run, performance can fall
by almost half as the system spends much of its time paging
things in and out of L1.

If you're over the threshold by 40%, there's little chance you can
shrink the program enough to get under it. (Or if you're already
running fully in L1, there is no benefit in further size reduction.)
It's when you're over by 8% that there's the chance of getting it
to "fit in the box" by careful rewrite.

At present, the only way we have to tell whether we're over or
under, and by how much, is to launch multiple similar programs
of different sizes, plot their execution times, and look
for the discontinuity. This gets the job done, but it is rather
inelegant.
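
The "plot and look for the discontinuity" step can be automated in a crude way. The Python sketch below scans (size, time) measurements for the first step where the time grows by more than a chosen factor. The sample values are taken from the latency table posted earlier in this thread; the 1.25x threshold and the function name are arbitrary assumptions.

```python
# Sketch only: locate the first sharp step in time-versus-size measurements.
# ASSUMPTION: a step counts as a discontinuity when time grows by more than
# `threshold` times between adjacent sizes (1.25 is an arbitrary choice).

def first_jump(samples, threshold=1.25):
    """samples: list of (size, time) sorted by size.
    Returns (size_before, size_after) around the first jump, or None."""
    for (s0, t0), (s1, t1) in zip(samples, samples[1:]):
        if t1 > threshold * t0:
            return (s0, s1)
    return None


if __name__ == "__main__":
    samples = [(416, 45), (448, 45), (480, 45), (512, 71), (1024, 255), (1536, 262)]
    print(first_jump(samples))  # (480, 512)
```

This only brackets the discontinuity between two measured sizes; finer spacing of the test sizes near the bracket would be needed to say how far over or under a given program is.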

I have to think that someone, somewhere has addressed this point
and has written a utility to collect statistics on cache
utilization. Has anyone ever seen such a tool?

Thanks.

--
Glen Clark
gl...@clarkcom.com

Zalman Stern

Apr 14, 1997

Glen Clark (gl...@clarkcom.com) wrote:
: Does anyone know of a cache probe tool with which you can tell
: where you are with reference to the next discontinuity?

The only way I can see this being done is to run the program in a simulator
either doing multiple runs with different cache geometries and sizes, or
collecting a full memory trace and then running the trace through a cache
simulator. (The advantage of the second approach being that one might be
able to collect the trace on real hardware.) I think the MIPS pixie
tool could collect a complete memory trace. I expect such a feature was
rarely used because memory traces are quite large.
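
For the trace-driven approach, a minimal cache simulator might look like the Python sketch below. It is not the pixie tool or any real simulator. It assumes a fully associative LRU cache, and the trace is a synthetic cyclic sweep rather than a captured one. The working sets of 252 and 257 lines against a 256-line cache are a scaled-down stand-in for the 252K/257K versus 256K example quoted above.

```python
# Sketch only: replay an address trace through a simulated cache.
# ASSUMPTIONS: fully associative, LRU replacement, synthetic cyclic trace.
from collections import OrderedDict


def miss_rate(trace, cache_lines, line_bytes=64):
    """Returns misses / accesses for the given byte-address trace."""
    cache = OrderedDict()  # line index -> True, ordered oldest to newest use
    misses = 0
    accesses = 0
    for addr in trace:
        accesses += 1
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return misses / accesses if accesses else 0.0


def cyclic_trace(working_set_lines, passes, line_bytes=64):
    """Synthetic trace: sweep the working set in order, `passes` times."""
    return [line * line_bytes
            for _ in range(passes)
            for line in range(working_set_lines)]


if __name__ == "__main__":
    for lines in (252, 257):
        print(lines, miss_rate(cyclic_trace(lines, 10), cache_lines=256))
```

With LRU and a cyclic sweep the step is as sharp as it can be: the 252-line working set misses only on the first pass, while the 257-line one misses on every access. A real cache with limited associativity and a real trace would usually show a softer step.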

-Z-

Eugene Miya

Apr 16, 1997

Hum, I go travel a while and catch up....

In article <1997Mar17.085923.1@eisner>,
Larry Kilgallen <kilg...@eisner.decus.org> wrote:
>It would seem that those who seek the highest possible speed machine
>must allow themselves flexibility in some other areas.

Time vs. storage trade-off.

>Certainly, those who purchased the first Crays did not have the luxury of
>specifying their (various) previous favorite operating systems
>to run on the Cray.

"Operating system? What's an operating system?"
Don't laugh (I'm ROTFL): for the longest period of time
(should this be in comp.unix.cray?) buying a Cray-1 meant
you wrote your own OS. That's what REAL SITES did. I seem to recall
something like 5-6 different OSes, and recall some speculation in
certain circles that SGI purchased CRI for spite value, to get
even with certain people (but I can't believe that).

Then came the Cray-2. None of the old OSes would have run on it (CAL or
LRLTRAN memory address limitations). 2-3 sites were in the running for
the first one (the first, real one, whatever that meant).
At Ames, our CFD users wanted VMS. That was ROTFL value.
So Creon and I pushed for Unix. Dick Watson pushed for a new OS, NLTSS,
which was needed work. Creon and I did consider 4.1c BSD and faking
the VM, but we grew to think that was "pushing it." Mark Aaker pushed
TCP/IP (Dick pushed LINCS). Then there were a bunch of guys willing to try
and convert CTSS (that was a laugh and we wished them well [eliminate
the competition]).

What a great time. The predictability and forecasting made it easy to
out-guess your managers.

>Cray, on the other hand, had to provide Fortran support to lure
>those with existing code. In the intervening years we have come
>a long way in language standardization, and anyone who is still
>programming exclusively in non-portable assembly language will
>not even be able to move between two vendors who each call their
>operating system "Unix".

Lure isn't the right word. Speed was and still is the lure.
CRI chose to rewrite the cft compiler in Pascal for cft77.
The Japanese supercomputer vendors provided incentive, and CRI's
internal maintenance costs made it advantageous to upgrade the language
and compiler. Fort77 loops are easier to vectorize than Fort66 loops.

Your level of seriousness as a supercomputer site would reflect
on your willingness to achieve speed. Assembly language or not.

God did not have an installed base.

What's this got to do with SUN and benchmarks, guys?

Hugh LaMaster

Apr 17, 1997

to Eugene Miya

Eugene Miya wrote:

> Then came the Cray-2. None of the old OSes would have run on it (CAL or
> LRLTRAN memory address limitations). 2-3 sites were in the running for
> the first one (the first, real one, what ever that meant).
> At Ames, our CFD users wanted VMS. That was ROTFL value.

^^^^^^^^^^^^^^^^^^^^^^^^
Yes, ... I remember it well. It would have been an interesting
exercise without dynamic address translation hardware. Likewise
for BSD Unix.

> So Creon and I pushed for Unix.

The non-VM heritage of Unix, the fact that some of the kernel
and all the utilities were written in C, and that it had
already been ported to odd word-size machines, made it an
easier port than most of the alternatives. VMS made obvious
assumptions about the underlying hardware, for example, as
did most other OS's.

> Dick Watson pushed for a new OS, NLTSS,
> which was needed work.

LTSS/CTSS had some really nice features, actually.
For example, the scheduler understood swapping - an
important feature in a big iron machine intended for
large-memory jobs - and the scheduler was priority-queue
based. I/O was really fast for the time, although
the filesystem and directory structures were rather
"unfriendly". But, I'm (still) not convinced that it
wouldn't have worked, maybe even better at first -
the early versions of mainframe Unix were not all that
wonderful.

> Creon and I did consider 4.1c BSD and faking
> the VM, but we grew to think that was "pushing it." Mark Aaker pushed
> TCP/IP (Dick pushed LINCS).

You forgot to mention the DECnet faction! Not to mention
that XNS was developing a very considerable following in
industry. IMHO, the choice of TCP/IP turned out to be of
great historical importance. [Because of 4.x BSD Unix,
it also helped bring Unix along for the ride. Users were
used to running with at least two OSes at the time, anyway.]
But until TCP/IP was standardized, it was murder making
them communicate - networking was an O(N^2) problem,
with every site writing its own machine-machine gateways.
TCP/IP made it an O(N) problem.
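
The scaling argument can be put in numbers with a short Python sketch. It assumes one gateway per unordered pair of machine types in the site-by-site case, and one protocol implementation per machine type once a common protocol exists; the function names are hypothetical.

```python
# Sketch only: gateways needed with and without a common protocol.
# ASSUMPTIONS: one gateway per unordered pair of machine types; with a
# common protocol, one implementation per machine type.

def pairwise_gateways(n):
    """Machine-to-machine gateways for n mutually incompatible types."""
    return n * (n - 1) // 2


def common_protocol_implementations(n):
    """Implementations needed when every type speaks one shared protocol."""
    return n


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, pairwise_gateways(n), common_protocol_implementations(n))
```

For 10 machine types that is 45 gateways against 10 implementations, and the gap widens quadratically from there.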

> Then there a bunch of guys willing to try
> and convert CTSS (that was a laugh and we wished them well [eliminate
> the competition]).

So, why was the port so difficult? Whoops! No sense
bringing up a 15-year-old argument. OK, Unix won.
[Or was it Wintel?]

> Lure isn't the right word. Speed was and still is the lure.

20 years ago, the received wisdom was that computers were
already so fast and cheap that programming costs dominated
everything, and, therefore, raw speed and price/performance
didn't matter much. What many people didn't realize then
was that for every major improvement in speed, 4X-10X say,
new applications could appear which were impractical before.
So, it wasn't just a question of doing today's computations
faster and cheaper. Interestingly, many people still think
that way, despite the history of computing and the ever-
expanding new classes of computing that have appeared.
Who could imagine, then [besides Doug Engelbart], that
4-year-olds would be accessing "PCs" [faster than CDC 7600s]
using color GUIs. Likewise, on big iron, each speed
improvement allows new capabilities. We still have a long
way to go before, for example, four-hour worldwide mesoscale
weather forecasts are practical, but every big jump gets us
that much closer. Unfortunately, new generations of high-end
hardware are getting ever more expensive to develop.
[#include "Killer micros, parallelism, and all that." &etc.]

> Your level of seriousness being a supercomputer site would reflect
> on your willingness to achieve speed. Assembly language or not.
>
> God did not have an installed base.
>
> What's this got to do with SUN and benchmarks, guys?

I don't know.

Eugene Miya

Apr 18, 1997

In article <335661...@nas.nasa.gov>,
Hugh LaMaster <lama...@nas.nasa.gov> wrote:
>> At Ames, our CFD users wanted VMS. That was ROTFL value.
> ^^^^^^^^^^^^^^^^^^^^^^^^
>Yes, ... I remember it well. It would have been an interesting
>exercise without dynamic address translation hardware. Likewise
>for BSD Unix.

But it was doable. We had just found Joy's paper on Berkeley Unix.

>LTSS/CTSS had some really nice features, actually.

More work of the type done on NLTSS is needed.

>But, I'm (still) not convinced that it
>wouldn't have worked, maybe even better at first -
>the early versions of mainframe Unix were not all that
>wonderful.

One of the BTL guys noted that Unix probably set OS research back ten
years or so. NLTSS ran on one of the LLNL Crays for a year or so.
We only learn by trying and hacking (trial and error).

>You forgot to mention the DECnet faction!

Still breathing, but until when?

>Not to mention that XNS was developing a very considerable following in
>industry.

That was SGI's initial mistake (SV was the other). They got over it.

>IMHO, the choice of TCP/IP turned out to be of
>great historical importance. [Because of 4.x BSD Unix,
>it also helped bring Unix along for the ride. Users were
>used to running with at least two OSes at the time, anyway.]

Vint Cerf apologized for 32-bit addressing in TCP in February.
Popek made the same mistake in LOCUS and Rashid in Mach.
The importance was really: the ARPAnet lost DEC-20s, they needed a new
machine, common software helped.

>But until TCP/IP was standardized, it was murder making
>them communicate - networking was an O(N^2) problem,
>with every site writing its own machine-machine gateways.
>TCP/IP made it an O(N) problem.

We still really don't have a "standard" nor do I think we should.
It's a de facto standard, not a de jure one. That's what killed X.25
and other ISO developments. Working software had value.
It wasn't vapor.

One BSDism which is still not used is protocol families. The NLTSS guys
decided to hack the kernel (a mistake) for LINCS. It will be
interesting to see if families really evolve over time.

>>CTSS

>So, why was the port so difficult? Whoops! No sense
>bringing up a 15-year-old argument. OK, Unix won.
>[Or was it Wintel?]

The Cray-1/Cray-X-MP had hardware addressing limitations.
The Cray-2 dropped B and T registers. Where was all that OS code going to
go? There was a local memory now (16 KW, not a cache) since dropped on the
Cray-4. CTSS didn't account for that new feature.
I don't believe that military analogies work for economics nor
biological analogies for computers. Wintel will "win" unless something
comes along to challenge it (and it doesn't cause problems).

>20 years ago, the received wisdom was that computers were
>already so fast and cheap that programming costs dominated
>everything, and, therefore, raw speed and price/performance
>didn't matter much. What many people didn't realize then
>was that for every major improvement in speed, 4X-10X say,
>new applications could appear which were impractical before.

Progress is one dimension, however, reading about expectations of
50x speed is another (6600 vs. its competitors and predecessors).

>So, it wasn't just a question of doing today's computations
>faster and cheaper. Interestingly, many people still think
>that way, despite the history of computing and the ever-
>expanding new classes of computing that have appeared.
>Who could imagine, then [besides Doug Englebart], that
>4-year-olds would be accessing "PCs" [faster than CDC7600's]

I give Kay credit.

>using color GUIs. Likewise, on big iron, each speed
>improvement allows new capabilities. We still have a long
>ways to go before, for example, four-hour worldwide mesoscale
>weather forecasts are practical, but every big jump gets us
>that much closer. Unfortunately, new generations of high-end
>hardware are getting ever more expensive to develop.
>[#include "Killer micros, parallelism, and all that." &etc.]

Development requires that we develop the infrastructure to develop the
new big machines. The US has not demonstrated that interest except a
few conferences, meetings, and lists. We are more than willing to apply
machines, but it's the infrastructure which is key. We can't sell what
we don't have. Hence people grab at straws (e.g., Top-500 or Gunter's list).

>> What's this got to do with SUN and benchmarks, guys?
>I don't know.

We'll blame those guys cross-posting and not using Followup lines
or editing headers.

Glen Clark

Apr 20, 1997

to Eugene Miya

Eugene Miya wrote:

> One of the BTL guys noted that Unix probably set OS research back ten
> years or so. NLTSS ran on one of the LLNL Crays for a year or so.
> We only learn by trying and hacking (trial and error).

I would be interested in seeing some supporting thoughts for this
conclusion.

--
Glen Clark
gl...@clarkcom.com
State College, PA

Eugene Miya

Apr 22, 1997, 3:00:00 AM4/22/97

to

Eugene Miya wrote:
>> One of the BTL guys noted that Unix probably set OS research back ten
>> years or so. NLTSS ran on one of the LLNL Crays for a year or so.
>> We only learn by trying and hacking (trial and error).

In article <335A60AD...@clarkcom.com>,

Glen Clark <gl...@clarkcom.com> wrote:
>I would be interested in seeing some supporting thoughts for this
>conclusion.

I am not quite clear on which point: setting back OS research or hacking.

On OS research:
Comp.os.research is a slightly better group for that discussion (Darrell
moderates it here from this same host). Thompson, Ritchie, Pike, and the
other Labs people have gone on record saying the above. They don't do
Unix anymore, they did Plan 9, and they are doing Inferno now
(COMPCON'97 session). This is a problem with technology transfer into
products and industry. Ken refuses to talk about Unix, that's ancient
history (he'd rather talk about getting a ride in a Mig-31, and he's right).

The EXACT same thing was said about the Ada programming language and
programming language development, PLs and OSes only differ in
developments in that OSes have a longer period to change than PLs.
Ref: "The Next 700 Programming Languages."

That NLTSS ran a year or so, that's documented by the guys who worked on it.

I think the future of OSes is currently in the hands of a 41 year old guy
up in Redmond, WA, regardless of what Linus does with Linux, Stallman does
with Gnu, etc. I'm myself not clear on all the implications of this.
I think this, however: if I were a CS prof teaching operating systems,
I'd be worried. The issues aren't ones of technical merit. It's an
issue of who drives the computer industry? PC customers think they do.
And they do (along with BG). I wonder what that's going to mean to
SGI&&CRI and other firms (IBM)? Where are MVS programmers going to come from?
Do the university departments who produce the students want to start
migrating over to NT? Source code?
Who fixes the FAA air traffic control system when its software breaks?
Who fixes the IRS Unisys systems?
I wonder if the petaflops machine is going to be PC compatible.
I wonder if the petaflops machine will run NT.

In the beginning was Bill. And Bill looked upon the world and said,
"It's okay with me." And Bill can say and do nothing wrong.
And Bill has the world by the balls. But he's no dummy; he is aware
of the faults of his computer predecessors before him (Watson: How many
machines?). He doesn't want to kill a goose which lays golden eggs.
He will sell no feature before its time.

Our society is not currently capable of comprehending bugs. It has a
rough idea, but a very crude one. I expect planes to fall out of the sky.
I really do. The problem is that those planes kill relatively innocent
people, not software developers (that would be evolution by
natural selection). We are currently more interested in making money
and staying on track than attempting to do the right job.

Remember the title of Andy Grove's book:
Only the Paranoid Survive
Bill is paranoid. The question is: how paranoid is the reader?

On hacking, here's a quote:

"All of us knew we were looking at something important, but I'm amazed
at how fast the knowledge has spread out into the general population.
As Xerox and IBM move to package everything, I wonder if there's still
going to be room for the hackers. That's where this country is way
ahead of the Japanese. They have such a highly structured society that
they don't have any place in it for misfit hackers. But these are
the people coming up with the truly creative ideas. This is where you
get the real advances."
--Jim Crutchfield

Dan Hildebrand

Apr 23, 1997, 3:00:00 AM4/23/97

to

In article <handleym-230...@handma.apple.com>,
Maynard Handley <hand...@apple.com> wrote:
>
>On NT. I think different things are being conflated here. NT is an OS,
>sure, but for many purposes what matters is not the OS but the API. If the
>API presented is Win32, whether what's under it is NT, CE or 95 does not
>matter. In particular, if what is under it is GNU Hurd running a Win32
>server, for many purposes that should not matter. As far as academia is
>concerned, I can't see much of a problem. All that changes is the API
>shifts from Posix to Win32 which, not being much of a Posix fan, I'm
>certainly not going to cry over.

Just as POSIX codified the UNIX API, allowing that API to be implemented on
architectures other than the traditional UNIX monolithic kernel, and to
assume behavioral attributes other than that of traditional UNIX, the Win32
API can also be implemented on OS architectures other than Win95 or WinNT.
I wrote a paper a while back arguing this case regarding implementing the
Win32 API on a realtime OS vs. implementing realtime as extensions to NT
(http://www.qnx.com/whitepapers/qnxwin32.html).
--
Dan Hildebrand (da...@qnx.com) QNX Software Systems, Ltd.
http://www.qnx.com/~danh 175 Terence Matthews
phone: +1 (613) 591-0931 Kanata, Ontario, Canada
fax: +1 (613) 591-3579 K2M 1W8

Hugh LaMaster

Apr 23, 1997, 3:00:00 AM4/23/97

to

Maynard Handley wrote:
>
> In article <5jjfhe$5...@darkstar.ucsc.edu>, eug...@cse.ucsc.edu (Eugene
> Miya) wrote:
>
> > Eugene Miya wrote:
> > >> One of the BTL guys noted that Unix probably set OS research back ten
> > >> years or so.

It is interesting to compare what features Multics, TSS, VMS,
and LTSS/CTSS had, relative to [any publicly available] Unix,
on, say, January 1, 1984.

Besides Unix lacking things like (misnomer) "SMP", high(er)
performance file systems, robust scheduling/paging/swapping, etc.,
there were also things related to security and to reliability -
things not always appreciated at the time in the OS and CS research
world, until security and reliability became "research" topics.
At the time, the Unix subculture was very antagonistic
to things perceived as being oriented towards MIS/
data-processing, even when they were legitimate single-user
requirements which were of great benefit to small systems.

On the whole, though, I'm not sure Unix hurt research.
While Unix was not then (and was not intended to be)
a bleeding-edge research OS, or, a big-iron OS, source
code to 4.x BSD was available. That helped many aspects
of OS research, because of the relative portability of Unix,
and the familiarity at many different universities and labs.
Many projects were able to take 4.3 BSD Unix and hack on
certain pieces. Hack some system calls into the kernel
to support whatever they were doing. So, I think "ten years"
is too general. I think the triumph of Unix did somewhat
damage scalability, performance, reliability, etc.,
both in the commercial marketplace and in research,
but I think it speeded progress in many other areas.

> > >> NLTSS ran on one of the LLNL Crays for a year or so.
> > >> We only learn by trying and hacking (trial and error).

It would be interesting to hear if any important OS lessons were
learned from LTSS/CTSS/NLTSS. I used an early version, and liked
the way the system facilitated debugging.

> On UNIX. I question that the problem was purely UNIX qua Operating System.
> I think the problem was more a set of attitudes that came along with UNIX
> of basically refusing to believe that anything could possibly be wrong
> with UNIX (even though the people involved knew almost nothing of any
> other OS). I suspect this was more the result of social factors like many
> people learning UNIX as their first OS, having a chance to mess with the
> source code thus feeling personally connected to it, etc. As such I think
> the problem was more that UNIX became pretty much the only OS that
> mattered in universities.

I think this is a problem much bigger than Unix. Unix just
happened to be the instrument in this case.

Unix, VMS, MVS, DOS, Windows, the Mac OS, and many other OSes,
have all been the subject of religious cults. In another
posting, Eugene mentioned Alan Kay. He had some interesting
thoughts on this process: "stories", vs "logical arguments", and
"systems dynamics". See:

http://www.research.apple.com/people/features/congress_speech.html

It is difficult to think about problems in a fundamental way,
and education, even (undergraduate-level) engineering and science
education, doesn't normally *require* students to develop much
facility or experience in such thinking.

> > I think the future of OSes is currently in the hands of a 41 year old guy
> > up in Redmond, WA, regardless of what Linus does with Linux, Stallman does
> > with Gnu, etc. I'm myself not clear on all the implications of this.
> > I think this, however: if I were a CS prof teaching operating systems,
> > I'd be worried. The issues aren't ones of technical merit. It's an
> > issue of who drives the computer industry?

At risk of sounding totally square, I have to object that this
is only a CS prof permitting his students to learn stories
instead of ideas, and, in that case, such a prof isn't teaching
computer *science* or *engineering*, but, (merely) vocational
training. Drafting, instead of architecture. Sure, anybody
can draw up plans, and, anybody can install the latest vendor OS
according to the recipe. [Not that I have anything against
vocational training. Everyone should have some, even scientists.]
But [Kay] learning stories is not the same thing as learning
to logically analyze ideas or complex systems. A CS [or any
other _______ *science*] prof, should not waste time on unnecessary
vocational education. There are thousands of private and corporate
training programs available where one can learn the latest
hot commercial topics at any time.

> > PC customers think they do.
> > And they do (along with BG). I wonder what that's going to mean to
> > SGI&&CRI and other firms (IBM)? Where are MVS programmers going to come from?
> > Do the university departments who produce the students want to start
> > migrating over to NT?

Why would anyone who is not a professional NT programmer
need to learn NT internals?

> > Source code?
> > Who fixes the FAA air traffic control system when its software breaks?
> > Who fixes the IRS Unisys systems?

Seriously, even on the training side, I wasn't aware that there
was a shortage of people trained in non-Microsoft non-Unix OSes.
If anything, there seems to be a surplus.

> > I wonder if the petaflops machine is going to be PC compatible.
> > I wonder if the petaflops machine will run NT.

>

> On NT. I think different things are being conflated here. NT is an OS,
> sure, but for many purposes what matters is not the OS but the API. If the
> API presented is Win32, whether what's under it is NT, CE or 95 does not
> matter.

True. And, there are many wonderful consumer-level programs
now being written to the Win32 API.

> In particular, if what is under it is GNU Hurd running a Win32
> server, for many purposes that should not matter. As far as academia is
> concerned, I can't see much of a problem. All that changes is the API
> shifts from Posix to Win32 which, not being much of a Posix fan, I'm
> certainly not going to cry over.

I'm afraid this seems to be an example of what Eugene was
afraid of. "Just use Win32. What is wrong with it?" BTW,
there have been some thoughtful comparisons of, for example,
OpenGL and Direct3D. A competitive marketplace of ideas
usually produces better solutions than a monopoly on ideas.

> As regards planes falling out the sky, all I can say is I think you are
> being rather unfair to airplane engineers who do a pretty damn good job of
> building safe machines, in spite of continual whining from a know-nothing
> press, willing to second-guess the cause of any problem 30 seconds after a
> plane goes down.

Recent statistics seem to indicate that modern-design large
jet transports offer unprecedented safety per mile of travel.
But, speaking generally, in recent years there have been
many examples of people who should have known better trusting
software far beyond what they should have. But, this brings
up a huge discussion about software safety which should take
place in a different newsgroup.

she...@pluto.njcc.com

Apr 23, 1997, 3:00:00 AM4/23/97

to

<snip>

> It would be interesting to hear if any important OS lessons were
> learned from LTSS/CTSS/NLTSS. I used an early version, and liked

> the way the system facilitated debugging.<snip>

drumroll, please.
what does LTSS/CTSS/NLTSS stand for? (all one thing?)
would anyone care to mention what the relative
flaws and merits of it (each) are?
(or, if not, some pointer(s) to where such information might reside.)
H.L., would you care to elaborate on the debugging features?
H.L., any other (dis)likes about the system(s)? anyone else?

...;;;;::::...

Maynard Handley

Apr 23, 1997, 3:00:00 AM4/23/97

to

In article <5jjfhe$5...@darkstar.ucsc.edu>, eug...@cse.ucsc.edu (Eugene
Miya) wrote:

> Eugene Miya wrote:
> >> One of the BTL guys noted that Unix probably set OS research back ten

> >> years or so. NLTSS ran on one of the LLNL Crays for a year or so.

> >> We only learn by trying and hacking (trial and error).

> On OS research:

> Comp.os.research is a slightly better group for that discussion (Darrell
> moderates it here from this same host). Thompson, Ritchie, Pike, and the
> other Labs people have gone on record saying the above. They don't do
> Unix anymore, they did Plan 9, and they are doing Inferno now
> (COMPCON'97 session). This is a problem with technology transfer into
> products and industry. Ken refuses to talk about Unix, that's ancient
> history (he'd rather talk about getting a ride in a Mig-31, and he's right).
>

On UNIX. I question that the problem was purely UNIX qua Operating System.
I think the problem was more a set of attitudes that came along with UNIX
of basically refusing to believe that anything could possibly be wrong
with UNIX (even though the people involved knew almost nothing of any
other OS). I suspect this was more the result of social factors like many
people learning UNIX as their first OS, having a chance to mess with the
source code thus feeling personally connected to it, etc. As such I think
the problem was more that UNIX became pretty much the only OS that
mattered in universities.

> I think the future of OSes is currently in the hands of a 41 year old guy

> up in Redmond, WA, regardless of what Linus does with Linux, Stallman does
> with Gnu, etc. I'm myself not clear on all the implications of this.
> I think this, however: if I were a CS prof teaching operating systems,
> I'd be worried. The issues aren't ones of technical merit. It's an

> issue of who drives the computer industry? PC customers think they do.

> And they do (along with BG). I wonder what that's going to mean to
> SGI&&CRI and other firms (IBM)? Where are MVS programmers going to come from?
> Do the university departments who produce the students want to start

> migrating over to NT? Source code?

> Who fixes the FAA air traffic control system when its software breaks?
> Who fixes the IRS Unisys systems?

> I wonder if the petaflops machine is going to be PC compatible.
> I wonder if the petaflops machine will run NT.

On NT. I think different things are being conflated here. NT is an OS,
sure, but for many purposes what matters is not the OS but the API. If the
API presented is Win32, whether what's under it is NT, CE or 95 does not
matter. In particular, if what is under it is GNU Hurd running a Win32
server, for many purposes that should not matter. As far as academia is
concerned, I can't see much of a problem. All that changes is the API
shifts from Posix to Win32 which, not being much of a Posix fan, I'm
certainly not going to cry over.

As regards planes falling out the sky, all I can say is I think you are
being rather unfair to airplane engineers who do a pretty damn good job of
building safe machines, in spite of continual whining from a know-nothing
press, willing to second-guess the cause of any problem 30 seconds after a
plane goes down.

Maynard

--
My opinion only

Brian Forney

Apr 23, 1997, 3:00:00 AM4/23/97

to

In article <335EAE...@nas.nasa.gov>, Hugh LaMaster
<lama...@nas.nasa.gov> wrote:

> Maynard Handley wrote:
> >
> > In article <5jjfhe$5...@darkstar.ucsc.edu>, eug...@cse.ucsc.edu (Eugene
> > Miya) wrote:
> >
> > > Eugene Miya wrote:

[snip, snip]

> > > I think the future of OSes is currently in the hands of a 41 year old guy
> > > up in Redmond, WA, regardless of what Linus does with Linux, Stallman does
> > > with Gnu, etc. I'm myself not clear on all the implications of this.
> > > I think this, however: if I were a CS prof teaching operating systems,
> > > I'd be worried. The issues aren't ones of technical merit. It's an
> > > issue of who drives the computer industry?
>
> At risk of sounding totally square, I have to object that this
> is only a CS prof permitting his students to learn stories
> instead of ideas, and, in that case, such a prof isn't teaching
> computer *science* or *engineering*, but, (merely) vocational
> training. Drafting, instead of architecture. Sure, anybody
> can draw up plans, and, anybody can install the latest vendor OS
> according to the recipe. [Not that I have anything against
> vocational training. Everyone should have some, even scientists.]
> But [Kay] learning stories is not the same thing as learning
> to logically analyze ideas or complex systems. A CS [or any
> other _______ *science*] prof, should not waste time on unnecessary
> vocational education. There are thousands of private and corporate
> training programs available where one can learn the latest
> hot commercial topics at any time.

Unfortunately, university professors seem to think this is an important
topic. I know professors from my undergraduate institution are debating
this topic.

I say teach students the concepts and the fundamentals. Examples from real
world software and hardware systems can be used, but universities are not
votech schools. Maybe assignments in an OS class will rely upon a certain
OS, but don't force one and only one OS's paradigms on students. If you
need freely available source to teach students, then do so. If you need
the most advanced OS to illustrate a difficult problem, then do so.

Often times UNIX variants happen to be the best for this because there are
so many flavors and free versions exist. (How would I make a class add
real-time support to, say, the MacOS? Apple doesn't even license source,
and if they did, how would the average university pay for it?)

Brian
--------------
Brian Forney email: bfo...@cray.com
Cray Research, Inc., a Silicon Graphics Company voice: 612.683.5689
655F Lone Oak Dr., Eagan, MN 55121

Dan Hildebrand

Apr 24, 1997, 3:00:00 AM4/24/97

to

In article <5jm9qm$9...@qnx.com>, Dan Hildebrand <da...@qnx.com> wrote:
>(http://www.qnx.com/whitepapers/qnxwin32.html).

After many questions regarding the inaccessibility of the paper, I owe an
apology for my lack of proofreading. The paper's at:

http://www.qnx.com/whitepaper/qnxwin32.html

Stephen O Gombosi

Apr 24, 1997, 3:00:00 AM4/24/97

to

In article <335EC2...@pluto.njcc.com>, <she...@pluto.njcc.com> wrote:

>what does LTSS/CTSS/NLTSS stand for? (all one thing?)

Others know much better than I, but:

LTSS - Livermore Time Sharing System
CTSS - Cray Time Sharing System (port of LTSS to Crays)
NLTSS - New Livermore Time..you get the idea. A new OS based on the lessons
learned from LTSS/CTSS.

LTSS was (arguably) the first interactive system for large-scale scientific
computers. It was written primarily in CIVIC (FORTRAN with various extensions).
LTSS was also ported to the CDC STAR machines (it was eventually adopted by
CDC as STAR-OS, which begat VSOS). It's the earliest example that I can think
of of a portable OS, written in a higher-level language. Personally, I
hated it - but that's strictly individual taste (I cut my teeth on CDC Kronos).

CTSS was the OS originally chosen by CRI for the Cray-2, a decision soon
abandoned in favor of Unix (due in large part to Gene and his cronies at Ames).
One could argue that this was a serious marketing error on Cray's part, since
it narrowed the conceptual gap between workstations and supercomputers.

>would anyone care to mention what the relative
>flaws and merits of it (each) are?

Flaws: flat file system. No pipes. Hard limit on number of processes for
a given login, with a process "chain", rather than a tree as in
Unices.

Many of these problems were actually fixed in later versions (there
was HFS code and POSIX-compliant process management code developed)

Many places in the OS (especially file
buffers in various commands) were still coded as though the OS was
going to run on a 7600 (with 65 kwords of SCM).

Good things: DROP FILES!!!!!! This has a lot to do with the debugging features
Hugh liked so much. Sort of checkpoint/restart, swap, and debugging
rolled into one. One of the chief shortcomings of many Unices (with the
exception of Unicos and its descendants (CSOS)) as far as large-scale
computing is concerned is the failure to make provisions for *really*
long-running processes (i.e., processes whose execution times routinely
exceed the MTBF of the system). Expecting the user to rerun such jobs
from the beginning doesn't cut it. This is just part of the Unix
"batch? what's that?" syndrome.

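The restart half of the drop-file idea can be imitated at the application
level. What follows is only a minimal sketch under stated assumptions, not
how LTSS/CTSS implemented drop files (those were an OS facility): the job
saves its own state after every step and picks up from the saved state on
the next run. The names run_job, checkpoint_path, and fail_at are
hypothetical, and fail_at exists only to simulate a crash.

```python
import json
import os


def run_job(total_steps, checkpoint_path, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    Hypothetical illustration: 'fail_at' simulates the system going down
    just before the given step is executed.
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write never leaves a corrupt checkpoint.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)
    return state["total"]
```

Calling run_job(10, path, fail_at=6) and then run_job(10, path) finishes
the sum from step 6 rather than from step 0, which is the property a
long-running job needs once its run time exceeds the MTBF.
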
>(or, if not, some pointer(s) to where such information might reside.)
>H.L., would you care to elaborate on the debugging features?
>H.L., any other (dis)likes about the system(s)? anyone else?

I never could figure out what to do in response to the message "Outstanding
Fatal Error". Should I try to fix it, or just give it a standing ovation?

Steve

Andrew Valencia

Apr 27, 1997, 3:00:00 AM4/27/97

to

In <5jm9qm$9...@qnx.com> da...@qnx.com (Dan Hildebrand) writes:

>...

>I wrote a paper a while back arguing this case regarding implementing the
>Win32 API on a realtime OS vs. implementing realtime as extensions to NT
>(http://www.qnx.com/whitepapers/qnxwin32.html).
                     ^^^^^^^^^^^

This appears to be a typo on Dan's part; the directory is "whitepaper":

http://www.qnx.com/whitepaper/qnxwin32.html

Regards,
Andy Valencia

Victor Yodaiken

Apr 27, 1997, 3:00:00 AM4/27/97

to

In article <5jjfhe$5...@darkstar.ucsc.edu>,

Eugene Miya <eug...@cse.ucsc.edu> wrote:
>I think the future of OSes is currently in the hands of a 41 year old guy
>up in Redmond, WA, regardless of what Linus does with Linux, Stallman does

Like to bet?

>with Gnu, etc. I'm myself not clear on all the implications of this.
>I think this, however: if I were a CS prof teaching operating systems,
>I'd be worried. The issues aren't ones of technical merit. It's an

Why should I worry? And why is it news that technical merit does not
determine success in the engineering "marketplace" (for lack of a better
term)?

>Where are MVS programmers going to come from?

Retirement.

>Do the university departments who produce the students want to start
>migrating over to NT? Source code?

Why should they?

>Who fixes the FAA air traffic control system when its software breaks?
>Who fixes the IRS Unisys systems?

Well, if we start teaching only how to use the current market leader
instead of teaching fundamental skills, nobody.

>I wonder if the petaflops machine will run NT.

Not at petaflops.

>In the beginning was Bill. And Bill looked upon the world and said,

People used to talk this way about IBM.

>Our society is not currently capable of comprehending bugs. It has a
>rough idea, but a very crude one. I expect planes to fall out of the sky.
>I really do. The problem is that those planes kill relatively innocent

They already do. What happened to the first commercial jet liners?

>people, not software developers (that would be evolution by
>natural selection). We are currently more interested in making money
>and staying on track than attempting to do the right job.

Which is why we should long for the days when the B6700 and
CDC machines blew away IBM on pure technical merit?

Eugene Miya

May 9, 1997, 3:00:00 AM5/9/97

to

Just a quickie, I'll get back to this thread when I have time from travelling.

In article <335EAE...@nas.nasa.gov>,
Hugh LaMaster <lama...@nas.nasa.gov> wrote:

>It is interesting to compare what features Multics, TSS, VMS,
                                   ^^^^^^^^
>and LTSS/CTSS had, ...
>
>Besides Unix lacking things like ...
              ^^^^^^^

HUGH! After all these years, you still don't get it! 8^)

The cheapest, fastest, and most reliable components of a computer
are those that aren't there.
--Gordon Bell

Thank god that Gwyn works here (Ames) now. This means that one of these
days you have a real chance to run into Gordon. Like maybe June 24 (she
tells me, at the programming the ILLIAC IV history talk wherever she
schedules that [she says she has two former programmers? You and Ken?]).
Gordon has learned, Brooks has learned.

The issue has NEVER been what features those above systems had.
It was the REMOVAL of features that allowed the one to succeed and finds the
rest winnowing away. It's also why RISC architectures became popular:
simple, minimal.

Get rid of creeping featuritis.

>At the time, the Unix subculture was very antagonistic
>to things perceived as being oriented towards MIS/
>data-processing, even when they were legitimate single-user
>requirements which were of great benefit to small systems.

The MIS/DP community would do well to consider slightly different ways
of doing things. Me, antagonistic? Why no. Let COBOL exist.
Let me get my dinosaur shirt out: oops, needs to get washed.

Three types of U*x users existed: The Type 0: users of devices like
phones: completely transparent. Type 1: those who benefit from OS
portability: same shells or tools or environments across heterogeneous
platforms. And Type 2: those who adopt a software tools U*x philosophy.

Research: Better to review Amoeba, Plan 9, and Inferno. NT: maybe no choice.

I'll get back to this thread when I have time.

Knuth has shown us here how to program intelligibly, but not wisely.
I buy the discipline. I do not buy the result. He has fashioned a
sort of industrial-strength Faberge egg--intricate, wonderfully worked,
refined beyond all ordinary desires, a museum piece from the start.
--M. D. McIlroy
CACM
%T A Literate Program
%V 29
%N 6
%D June 1986
%P 471-483

Greg Pfister

May 9, 1997, 3:00:00 AM5/9/97

to

Eugene Miya wrote:
>
> Just a quickie, I'll get back to this thread when I have time from travelling.
>
> In article <335EAE...@nas.nasa.gov>,
> Hugh LaMaster <lama...@nas.nasa.gov> wrote:

> >It is interesting to compare what features Multics, TSS, VMS,
>                                    ^^^^^^^^
> >and LTSS/CTSS had, ...
> >
> >Besides Unix lacking things like ...
>               ^^^^^^^
>
> HUGH! After all these years, you still don't get it! 8^)
>
> The cheapest, fastest, and most reliable components of a computer
> are those that aren't there.
> --Gordon Bell

[snip]

> The issue has NEVER been what features those above systems had.
> It was the REMOVAL of features that allowed the one to succeed and finds the
> rest winnowing away. It's also why RISC architectures became popular:
> simple, minimal.
>
> Get rid of creeping featuritis.

[snip]

> Three types of U*x users existed: The Type 0: users of devices like
> phones: completely transparent. Type 1: those who benefit from OS
> portability: same shells or tools or environments across heterogeneous
> platforms. And Type 2: those who adopt a software tools U*x philosophy.

Eugene, you're espousing a direction that is in some senses defensible,
but I think wrong. But I'm not saying complicated is good. Read
further.

What do you do with the users who need those features? For example,
take policy-driven load balancing (in MVS, not *ix): User states policy,
like jobs of class A get so much of system until use Z minutes, jobs of
type B get <etc.>, and the system figures it out given job classes.
This is more than just easier to use than *ix "nice," it's required for
shops that run complex job mixes. Major feature. Major complexity.
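
As a rough illustration of what "user states policy, system figures it
out" might look like, here is a toy sketch. It is not the MVS mechanism,
nor the Computer Associates product; the policy format ("share",
"budget") and the function name pick_next are assumptions invented for
this example. Each class has a target share of the machine and a budget
in minutes; the dispatcher favors whichever class is furthest below its
share, and passes over classes that have used up their budget.

```python
def pick_next(policy, used_minutes):
    """Return the job class that should run next.

    policy:       {class_name: {"share": fraction, "budget": minutes}}
    used_minutes: {class_name: minutes consumed so far}

    Hypothetical rule: among classes still within budget, pick the one
    furthest below its target share. If every class is over budget,
    fall back to considering all of them.
    """
    total_used = sum(used_minutes.get(c, 0.0) for c in policy)

    def deficit(c):
        # Target share minus the share actually received so far.
        actual = used_minutes.get(c, 0.0) / total_used if total_used else 0.0
        return policy[c]["share"] - actual

    within_budget = [
        c for c in policy if used_minutes.get(c, 0.0) < policy[c]["budget"]
    ]
    candidates = within_budget or list(policy)
    return max(candidates, key=deficit)
```

With a policy of class A at 75% (60-minute budget) and class B at 25%,
equal usage so far selects A; once A has consumed its 60 minutes, B is
selected instead. "nice" has no way to express either rule.
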

Aha, guess what, that's in *ix! Through a third party: Computer
Associates.

So, is a *ix system with CA's scheduling stuff added more or less
complex than a system with it built in? Well, both. It's probably not
as integrated with other tools, and can't do some things as well as code
with more access to the inner system guts, but on the other hand by
structurally separating it from the kernel it is undoubtedly simpler to
implement and maintain.

Other example: Compare the complexity of a CISC implementation to, say,
the out-of-order superscalar high clock rate RISC Alpha 21264. Which is
more complicated? Fact is, microarchitectural complexity has now far
outstripped complexity arising just from weirdly-placed bits in an
instruction set. But does the simplicity of RISC make that
implementation easier to manage? Assuredly. That's why new CISCs
(e.g., Intel Pentium Pro) translate into an internal sort-of-RISC and
pull the tricks on that.

My point is that the issue is NOT total number of features. The issue
IS simplicity OF STRUCTURE. It is quite possible to have an immensely
complex system that is comprehensible because it has a straightforward
structure. The problem with MVS (if, my local colleagues would say,
there actually is a problem...) isn't complexity as such. It's rather
that (like recent word processors) it's had features grafted onto a
structure that is no longer able to bear their weight. To get things
comprehensible again it was necessary to start again with a sufficient
basic structure -- like that of the original Unix.

Greg Pfister

Eugene Miya

May 13, 1997, 3:00:00 AM5/13/97

to

In article <handleym-230...@handma.apple.com>,
Maynard Handley <hand...@apple.com> wrote:

>On UNIX. I question that the problem was purely UNIX qua Operating System.
>I think the problem was more a set of attitudes that came along with UNIX
>of basically refusing to believe that anything could possibly be wrong
>with UNIX (even though the people involved knew almost nothing of any
>other OS). I suspect this was more the result of social factors like many
>people learning UNIX as their first OS, having a chance to mess with the
>source code thus feeling personally connected to it, etc. As such I think
>the problem was more that UNIX became pretty much the only OS that
>mattered in universities.

I agree with part of this. The social factors (i.e. religious wars) are
clearly silly. I think the knowledgeable people have moved on. I think
the qualification comes when you say "universities:" you really mean CS
Depts., and at certain universities at that. And later funded projects
got tied to it. However, in their defense, some of the better software
people involved contributed to the development of important applications.

>what matters is not the OS but the API.

I agree to a point. The problem is that people write to OSes, and APIs
are still a little vague.
The problem is not the syntax of course, but the semantics. I had a
discussion about this over the weekend: yes things are improving.

>As far as academia is
>concerned, I can't see much of a problem. All that changes is the API
>shifts from Posix to Win32 which, not being much of a Posix fan, I'm
>certainly not going to cry over.

I am not a Posix fan either, but the problem is why CS Depts. are going
to teach CS students.

>As regards planes falling out the sky, all I can say is I think you are
>being rather unfair to airplane engineers who do a pretty damn good job of
>building safe machines, in spite of continual whining from a know-nothing
>press, willing to second-guess the cause of any problem 30 seconds after a
>plane goes down.

I trust plane designers (I used to be one myself) because planes are
designed from blood (however, I ACK this is becoming less so).
The problem comes with the introduction of a new technology which is
still in its infancy. Few coders write code which
they would knowingly trust their lives. But I didn't refer to planes, I
meant our aging air traffic control system. I have a climbing partner
who was the chief programmer at BART and he doesn't ride his own trains
(but not because of code quality).

Our society regards simulation as a subtle joke. We are getting better,
but we have a ways to go.

Weinberg's Second Law:
If builders built buildings the way programmers wrote programs,
then the first woodpecker that came along would destroy civilization.

A quote from Safeware (Leveson):
My company has had a safety program for 150 years. The program was instituted
as a result of a French law requiring an explosives manufacturer to live on
the premises with his family.
--Crawford Greenewalt, Former President of DuPont.

Steven Correll

May 14, 1997, 3:00:00 AM5/14/97

to

I think operating system APIs matter, because we tend to imitate the
environment we program in. For example, if your environment teaches
you to expect that you can use open/read/write/close system calls on
any device, be it the console or the modem port or a disk file or a
TCP/IP socket, you have internalized an important lesson about
architectural design, far more concrete than if it's obscured by
terms like "object orientation" and "polymorphism".
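
A minimal sketch of the point above, assuming a POSIX system where pipes, sockets, and disk files are all plain file descriptors. The names copy_fd and demo are made up for illustration; the only system interfaces used are read, write, lseek, and close.

```python
import os
import socket
import tempfile


def copy_fd(src_fd, dst_fd, chunk=4096):
    """Copy bytes from src_fd to dst_fd until EOF, using only read/write."""
    total = 0
    while True:
        data = os.read(src_fd, chunk)
        if not data:
            return total
        offset = 0
        while offset < len(data):
            # write may be partial, so keep going until the chunk is out
            offset += os.write(dst_fd, data[offset:])
        total += len(data)


def demo():
    """Feed a pipe and a socket through the same routine into a disk file."""
    r, w = os.pipe()
    os.write(w, b"via pipe\n")
    os.close(w)

    a, b = socket.socketpair()
    a.sendall(b"via socket\n")
    a.close()

    with tempfile.TemporaryFile() as out:
        fd = out.fileno()
        copy_fd(r, fd)           # source is a pipe
        copy_fd(b.fileno(), fd)  # source is a socket, same code path
        os.close(r)
        b.close()
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 100)
```

Nothing in copy_fd knows what kind of object sits behind either descriptor, which is the design lesson being described.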

If your environment requires you to use a special API for the console
versus a disk file, or requires you to decide whether you're writing to
a socket rather than a disk file so you can substitute "send/recv" for
"write/read", or if it delivers OS error message strings in a fashion
that's easy to present in a GUI dialog box but inconvenient to format
within a log file on disk, your programming will be doubly handicapped:
first because your own designs will be brittle and inflexible if you
imitate the example set by the environment, and second because even if
you wish to overcome the limitation of the environment, you will be
hard-pressed to anticipate all of the special cases and write all of
the extra code required to accommodate them.

Unix in 1997 is certainly no paragon; but IMHO it seems that the
MS-Windows world has lost many of the good ideas in Unix and is busy
repeating the mistakes of earlier generations.
--
Steven Correll == PO Box 66625, Scotts Valley, CA 95067 == s...@netcom.com

Hugh LaMaster

May 14, 1997, 3:00:00 AM5/14/97

to

Greg Pfister wrote:
>
> Eugene Miya wrote:
> >
> > Just a quickie, I'll get back to this thread when I have time from travelling.
> >
> > In article <335EAE...@nas.nasa.gov>,
> > Hugh LaMaster <lama...@nas.nasa.gov> wrote:

> > >It is interesting to compare what features Multics, TSS, VMS,

> > >and LTSS/CTSS had, ... ^^^^^^^^
> > >
> > >Besides Unix lacking things like ...
> > ^^^^^^^
> >
> > HUGH! After all these years, you still don't get it! 8^)

I never said I wasn't incorrigible. :-)
Yes, you are right, in some ways I still don't get it.
But, not in quite the same way that you suggest.

> > The cheapest, fastest, and most reliable components of a computer
> > are those that aren't there.
> > --Gordon Bell
> [snip]
> > The issue has NEVER been what features those above systems had.
> > It was the REMOVAL of features that allowed the one to succeed and finds the
> > rest winnowing away. It's also why RISC architectures became popular:
> > simple, minimal.
> >
> > Get rid of creeping featuritis.

My approach to architecture and operating systems is the same
as my approach to backpacking (see Colin Fletcher's book)
[inexact quotation]:

If you need it, take it along.
Pare away relentlessly at weight.

I have never met anyone who thought an instruction set
architecture, implementation, or, an operating system kernel
needed lots of *the other guy's features*. The trick is to find
clean, simple, unifying ideas that can accommodate both you
and the other guy at the same time.

For example, one user wants something like a "fair"-share
scheduler (which I happen to like, but also like to call an
"unfair share scheduler"), another user needs to run really
big memory jobs efficiently, another user needs really good
interactivity on small jobs, another user needs time-critical
[soft realtime] scheduling of smallish jobs. If one assumed
that each of these features was totally separate, it might
seem legitimate to complain about creeping featurism, but,
if it turned out a clean, simple, scheduler with a unifying
principle existed [it does] to handle all of these tasks,
then the added *features* might actually be combined to result
in lowered complexity, rather than increased complexity.
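
The post does not say what the unifying principle is. One commonly described candidate, offered here only as an assumption, is share-based dispatch: always run the job whose decayed usage is lowest relative to its entitled share. The Job class, pick_next, run, and the decay constant are all hypothetical names and values for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    share: float        # relative entitlement (policy input)
    usage: float = 0.0  # decayed record of CPU ticks consumed


def pick_next(jobs):
    """Run whoever is furthest below their entitlement."""
    return min(jobs, key=lambda j: j.usage / j.share)


def run(jobs, ticks, decay=0.9):
    """Simulate dispatching for a number of ticks; return ticks per job."""
    counts = {j.name: 0 for j in jobs}
    for _ in range(ticks):
        job = pick_next(jobs)
        job.usage += 1.0
        counts[job.name] += 1
        for j in jobs:
            j.usage *= decay  # old usage is gradually forgiven
    return counts
```

With shares of 3 and 1 the long-run split comes out near 3:1. Interactive, batch, and time-critical classes would differ only in their share and decay parameters, not in mechanism.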

> [snip]
> > Three types of U*x users existed: The Type 0: users of devices like
> > phones: completely transparent. Type 1: those who benefit from OS
> > portability: same shells or tools or environments across heterogeneous
> > platforms. And Type 2: those who adopt a software tools U*x philosophy.
>
> Eugene, you're espousing a direction that is in some senses defensible,
> but I think wrong. But I'm not saying complicated is good. Read
> further.

I think Greg Pfister and I are in complete agreement.
[I can't say whether or not he agrees, however. :-) ].

>
> What do you do with the users who need those features? For example,
> take policy-driven load balancing (in MVS, not *ix):

Excellent example.

The first policy-driven share scheduler that I personally
saw was running at Ames on VMS back in 1979 or so. Similar
Unix schedulers first appeared at about the same time
[I don't have citations, but Lauder had the first Unix
share-scheduler going in Australia at about the same time].
FSS's have been in production use for, what, about a decade
or so? [I forget the date on the Bell labs FSS.] Unicos
has had it on the Cray for a long time. But in the early days,
it was difficult for certain parts of the Unix community to look
at policy-driven scheduling as a "necessary feature", mainly
because they seemed to view it as an "accounting feature",
rather than as an *alternative* to fictitious bean-counting
of CPU cycles and other "consumed" [sic] resources. [The same
resource allocation problem exists in networking, and if the
internet community doesn't get more tuned in to the alternatives,
pretty soon we are going to start getting charged per byte
transmitted, and the cost of transmitting the byte will go up 10X
in order to pay for the accounting. There are options that
make more fundamental economic sense.]

> User states policy,
> like jobs of class A get so much of system until use Z minutes, jobs of
> type B get <etc.>, and the system figures it out given job classes.
> This is more than just easier to use than *ix "nice," it's required for
> shops that run complex job mixes. Major feature. Major complexity.
>
> Aha, guess what, that's in *ix! Through a third party: Computer
> Associates.
>
> So, is a *ix system with CA's scheduling stuff added more or less
> complex than a system with it built in? Well, both. It's probably not
> as integrated with other tools, and can't do some things as well as code
> with more access to the inner system guts, but on the other hand by
> structurally separating it from the kernel it is undoubtedly simpler to
> implement and maintain.

I don't know what CA's product does, but there are problems
with separating kernel priority-based schedulers from
external policy-based schedulers. First, the problem
of scheduling is most acute when you have large memory
jobs. But, if the CPU scheduler can only see process
priorities and is blind to memory allocation, it breaks
down when you most need it: e.g. when you have two or
more very large-memory jobs running. But, that doesn't mean
that lots of junk needs to be added to the kernel. It just
means that semi-static process priorities are an insufficient
mechanism for dispatching and for the scheduler/input queue manager
to communicate with the scheduler.
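
A rough sketch of the failure mode described above, under the assumption that each job can declare its memory need. A dispatcher that sees only priorities would run both large jobs at once; one that also sees memory admits only a set that fits. The function admit and the job tuples are hypothetical, not any real kernel interface.

```python
def admit(jobs, total_mem):
    """Choose the resident set.

    jobs: list of (name, priority, mem_needed) tuples.
    Higher priority goes first; a job is admitted only if it still fits.
    """
    resident = []
    free = total_mem
    for name, priority, mem in sorted(jobs, key=lambda j: -j[1]):
        if mem <= free:
            resident.append(name)
            free -= mem
    return resident
```

For example, with 1000 units of memory and two 700-unit jobs of equal priority plus a 100-unit job, only one big job is admitted alongside the small one, rather than both big jobs paging against each other.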

This digression could get extremely long, but, suffice it to say
that this is a perfect example of where many *features* and many
*requirements* can be accommodated by the correct simple but
powerful kernel mechanisms.

> Other example: Compare the complexity of a CISC implementation to, say,
> the out-of-order superscalar high clock rate RISC Alpha 21264. Which is
> more complicated? Fact is, microarchitectural complexity has now far
> outstripped complexity arising just from weirdly-placed bits in an
> instruction set. But does the simplicity of RISC make that
> implementation easier to manage? Assuredly. That's why new CISCs
> (e.g., Intel Pentium Pro) translate into an internal sort-of-RISC and
> pull the tricks on that.

[Aside: see below ***]

>
> My point is that the issue is NOT total number of features. The issue
> IS simplicity OF STRUCTURE. It is quite possible to have an immensely
> complex system that is comprehensible because it has a straightforward
> structure. The problem with MVS (if, my local colleagues would say,
> there actually is a problem...) isn't complexity as such. It's rather
> that (like recent word processors) it's had features grafted onto a
> structure that is no longer able to bear their weight. To get things
> comprehensible again it was necessary to start again with a sufficient
> basic structure -- like that of the original Unix.

The above two paragraphs are succinct.

Unfortunately, "operating system", as perceived by a user,
includes many more megabytes of code than just the kernel.
The traditional approach to solving this complexity problem
is to work hard at cleaning up as the kernel evolves, so
that things are layered properly and large, non-performance
critical programs are handled in separate processes. There
may be other approaches, such as explicit capability-based
systems that divide up the kernel activities into smaller,
secure modules.

------

[***] Since this is also posted to comp.sys.super, the hardware
complexity question is appropriate. It would seem that
superscalar instruction issue is now sophisticated enough
that there is little point in adding vector instructions,
and, if [L1 or, in some cases such as SGI R8k machines, L2]
caches have the same latency and bandwidth as sets of vector
registers, there is little point in vector registers.
Is there any reason to build machines with vector registers
and instruction sets any more? [Including VLIW machines like
the Multiflow?] Assuming that a micro-based superscalar
machine could be built with the same main memory bandwidth
as a big vector machine, would there be any advantage on
vector code to having a vector machine? [Not that anyone
has done this yet. The aggregated stream bandwidth of
all 32 processors of one of the highest-bandwidth machines
recently, the SGI Origin 2k, is still less than the stream
bandwidth of a single Cray T90 CPU. Likewise the 63-processor
Sun UE 10k.]
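
For readers unfamiliar with what "stream bandwidth" counts, here is a sketch of a STREAM-style triad kernel and its byte accounting. It assumes 8-byte words and counts two reads and one write per element. The function names are made up, and the rate a Python list version reports says nothing about the hardware; only the accounting is the point.

```python
import time


def triad(b, c, q):
    """STREAM-style triad: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def triad_bytes(n, word_bytes=8):
    """Bytes counted for one pass: read b, read c, write a."""
    return 3 * word_bytes * n


def measure(n=100_000, q=3.0):
    """Return (sample result, bytes per second) for one triad pass."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = triad(b, c, q)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide by zero
    return a[0], triad_bytes(n) / elapsed
```

A vector machine and a cache-based superscalar machine run the same loop; the comparison in the post is about how many such bytes per second the memory system can sustain.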

Andy Glew

May 15, 1997, 3:00:00 AM5/15/97

to Hugh LaMaster

> For example, one user wants something like a "fair"-share
> scheduler (which I happen to like, but also like to call an
> "unfair share scheduler"), another user needs to run really
> big memory jobs efficiently, another user needs really good
> interactivity on small jobs, another user needs time-critical
> [soft realtime] scheduling of smallish jobs. If one assumed
> that each of these features was totally separate, it might
> seem legitimate to complain about creeping featurism, but,
> if it turned out a clean, simple, scheduler with a unifying
> principle existed [it does] to handle all of these tasks,
> then the added *features* might actually be combined to result
> in lowered complexity, rather than increased complexity.

Hugh:

I have a long standing interest in scheduling.
I love fair share, and I understand you calling it unfair share.

I think it is absolutely bogus that, in these days, people complain
when I send a 2000 page print job to the printer, and think that I
should wait until night - when a reasonable share scheduler could allow
their jobs to pass mine with no problem. (Wasting paper is another
issue, however.)

But, I have never found the clean unifying principle you espouse
that can effectively reconcile disparate scheduling policies.
(Or, rather, the closest that I have come is stochastic scheduling
- adding randomization so that a multigoal optimization policy can
be approximated.)
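
A small sketch of the randomization idea, in the style of what the literature calls lottery scheduling; it is not claimed to be the scheme meant above. Each class of work holds tickets, and the dispatcher draws a winner at random, so several goals are balanced in proportion rather than by strict priority. The names lottery_pick and simulate and the ticket counts are assumptions for illustration.

```python
import random


def lottery_pick(tickets, rng):
    """tickets: dict mapping job name to ticket count. Returns one winner."""
    names = list(tickets)
    weights = [tickets[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(tickets, draws, seed=0):
    """Count how often each job wins over a number of draws."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    wins = {name: 0 for name in tickets}
    for _ in range(draws):
        wins[lottery_pick(tickets, rng)] += 1
    return wins
```

With 3 tickets for interactive work and 1 for batch, interactive wins roughly three quarters of the draws, and batch is never starved outright.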

What is it? Is it the "policy driven scheduler" you describe elsewhere
in your post?

I *thought* that I was familiar with UNICOS's scheduler, but maybe
I have missed something. Or maybe it is just a question of scale -
at some scales the multipolicy scheduler is effective, at others not.

This may seem like a comp.os.research issue, but I think that it is also
a comp.arch issue because modern microarchitectures also find it
desirable to optimize multiple goals: power consumption, instruction
throughput, memory throughput, etc., etc. My experience as an OS
scheduler developer has often been useful to me in understanding CPU
design.

---

Digital accuses me of stealing their inventions.
Can I sue Bob Palmer for slandering me?

Eugene Miya

May 16, 1997, 3:00:00 AM5/16/97

to

Quick note, I will get back to this thread when I get a chance.

I just received a postcard on Microsoft Scalability Day, May 20,
and that Bill Gates will be speaking on the topic of scalability
along with Gordon Bell. Mentions 64-bit, mentions Tandem (that has me
wondering how NT is doing in the fault-tolerant market).
It uses buzz words like "Enterprise" systems.
Bill will be video down linking (this is going to be in other cities?),
Gordon will be in Santa Clara.

Oh well, I'm hosting Nolan Bushnell at one of my meetings at the exact
same time. Ain't life strange?

Eugene Miya

May 16, 1997, 3:00:00 AM5/16/97

to

In article <5julur$g...@newshost.nmt.edu>,

Victor Yodaiken <yoda...@chelm.cs.nmt.edu> wrote:
>In article <5jjfhe$5...@darkstar.ucsc.edu>,
>Eugene Miya <eug...@cse.ucsc.edu> wrote:
>>I think the future of OSes is currently in the hands of a 41 year old guy
>>up in Redmond, WA, regardless of what Linus does with Linux, Stallman does
>
>Like to bet?

Depends on the wager. I only bet when I believe that I have a 90%
chance of winning (i.e. no fun).

>>with Gnu, etc. I'm myself not clear on all the implications of this.
>>I think this, however: if I were a CS prof teaching operating systems,
>>I'd be worried. The issues aren't ones of technical merit. It's an
>
>Why should I worry? And why is it news that technical merit does not
>determine success in the engineering "marketplace" (for lack of a better
>term)?

Why should you worry: where will the OS diversity be?
where's your job?
Technical merit: Some idealists hope and ignore installed bases.
I love a good fight.

>>Where are MVS programmers going to come from?
>
>Retirement.

Perhaps.

>>Do the university departments who produce the students want to start
>>migrating over to NT? Source code?
>
>Why should they?

You would otherwise be producing students of no commercial worth/value.
Students would then have to learn NT internals, etc.

>>Who fixes the FAA air traffic control system when its software breaks?
>>Who fixes the IRS Unisys systems?
>
>Well, if we start teaching only how to use the current market leader
>instead of teaching fundamental skills, nobody.

Good. I want to see that happen.

>>I wonder if the petaflops machine will run NT.
>
>Not at petaflops.

I think quite a few people would agree with you except maybe a few NT people.

>>In the beginning was Bill. And Bill looked upon the world and said,
>
>People used to talk this way about IBM.

//SYSIN DD *
data
/*

>>Our society is not currently capable of comprehending bugs. It has a
>>rough idea, but a very crude one. I expect planes to fall out of the sky.
>>I really do. The problem is that those planes kill relatively innocent
>
>They already do. What happened to the first commercial jet liners?

They crashed like the Comet.

>>people, not software developers (that would be evolution by
>>natural selection). We are currently more interested in making money
>>and staying on track than attempting to do the right job.
>
>Which is why we should long for the days when the B6700 and
>CDC machines blew away IBM on pure technical merit?

No, I believe we can do better than that.

Eugene Miya

May 16, 1997, 3:00:00 AM5/16/97

to

In article <bforney-2304...@eagan-rip03.cray.com>,

Brian Forney <bfo...@cray.com> wrote:
>Unfortunately, university professors seem to think this is an important
>topic. I know professors from my undergraduate institution are debating
>this topic.
>
>I say teach students the concepts and the fundamentals. Examples from real
>world software and hardware systems can be used, but universities are not
>votech schools. Maybe assignments in an OS class will rely upon a certain
>OS, but don't force one and only one OS's paradigms on students. If you
>need freely available source to teach students, then do so. If you need
>the most advanced OS to illustrate a difficult problem, then do so.
>
>Often times UNIX variants happen to be the best for this because there are
>so many flavors and free versions exist. (How would I make a class add
>real-time support to, say, the MacOS? Apple doesn't even license source,
>and if they did, how would the average university pay for it?)

I have always thought that my early experiences traversing the ARPAnet
served me well. The heterogeneous diversity of machines could not be beat.
I agree that Us should not be votech schools, but many profs have single
system experience. I hate to say it but you would learn a lot from the
student annoyances.

I considered that a mini-ARPAnet would make an interesting teaching tool
and an education product. It should have machines and emulators with
vastly different architectures, and OSes, file systems, protocols, etc.
It would be a difficult product/market teaching tool to maintain.

This is why networks were set up: use scarce resources, find file
systems with attached version numbers, 44 character dataset names,
strange control characters, byte vs. word addressing.
Maybe we need SimNet (as opposed to SimCity)?

During a recent NSF site review, I was speaking with someone who worked
at PARC, a student of a distinguished prof at one of a number of nearby
local universities. The PhD never in their career had it occur to
them that their funding competed against the funding of other sciences.
I can tell you that this is not good, because when that person gets sent
to a stint at the NSF, they will spend more of their time learning
and not enough promoting CS budgets. Time to leave. Next person.

I certainly do know DEC people who've taken Macs and turned them into
real-time experiment and process control engines: it is possible.

Eugene Miya

May 17, 1997, 3:00:00 AM5/17/97

to

In article <33733C...@us.ibm.com>,
Greg Pfister <pfi...@us.ibm.com> wrote:
>Eugene Miya wrote:
"I dreamed the real world adopted the Unix philosophy...
...I stepped out of the store and got caught in a huge net.
I awoke screaming."

>Eugene, you're espousing a direction that is in some senses defensible,
>but I think wrong. But I'm not saying complicated is good. Read further.
>

>What do you do with the users who need those features?

That's why this thread started, Unix set OS research back years by
preventing the addition of those features in a proper tested and
experienced way. The question isn't if, it's more a combination of
"when" AND "how."

Network communication wasn't part of the original OS.
File locking still isn't part.
Thompson/Ritchie in the original Bell System Tech. Journal article
asserted that things like locking were more the purview of the
application program, not the OS. That's still radical for a lot of
people to this day.
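
To make "locking as the application's job" concrete, here is a sketch of the classic lock-file idiom, which needs nothing from the kernel beyond an atomic exclusive create. The names try_lock, unlock, and demo are hypothetical, and the scheme is purely advisory: it only works if every cooperating program checks the same lock file.

```python
import os
import tempfile


def try_lock(path):
    """Try to take the lock by creating the file exclusively."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # someone else holds it
    os.write(fd, str(os.getpid()).encode())  # record the holder
    os.close(fd)
    return True


def unlock(path):
    """Release the lock by removing the file."""
    os.unlink(path)


def demo():
    """Return the outcomes of lock, relock, unlock-then-lock."""
    lock = os.path.join(tempfile.mkdtemp(), "spool.lock")
    first = try_lock(lock)
    second = try_lock(lock)
    unlock(lock)
    third = try_lock(lock)
    unlock(lock)
    return first, second, third
```

A stale lock left by a crashed holder is the application's problem too, which is part of why this position remains controversial.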

Certainly I think Ken and Dennis' decision to make device, process, and
file I/O look the same was good. Grafting news, etc. well wasn't quite
as clean.

Fork(2) wasn't bad either.

>For example,
>take policy-driven load balancing (in MVS, not *ix): User states policy,

>like jobs of class A get so much of system until use Z minutes, jobs of
>type B get <etc.>, and the system figures it out given job classes.
>This is more than just easier to use than *ix "nice," it's required for
>shops that run complex job mixes. Major feature. Major complexity.

This is similar to the locking example given.
The stuff which stayed inside survived the acid test of the Labs and the
small address space of the PDP-11.

I think a big part of it is how we package the modularity of OS
function. I learned PL/1, too. My second language. I don't use it anymore.

>Aha, guess what, that's in *ix! Through a third party: Computer
>Associates.
>
>So, is a *ix system with CA's scheduling stuff added more or less
>complex than a system with it built in? Well, both. It's probably not
>as integrated with other tools, and can't do some things as well as code
>with more access to the inner system guts, but on the other hand by
>structurally separating it from the kernel it is undoubtedly simpler to
>implement and maintain.
>

>Other example: Compare the complexity of a CISC implementation to, say,
>the out-of-order superscalar high clock rate RISC Alpha 21264. Which is
>more complicated? Fact is, microarchitectural complexity has now far
>outstripped complexity arising just from weirdly-placed bits in an
>instruction set. But does the simplicity of RISC make that
>implementation easier to manage? Assuredly. That's why new CISCs
>(e.g., Intel Pentium Pro) translate into an internal sort-of-RISC and
>pull the tricks on that.
>

>My point is that the issue is NOT total number of features. The issue
>IS simplicity OF STRUCTURE. It is quite possible to have an immensely
>complex system that is comprehensible because it has a straightforward
>structure. The problem with MVS (if, my local colleagues would say,
>there actually is a problem...) isn't complexity as such. It's rather
>that (like recent word processors) it's had features grafted onto a
>structure that is no longer able to bear their weight. To get things
>comprehensible again it was necessary to start again with a sufficient
>basic structure -- like that of the original Unix.

I had a really good discussion with Bill Wulf in the mid 1980s about
OS futures. We (the computer community) really have to come up with
a workable way of glueing features together so that missing pieces don't
come back to bite us. I started using the YACC compiler generator analogy
(both for generation and testing). Bill cringed (but then he was biased,
he developed PQCC and he later gave me an opportunity to visit Tartan Labs
[NDA]). The idea was to generate an OS as needed. This was in the days
when more experts inhabited net.arch.

Another possibility came from a friend who worked on BART and at Tandem.
That climbing partner (I note that because he will write code on which his
life depends) noted with some pride that most of the time Tandem systems
can remain running even for any sort of upgrade, except an OS upgrade,
that's the only time they have to bring the machine down, and they want
to replace that fault (as they see it). Most people don't have
experience with that. Talk about robust! [It's a fragile kind of robust.]

BTW, it has been argued that one useful conference session name I have
heard is:
All good things come to an end at version 7.
This has been cited for Unix, MacOS, and the famous Magical Number Seven,
Plus or Minus Two paper. (i.e. "original" unix or MVS.)
The Magical Number paper was published by Ed Yourdon in one of his software
engineering books.

I just hope that Inferno has some good ideas. I like Rob Pike, he's
bright and has a clear head, but I also realize Rob's limitations.
I just don't have enough time to study it (except the time we talked at
COMPCON).

We could be worse off. My head reels from some of the comments made by
the LISP machine people, but who knows, maybe they were right, maybe we
don't need operating systems.

I have also spoken a little with Henry Massalin (mentioning this
causes Darrell Long to cringe: Henry and his teddy bear, or another
netter who will punch Henry in the nose if she gets bugged about giving
a back ride). I wish I had better insight into what Henry did to
attract a following.

>Greg Pfister

BTW, I was looking at a review of your Clusters book.
It motivates this Wiley editor to contact Prentice-Hall for a review copy.

The usual reference about complexity:
Chapter 6, H. Simon, The Sciences of the Artificial

Paul Bunyk

May 19, 1997, 3:00:00 AM5/19/97

to

eug...@cse.ucsc.edu (Eugene Miya) writes:

>
> In article <5julur$g...@newshost.nmt.edu>,
> Victor Yodaiken <yoda...@chelm.cs.nmt.edu> wrote:
> >In article <5jjfhe$5...@darkstar.ucsc.edu>,
> >Eugene Miya <eug...@cse.ucsc.edu> wrote:
> >>I wonder if the petaflops machine will run NT.
> >
> >Not at petaflops.
>
> I think quite a few people would agree with you except maybe a few NT people.
>

Well, so far both petaflops projects I've heard about (and I'm working
on superconductive hardware for them) were discussing UNIX ;-)

People do not really need MS Office running on a petaflops computer;
a good parallelizing compiler and a tricky scheduler are what is important
in the OS for petaflops => no place for Microsoft in *that* business!
;-)

Paul

--
("`-''-/").___..--''"`-._ UNIX *is* user-friendly, he is just very
`6_ 6 ) `-. ( ).`-.__.`) picky about who his friends are...
(_Y_.)' ._ ) `._ `. ``-..-' Paul Bunyk, graduate student
_..`--'_..-_/ /--'_.' ,'art by (and part-time UN*X sysadm)
(il),-'' (li),' ((!.-' F. Lee http://pbunyk.physics.sunysb.edu/~paul

Mike Haertel

May 20, 1997, 3:00:00 AM5/20/97

to

In article <5lj372$1...@darkstar.ucsc.edu>, Eugene Miya wrote:
>The idea was to generate an OS as needed. This was in the days
>when more experts inhabited net.arch.

Plenty of experts still inhabit comp.arch. However it's hard
to have interesting discussions when the problems you're most
concerned with are related to future products that must be
kept confidential for business reasons.

If you want good technical discussions, hang out on comp.lang.<blank>
for sufficiently non-trendy <blank>.

>I just hope that Inferno has some good ideas.

Inferno combines at least four good ideas: Plan 9, Java, Tk, and
minimal resource requirements. It's quite nice. A pity they want
a fortune for source licenses; now that free Unixes have cracked
the source code barrier I'm not interested in software w/o source.

>I have also spoken a little with Henry Massalin [...]

>I wish I had better insight into what Henry did to
>attract a following.

He achieved remarkably good performance (an order of magnitude
better than conventional OS's) on improbable hardware. Unfortunately,
it's not clear that his techniques are of use to any but the most
rarified stratum of programmers. (Perhaps you could argue that
the usefulness of an idea is proportional to the number of people
who can use it multiplied by the leverage it gives, but that's a
topic for another day...)

Hugh LaMaster

May 20, 1997, 3:00:00 AM5/20/97

to

I guess our site already expired these posts.
I'm not quite sure who said what. For the record:

[Someone said:]

> >Which is why we should long for the days when the B6700 and
> >CDC machines blew away IBM on pure technical merit?

I just thought I would mention that in the 60's, there were
many innovations in various models of the IBM/360 series.
Look in Siewiorek, et al. IBM decided during the early 70's
computer recession that all that innovation was costing them
too much money for the number of innovative machines shipped.
Gene Amdahl has written about the defining moment, although
I don't recall the exact date. But, prior to that, quite a
few architectural innovations, either by IBM, or by others
but recent at the time, made it onto various models of 60's
production machines.

And, prior to Amdahl leaving, IBM did want to build machines
competitive with CDC on speed. In retrospect, one of the
things that IBM didn't seem to realize, that Cray did, was
the significance of purely physical and electrical things.
Cooling. Wire lengths. Overall packaging. IBM, and most
other, machines of that era were very messy inside compared
to Cray's clean and compact designs.

[The Burroughs design was interesting, for different reasons.
I wonder if anyone has looked at how well those machines
would have run Java ;-) ].

Enough history for now.

Victor Yodaiken

May 21, 1997, 3:00:00 AM5/21/97

to

In article <5lj372$1...@darkstar.ucsc.edu>,

Eugene Miya <eug...@cse.ucsc.edu> wrote:
>That's why this thread started, Unix set OS research back years by
>preventing the addition of those features in a proper tested and
>experienced way. The question isn't if, it's more a combination of
>"when" AND "how."

Technically, I think you are way off base. Thompson&Ritchie's
work was one of those revolutionary engineering designs that
shows that 90% of the complexity of traditional designs is
unnecessary. The idea of files as character streams, the
unifying paradigm of file i/o and simple file tree,
the simplified fork, the separation of shell and OS,
the invention and use of C and even the block/character devices are impressive.
Furthermore, UNIX utility programs were simple and powerful and
enormously easy to connect via pipes and scripts. The resulting
platform was just better than the alternatives: even though many
of these ideas appeared in other places.

Economically, however, UNIX did hold back OS research, only because
after the technical advantages and source code distribution had
made UNIX the engineering standard, the nitwit lawyers and
marketing gurus at AT&T suddenly shut the door. I think that only
now, with the emergence of the free UNIXs, have we recovered from
this enormous idiocy.

>I have also spoken a little with Henry Massalin (mentioning this
>causes Darrell Long to cringe: Henry and his teddy bear, or another
>netter who will punch Henry in the nose if she gets bugged about giving
>a back ride). I wish I had better insight into what Henry did to
>attract a following.

Massalin's Ph.D. thesis contains at least 3 wonderful ideas.
1. "compiling" OS functions on the fly to eliminate layers.
2. using non-locking synchronization in the OS
3. software plls.
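
A loose analogy for item 1, using closures rather than generated machine code, so it should not be read as Massalin's actual technique. The idea sketched is that the checks and dispatch a generic read repeats on every call can be done once at open time, leaving a short routine bound to one descriptor. The table layout and both function names are invented for this example.

```python
def generic_read(table, fd, n):
    """Generic path: validate and dispatch again on every call."""
    entry = table.get(fd)
    if entry is None:
        raise OSError("bad descriptor")
    if entry["kind"] != "buffer":
        raise OSError("unsupported kind")
    pos = entry["pos"]
    chunk = entry["data"][pos:pos + n]
    entry["pos"] = pos + len(chunk)
    return chunk


def specialize_read(table, fd):
    """Validate and dispatch once, then return a routine for this descriptor."""
    entry = table.get(fd)
    if entry is None:
        raise OSError("bad descriptor")
    if entry["kind"] != "buffer":
        raise OSError("unsupported kind")
    data = entry["data"]  # invariant folded into the specialized routine

    def read(n):
        pos = entry["pos"]
        chunk = data[pos:pos + n]
        entry["pos"] = pos + len(chunk)
        return chunk

    return read
```

Both paths return the same bytes; the specialized one simply has fewer layers between the caller and the data.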

It also shows that a very good programmer with a sharp total
understanding of OS operation and the tradeoffs of the hardware
base can make order of magnitude improvements in performance over
the usual standard.

Greg Pfister

May 22, 1997, 3:00:00 AM5/22/97

to

Eugene Miya wrote:
[snip]
> [postcard about Scalability Day]
> ...Mentions 64-bit, mentions Tandem (that has me

> wondering how NT is doing in the fault-tolerant market).

At the moment, not at all without some help from its friends.

There's cluster software code-named "Wolfpack" (after the cover on my
book :-) that is due out in June. Others are making noises like it
won't be out then, but June is what the product manager says. It's Beta
now.

However, into a vacuum flows much stuff. There are hardware FT systems
that run NT (like Marathon), cluster software add-ons from several
people (NCR, Vinca, Digital, others), and several folks who've announced
or demonstrated "parallel" databases across multiple NT systems (Tandem,
Oracle). Apparently a pre-beta parallel MS SQL Server will appear at
Scalability Day, doing a billion transactions/day on several Pentium
systems. That will be used to indicate that NT has scalability. That's
of course 1/2 false (the scaling across the machines is from the DB rather than
the OS) and 1/2 true (the OSs are NT, so...)

Greg

Eugene Miya

May 23, 1997, 3:00:00 AM5/23/97

to

I mentioned some early network experience (ARPA'1973) and I got some
email about architectural diversity (the respondent and I have had fun
talking about that in the past). But I don't want to relive that history.

We (users and developers of computing machinery) are going to continue
to have problems with the way we teach kids about computing if we
continue to do what we have done in the past. I am not complaining
about an in-depth education, but the ACM Core Curriculum which a lot of CS
Depts. are pushing will fuel a lot of resentment from other disciplines
(already has as cited: the phrase cited in a physics report was
"mere programming"). We want and need educated programmers, but it
doesn't help when you get efforts like Mach and LOCUS only running on
32-bit machines (one example). Generality has a place as does breadth of
experience. It is a Dept.'s job to balance depth and breadth, and it
will get harder and more complex.

Not all machines use the same character set. Floating point does
differ, although less so since IEEE 754. 36-bit machines
still exist. Few people realize that programs like SPICE actually
started on a 36-bit machine. Boy, ever try porting a 36-bit program to
a 32-bit machine? I hear "Java is portable," yeah right. That tells me
more about the cluelessness of the speaker and their lack of experience
running on diverse platforms. They would be a fine coder.
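One concrete way the 36-bit to 32-bit port bites: a common convention on 36-bit machines packed five 7-bit ASCII characters into each word. The sketch below models that packing with plain integers and shows what a naive truncation to 32 bits does to the first character. The function names and the left-justified layout with an unused low bit are assumptions made for illustration, not taken from any particular program.

```python
WORD36_MASK = (1 << 36) - 1
WORD32_MASK = (1 << 32) - 1


def pack5(text):
    """Pack five 7-bit characters into one 36-bit word, left-justified (low bit unused)."""
    if len(text) != 5:
        raise ValueError("need exactly five characters")
    word = 0
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not 7-bit ASCII")
        word = (word << 7) | code
    return (word << 1) & WORD36_MASK


def unpack5(word):
    """Recover the five characters from a word laid out as in pack5."""
    word >>= 1
    return "".join(chr((word >> shift) & 0x7F) for shift in (28, 21, 14, 7, 0))


word = pack5("SPICE")
truncated = word & WORD32_MASK  # what survives if the word is forced into 32 bits
```

Here unpack5(word) gives back the original text, while unpack5(truncated) has lost the top four bits of the first character. Any code that assumed "one word holds five characters" has to be rewritten, not just recompiled.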

The problem is that CS Depts. lack incentives to expose students to
diverse hardware, and it's a maintenance nightmare. That becomes an
incentive to One-true architecture. One-true OS, and one-true language.

It is unfortunate that the net isn't as open as it once was.

Preston Briggs

May 23, 1997, 3:00:00 AM

Eugene Miya <eug...@cse.ucsc.edu> wrote:

>The problem is that CS Depts. lack incentives to expose students to
>diverse hardware, and it's a maintenance nightmare. That becomes an
>incentive to One-true architecture. One-true OS, and one-true language.

Indeed, there's a big dis-incentive, since the departments want to
avoid that maintenance nightmare themselves. Perhaps there should be
a move back to having students maintain the machines they work on
(having finished, I can say that).

On the other hand, CS is more than machines.
I doubt Tarjan's students would profit too much from
being sysadmins on a variety of workstations.

Preston Briggs

Eugene Miya

May 23, 1997, 3:00:00 AM

In article <5lv8sv$9...@newshost.nmt.edu>,
Victor Yodaiken <yoda...@chelm.cs.nmt.edu> wrote:
>In article <5lj372$1...@darkstar.ucsc.edu>,
>Eugene Miya <eug...@cse.ucsc.edu> wrote:

>>That's why this thread started, Unix set OS research back years by
>>preventing the addition of those features in a proper tested and
>>experienced way. The question isn't if, it's more a combination of
>>"when" AND "how."
>
>Technically, I think you are way off base.

I'm off base?

>Thompson&Ritchie's
>work was one of those revolutionary engineering designs that
>shows that 90% of the complexity of traditional designs is
>unnecessary.

It was the Labs guys who made the 10 year insight.
Well actually, their insight was not unique. It's why RISC architectures
and simpler programming languages also came into prominence after.
There are well known papers which pointed out that the majority (typically
80-90%) of features (instruction sets, commands) NEVER get used.
It happened after MULTICS, PL/1 and ALGOL 68, and pre-VAX CISC architectures.

I would bet dollars to donuts that it will happen again (on different
time periods for the three different topics). It's a problem which
people have about complexity. We can't handle a lot of it. As Greg has
pointed out, people want some of those features. The challenge is how
to implement them (see the "how" above).

Summary points of BSTJ 1978 paper trimmed.

>Furthermore, UNIX utility programs were simple and powerful and
>enormously easy to connect via pipes and scripts. The resulting
>platform was just better than the alternatives: even though many
>of these ideas appeared in other places.

My favorite example of the above was spline(1) and spline(3).
The neat thing was that it was based largely on the convention of ASCII
ordered pairs of numbers. BUT IT WAS A CONVENTION, not a standard or a big
deliberate design consideration. The problems in other systems include
side-effects and false economies of scale. People got stuck
in languages. Here in this system and on the net, people had co-routines.
People could see basic ideas of parallelism and simple IPC. Modularity
was king!
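The "ASCII ordered pairs" convention is easy to sketch. The filter below reads lines of "x y" text and writes more lines of "x y" text, so any other tool that follows the same convention can sit before or after it in a pipeline. To keep it short it does linear interpolation, not a real spline, and all function names are made up for the example.

```python
def parse_pairs(lines):
    """Turn lines of 'x y' text into (x, y) float tuples, skipping blank lines."""
    pairs = []
    for line in lines:
        if line.strip():
            x, y = line.split()
            pairs.append((float(x), float(y)))
    return pairs


def interpolate(pairs, steps):
    """Linear interpolation: split each segment between neighbouring points into `steps` pieces."""
    out = []
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        for i in range(steps):
            t = i / steps
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    if pairs:
        out.append(pairs[-1])
    return out


def format_pairs(pairs):
    """Write points back out in the same 'x y' text convention they came in."""
    return [f"{x:g} {y:g}" for x, y in pairs]


def run_filter(infile, outfile, steps=4):
    """Read pairs from one text stream, write interpolated pairs to another."""
    for line in format_pairs(interpolate(parse_pairs(infile), steps)):
        outfile.write(line + "\n")
```

Calling run_filter(sys.stdin, sys.stdout) would turn this into a command-line filter. Nothing enforces the text format: the next program in the pipe just has to agree on it, which is the point about convention versus standard.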

The net is fortunate to have guys like Mash who were around then and
iterated these ideas and ARE STILL AROUND. It was the experience gained
from the iteration, trying those ideas, which refined the OS.

However it's been attacked: e.g. the common record-locking criticism
is one attack. It sits ill with those who want more than character and
block I/O. Do you want me to attack more of the already rehashed
problems with the OS? These are well documented. I can drive up to
Atherton and get Don Norman's collected file of Unix criticisms.

Sometimes these ideas are too simple (spline only worked with 1-d functions).
But they showed an important direction. You have to have the source
code and understand splines to extrapolate to higher dimensions.

>Economically, however, UNIX did hold back OS research, only because
>after the technical advantages and source code distribution had
>made UNIX the engineering standard, the nitwit lawyers and
>marketing gurus at AT&T suddenly shut the door. I think that only
>now, with the emergence of the free UNIXs, have we recovered from
>this enormous idiocy.

Economics works against research. v6: $150, commercial v6: $20K,
v7: $300, v7 to a commercial concern was $30K. And it went up from there.

>Massalin's Ph.D. thesis contains at least 3 wonderful ideas.
>1. "compiling" OS functions on the fly to eliminate layers.
>2. using non-locking synchronization in the OS
>3. software PLLs (phase-locked loops).
>
>It also shows that a very good programmer with a sharp total
>understanding of OS operation and the tradeoffs of the hardware
>base can make order of magnitude improvements in performance over
>the usual standard.

Sounds fine. I will remember that the next time I see Henry.

Eugene Miya

May 24, 1997, 3:00:00 AM

In article <sjcEA6...@netcom.com>, Steven Correll <s...@netcom.com> wrote:
>I think operating system APIs matter, because we tend to imitate the
>environment we program in. For example, if your environment teaches
>you to expect that you can use open/read/write/close system calls on

Consider what it means to be a "file." First, some environments don't
recognize that terminology. Let's partially ignore that.

>any device, be it the console or the modem port or a disk file or a
>TCP/IP socket, you have internalized an important lesson about
>architectural design, far more concrete than if it's obscured by
>terms like "object orientation" and "polymorphism".

Files have "names." The syntax of a filename might be, say,
some number of alphanumeric characters, a period (or dot), followed by
maybe a three-character extension. Any resemblance to existing systems is
pure coincidence. Then perhaps, say, I add something which I will call a
version. Then I have to handle that. Or instead, say I open(2) a
44-character entity with a minimum of four dots and no text string
longer than 8 characters: what happens when I get a code which opens a
file name like "thisisnot8.txt"? Don't get me wrong. I fully agree with
you, but this is why languages and OSes aren't enough in themselves.
These are merely syntactic examples (version numbering is arguably
semantic in one case: default always assumes the highest version number).
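To make the second naming scheme concrete, here is a small checker that takes the rule exactly as worded above (at most 44 characters, at least four dots, no component longer than 8 characters) and reports why a name does not fit. The function name and the treatment of empty components are assumptions for the sketch. It does not model any real system's rules.

```python
def name_problems(name, max_total=44, min_dots=4, max_part=8):
    """Return the reasons `name` breaks the hypothetical naming rule (empty list if it fits)."""
    problems = []
    if len(name) > max_total:
        problems.append(f"longer than {max_total} characters")
    if name.count(".") < min_dots:
        problems.append(f"fewer than {min_dots} dots")
    for part in name.split("."):
        if not part:
            problems.append("empty component")
        elif len(part) > max_part:
            problems.append(f"component {part!r} longer than {max_part} characters")
    return problems
```

A portable program that hard-codes "thisisnot8.txt" fails this check twice over, so the port has to invent a mapping between the two name spaces before the first open() can succeed.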

Embedded blanks in file names have interesting consequences (I don't
want to get into that can of worms).

Some people will say these are trivial portability details. They are,
however, quite extensive as problems. You need source code. You need a
compiler. The costs add up.

>If your environment requires you to use a special API for the console
>versus a disk file, or requires you to decide whether you're writing to
>a socket rather than a disk file so you can substitute "send/recv" for
>"write/read", or if it delivers OS error message strings in a fashion
>that's easy to present in a GUI dialog box but inconvenient to format
>within a log file on disk, your programming will be doubly handicapped:
>first because your own designs will be brittle and inflexible if you
>imitate the example set by the environment, and second because even if
>you wish to overcome the limitation of the environment, you will be
>hard-pressed to anticipate all of the special cases and write all of
>the extra code required to accommodate them.

True.
Big switch/case statements and #ifdefs.
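For contrast, here is a minimal sketch of the uniform style being praised: one copy routine that uses only read and write on file descriptors and never asks what kind of object is behind them. It is written against Python's os module, and copy_fd and demo are names invented for the example. The demo exercises a pipe and a disk file. On Unix the same calls should also work on a socket's descriptor, though that is not shown here.

```python
import os
import tempfile


def copy_fd(src_fd, dst_fd, chunk=4096):
    """Copy bytes until EOF using only read/write; no special case per device type."""
    total = 0
    while True:
        data = os.read(src_fd, chunk)
        if not data:
            return total
        while data:
            written = os.write(dst_fd, data)  # a write may be partial, so loop
            data = data[written:]
            total += written


def demo():
    """Push the same bytes pipe -> disk file -> pipe with the one routine."""
    r, w = os.pipe()
    os.write(w, b"same calls, ")
    os.close(w)
    fd, path = tempfile.mkstemp()
    try:
        copy_fd(r, fd)              # pipe to disk file
        os.close(r)
        os.lseek(fd, 0, os.SEEK_SET)
        r2, w2 = os.pipe()
        copy_fd(fd, w2)             # disk file to pipe
        os.close(w2)
        result = os.read(r2, 4096)
        os.close(r2)
        return result
    finally:
        os.close(fd)
        os.remove(path)
```

There is no switch statement and no #ifdef in copy_fd. The special cases live behind the descriptor instead of in every caller.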

>Unix in 1997 is certainly no paragon; but IMHO it seems that the
>MS-Windows world has lost many of the good ideas in Unix and is busy
>repeating the mistakes of earlier generations.

Agreed.

Eugene Miya

May 24, 1997, 3:00:00 AM

In article <557mgvz...@pbunyk.physics.sunysb.edu>,
Paul Bunyk <pa...@pbunyk.physics.sunysb.edu> wrote:
>Well, so far both petaflops projects I've heard about (and I'm working
>on superconductive hardware for them) were discussing UNIX ;-)

Oh, I don't doubt projects aren't using NT for PFLOPS OSes. It's what
will follow which has me wondering.

>People do not really need an MS Office running on a petaflop computer,
>good parallelizing compiler and tricky scheduler is what is important
>in the OS for petaflops => no place for Microsoft in *that* business!
>;-)

I wish the software technology existed for a "good parallelizing compiler."
Not what people claim, but what people would regard as "drop in." I think Bill
would agree with you, but scientists like their tools fast. If not
Office, then the question is what tools? Consider this reference which
I recently received:

%A Bart Childs
%T TeX on the Cray
%J TUGboat
%V 9
%N 2
%D 1988
%P 181
%K CTSS version,
%X "CTSS interface to be about as oppressive as those in unix and MS-DOS.
The flat file structure is worse." Why? It was needed and they didn't want
to learn yet another interface.

Not office, but possibly worse. 8^)

Nice ASCII tiger. I enjoyed it.

Eugene Miya

May 24, 1997, 3:00:00 AM

>> Maynard Handley wrote:

FYI Hugh I ran into Maynard last evening at the local SIGGRAPH meeting.

In article <335EAE...@nas.nasa.gov>, Hugh LaMaster
<lama...@nas.nasa.gov> wrote:

>> At risk of sounding totally square, I have to object that this
>> is only a CS prof permitting his students to learn stories
>> instead of ideas, and, in that case, such a prof isn't teaching
>> computer *science* or *engineering*, but, (merely) vocational
>> training. Drafting, instead of architecture. Sure, anybody

Don Norman (The Design/Psychology of Everyday Things) and
Stewart Brand (How Buildings Learn) give some bad examples of
building architecture gone awry.

Your Kay reference is a good reference, but I would object to a couple
of points about it, fundamentally that we are perhaps at a crossroads
and that it is a time for a change. We at one time believed that the
world was flat, and that it was okay to enslave certain people. That
doesn't mean that we have to continue to do that.

>> can draw up plans, and, anybody can install the latest vendor OS
>> according to the recipe. [Not that I have anything against
>> vocational training. Everyone should have some, even scientists.]
>> But [Kay] learning stories is not the same thing as learning
>> to logically analyze ideas or complex systems. A CS [or any
>> other _______ *science*] prof, should not waste time on unnecessary
>> vocational education. There are thousands of private and corporate
>> training programs available where one can learn the latest
>> hot commercial topics at any time.

The problem with private and corporate training programs is that they
tend to be rather too focused. Advances in COBOL? Sure, fine. Java:
sure to a point, depends where you are. Other topics aren't so refined.
A training program might not be the best way to do it.

In article <bforney-2304...@eagan-rip03.cray.com>,
Brian Forney <bfo...@cray.com> wrote:
>Unfortunately, university professors seem to think this is an important
>topic. I know professors from my undergraduate institution are debating
>this topic.

Same here.

>I say teach students the concepts and the fundamentals. Examples from real
>world software and hardware systems can be used, but universities are not
>votech schools. Maybe assignments in an OS class will rely upon a certain
>OS, but don't force one and only one OS's paradigms on students. If you
>need freely available source to teach students, then do so. If you need
>the most advanced OS to illustrate a difficult problem, then do so.

Actually, that's why survey classes exist. Actually, I would "force"
("Be one with...") diverse exposure. It's why people collected animals
and had zoos, and why people travel, and why universities used to
enforce a foreign language requirement for all majors.

>Often times UNIX variants happen to be the best for this because there are
>so many flavors and free versions exist. (How would I make a class add
>real-time support to, say, the MacOS? Apple doesn't even license source,
>and if they did, how would the average university pay for it?)

The availability of source code was initially a positive step.
Companies are changing their ideas about this. The problem comes with
new ideas. You have to have the future, unwritten app to understand the
dynamics, and to understand that the developer had flawed, imperfect ideas
about his or her program. You can never beat executable code for a
humbling experience.

Previously used c.s.s. references on architecture:
This goes for computers as well as buildings:

The problems of "art" as architectural aspiration come down
to these:
Art is proudly non-functional and impractical.
Art reveres the new and despises the conventional.
Architectural art sells at a distance.
...
Art begets fashion; fashion means style; style is made of
illusion (granite veneer pretending to be solid; facade
columns pretending to hold something); and illusion is no
friend of function. The fashion game is fun for architects to
play and diverting for the public to watch, but it's deadly for
building users.
--Stewart Brand

Is there such a thing as common sense?
Oh sure, but I wouldn't trust it if I were you.
Common sense is just ordinary thinking.
It varies from culture to culture and reflects more the progress of
a particular society rather than the basic truth. Common sense
has been responsible for people thinking that the Earth was a flat disk,
that the Sun and planets revolve around it, and that the stars only come
out at night.
--Marilyn vos Savant

Arun Sharma

May 24, 1997, 3:00:00 AM

On 23 May 1997 19:59:12 GMT, Preston Briggs wrote:

p> Indeed, there's a big dis-incentive, since the departments
p> want to avoid that maintenance nightmare themselves.
p> Perhaps there should be a move back to having students
p> maintain the machines they work on (having finished, I can
p> say that).

These days it's pretty common to see students maintain their own Linux
boxes. I myself did that in school and I think it was a good learning
experience.

-Arun

--
Arun Sharma -- arun at home.com -- http://cc1000916-a.avnl1.nj.home.com

Victor Yodaiken

May 25, 1997, 3:00:00 AM

In article <5m54o5$i...@darkstar.ucsc.edu>,
Eugene Miya <eug...@cse.ucsc.edu> wrote:
>It was the Labs guys who made the 10 year insight.
>Well actually, their insight was not unique. It's why RISC architectures
>and simpler programming languages also came into prominence after.
>There are well known papers which pointed out that the majority (typically
>80-90%) of features (instruction sets, commands) NEVER get used.
>It happened after MULTICS, PL/1 and ALGOL 68, and pre-VAX CISC architectures.
>
>I would bet dollars to donuts that it will happen again (on different
>time periods for the three different topics). It's a problem which
>people have about complexity. We can't handle a lot of it. As Greg has

Yes. But it's quite a great trick to figure out which complexity can
be discarded.

>pointed out, people want some of those features. The challenge is how
>to implement them (see the "how" above).

>However it's been attacked: e.g. the common record-locking criticism
>is one attack. It sits ill with those who want more than character and
>block I/O. Do you want me to attack more of the already rehashed
>problems with the OS? These are well documented. I can drive up to
>Atherton and get Don Norman's collected file of Unix criticisms.

Of course there are many drawbacks in original design. I'm not
sure if I agree with the record locking critique, but there were
other flaws. And sure, I'd like to see Norman's file.

Zalman Stern

May 25, 1997, 3:00:00 AM

Victor Yodaiken (yoda...@chelm.cs.nmt.edu) wrote:
: Of course there are many drawbacks in original design. I'm not
: sure if I agree with the record locking critique, but there were
: other flaws. And sure, I'd like to see Norman's file.

The Don Norman UNIX critique I've seen goes on about how the rm command
is not named something mnemonic like "delete" and how there is no
interactive option to prompt for each file in a directory whether you want
to delete it or not. And from there I guess someone wrote an interactive
remove program but it was originally meant to be run standalone on a PDP-11
(???) using the console switches to determine whether to delete the file or
not so it was called "dsw" for "Delete Set Switch." Overall this paper is
not a high point of Norman's writing.

I guess my attitude is that if you're not the kind of person who learns
command line syntax for scripting, no command line interface is going to be
a good user interface. So why bother criticizing the particular
implementation when the entire concept is inappropriate to the task?
(Especially when UNIX command line was vastly more useable than VMS, if not
quite up to Twenex.)

Now what passes for GUI on UNIX makes it pretty clear that the UNIX
community gets user interface about as well as the Mac community gets
(well, got) OS design... (Actually, I take that back. The MacOS at least
lets you write moderately realtime code when you need to. So it has some
redeeming qualities. Can't think of any for UI under X11 :-))

-Z-

David Collier-Brown

May 26, 1997, 3:00:00 AM

to Eugene Miya

Alas, this seems to happen a lot: Unix used the obviously
good ideas from Multics/Project MAC, and dropped
1) the dumb ones
2) the complex ones.

Subsequently, I see OS vendors attempting to add
the bad ideas and complexity back in (:-))

--dave (I hate all vendors. I'm a vendor) c-b
--
David Collier-Brown, | Always do right. This will gratify some people
185 Ellerslie Ave., | and astonish the rest. -- Mark Twain
Willowdale, Ontario | dav...@hobbes.ss.org, canada.sun.com
N2M 1Y3. 416-223-8968 | http://java.science.yorku.ca/~davecb

Eugene N. Miya

May 27, 1997, 3:00:00 AM

In article <slrn5o3sr...@ducky.net> mi...@ducky.net (Mike Haertel) writes:
>Plenty of experts still inhabit comp.arch. However it's hard
>to have interesting discussions when the problems you're most
>concerned with are related to future products that must be
>kept confidential for business reasons.

I have signed many non-disclosure agreements, conduct forms, security
what-nots short of getting a full-time clearance. There remain quite a
few knowledgeable people in c.a. (you guys are lucky to have Mash, and even
Burton posts every now and again). And recently, even Alan Smith has
asked me for money, and we might try to produce another grad student
(if there are any Berkeley CS grad students looking for a topic
reading: it will be hard to top the last one, he got the dissertation of
the year and is now a USC prof.).

>If you want good technical discussions, hang out on comp.lang.<blank>
>for sufficiently non-trendy <blank>.

Unless I wanted to discuss OSes. ;^) No time. I will bide my time in
c.s.s. and c.p. when I get a chance (I have a week of travel lull).

Actually the best place for technical discussion tends to be Castro Street
in Mtn. View at lunch.

>Inferno combines at least four good ideas: Plan 9, Java, Tk, and
>minimal resource requirements. It's quite nice. A pity they want
>a fortune for source licenses; now that free Unixes have cracked
>the source code barrier I'm not interested in software w/o source.

COMPCON'97 papers.

>>Henry Massalin [...]
>
>He achieved remarkably good performance (an order of magnitude
>better than conventional OS's) on improbable hardware. Unfortunately,
>it's not clear that his techniques are of use to any but the most
>rarified stratum of programmers. (Perhaps you could argue that
>the usefulness of an idea is proportional to the number of people
>who can use it multiplied by the leverage it gives, but that's a
>topic for another day...)

Alas, I don't know Henry that well, nor do I have the time to follow
what he's doing. At the moment, I have to be a little more concerned
about mass storage systems.

Eugene N. Miya

May 27, 1997, 3:00:00 AM

In article <5m8aa6$q...@newshost.nmt.edu> yoda...@chelm.cs.nmt.edu
(Victor Yodaiken) writes:
>Yes. But it's quite a great trick to figure out which complexity can
>be discarded.

That's the trick! So far most of it comes from frequency, empirical
experience, complaints, new hacks, and dissatisfied programmers.

>Of course there are many drawbacks in original design.

One more time: Ken and Dennis did not "design" the system. It evolved.
You don't see the dsw(1) command anymore (delete with switch).
Berkeley added rm -i. Tmg is gone. Have Mash tell you about all that
ancient history. I recover (concentrate on) the astro and satellite stuff.

>I'm not sure if I agree with the record locking critique, but there were
>other flaws. And sure, I'd like to see Norman's file.

I don't need record locking, but the criticism is documented.
Don's interface critique was published in Datamation and can be found
elsewhere like his Voyager CD-ROM. I can ask him the next time I go to
Atherton.

Eugene N. Miya

May 27, 1997, 3:00:00 AM

Eugene Miya <eug...@cse.ucsc.edu> wrote:
>>The problem is that CS Depts. lack incentives to expose students to
>>diverse hardware, and it's a maintenance nightmare. That becomes an
>>incentive to One-true architecture. One-true OS, and one-true language.

In article <5m4sug$8ej$1...@joe.rice.edu> pre...@cs.rice.edu
(Preston Briggs) writes:
>Indeed, there's a big dis-incentive, since the departments want to
>avoid that maintenance nightmare themselves. Perhaps there should be
>a move back to having students maintain the machines they work on
>(having finished, I can say that).

I wouldn't go quite as far as maintenance, but...
Bill Wulf said in one of his books (where it's handy to have the c.s.s. FAQ):

We want to learn about the consequences of different designs on
the useability and performance of multiprocessors.
Unfortunately, each decision we make precludes us from exploring its
alternatives. This is unfortunate, but probably inevitable for hardware.
Perhaps, however, it is not inevitable for the software....
and especially for the facilities provided by the operating system.

Bill had good insights. Too bad he no longer reads comp.arch.
I am thankful to have gotten to know him before the net got too large
and too PC oriented.

Borchers got kicked from LLNL to NSF because of his suggestions to get a
foreign-built supercomputer. I think more serious problems exist than
say national pride or even national economies. Last evening, on PBS,
the National Geographic Society had one of their specials showing the
FBI training facility at Quantico (Hogan's Alley). Out in the Mojave desert
similar training goes on as well as things like Top Gun schools. A few years
back I drafted bits and pieces of a memo on a info warfare sandbox: what
would it take (basically, a resettable mini-Internet). It was based on
Internet worm experience. The draft circulated with a few knowledgeable
hacker friends including one spook agency. But after a while I grew bored
working on it. The problem is that we are educating (we call it education)
a slew of sys admins in very predictable ways. That's not good.
The net was protected by its diversity. Anyways the net also learns by
experience, so the thing now is to just let it poke along and let happen
what will happen, rather than anticipate it. Darwin among the machines.

>On the other hand, CS is more than machines.
>I doubt Tarjan's students would profit too much from
>being sysadmins on a variety of workstations.

Well, that's generally a statement for all theory.

Jan Vorbrueggen

May 28, 1997, 3:00:00 AM

eug...@george.arc.nasa.gov (Eugene N. Miya) writes:

> I don't need record locking, but the criticism is documented.

You don't? You're saying you'll never use parallelized database software, for
instance, or access the same file from multiple processes?
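For readers who have not met record locking, a minimal sketch of the multiple-process case on a POSIX system: each process locks only the byte range of the record it is updating, so updates to other records in the same file can proceed. This uses Python's fcntl.lockf. The fixed 8-byte record layout and the function names are assumptions for the example, and the lock is advisory, so it only protects against other processes that take the same lock.

```python
import fcntl
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # assumed layout: one 8-byte signed counter per record


def increment_record(path, index):
    """Add 1 to record `index`, holding an exclusive lock on just that record's bytes."""
    offset = index * RECORD.size
    fd = os.open(path, os.O_RDWR)
    try:
        # Blocks until no other process holds a lock on this byte range.
        fcntl.lockf(fd, fcntl.LOCK_EX, RECORD.size, offset, os.SEEK_SET)
        try:
            (value,) = RECORD.unpack(os.pread(fd, RECORD.size, offset))
            os.pwrite(fd, RECORD.pack(value + 1), offset)
            return value + 1
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD.size, offset, os.SEEK_SET)
    finally:
        os.close(fd)


def demo():
    """Create a three-record file, bump the middle record twice, return all three values."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, RECORD.pack(0) * 3)
        os.close(fd)
        increment_record(path, 1)
        increment_record(path, 1)
        with open(path, "rb") as f:
            data = f.read()
        return [RECORD.unpack_from(data, i * RECORD.size)[0] for i in range(3)]
    finally:
        os.remove(path)
```

Without the lock, two processes running the read-then-write pair on the same record could both read the old value and one increment would be lost.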

Jan

Eugene N. Miya

May 28, 1997, 3:00:00 AM

Two added notes:

There is hope: Or is there?
I ran into a CS MS graduate who never heard of the Turing test.
I know some people would say this is good, whereas traditional CS profs
would say that's bad.

Dennis, keeping me honest, emailed:

> Berkeley added rm -i.

Base calumny. It was in v7, though not in v6. However, dsw was
still in v6, even though it no longer read the console switches.

Regards,
Dennis