For those who need a lot of power...

Del Cecchi

Nov 15, 2002, 10:09:00 AM

I just saw a press release for the new P655 which has up to 128 power4 processors
in a single frame. Just the thing to power your home network :-)

It gave some performance information also, even including streams.
--

Del Cecchi
cec...@us.ibm.com
Personal Opinions Only

McCalpin

Nov 15, 2002, 2:19:18 PM

In article <ar32mc$sm0$1...@news.rchland.ibm.com>,

Del Cecchi <cec...@us.ibm.com> wrote:
>I just saw a press release for the new P655 which has up to 128
>power4 processors in a single frame. Just the thing to power your
>home network :-)

Or heat your house in those cold Minnesota winters....

>It gave some performance information also, even including streams.

Hmmmm, where did you find a press release that included STREAM
data? All the versions that I have seen had that information
removed....

Anyway, my good twin brother (I am the evil twin) published
p655 STREAM numbers at the STREAM web site on Tuesday.

It may be interesting to note that the 4-way p655/651A delivers
almost as much sustained bandwidth (according to the "standard"
STREAM benchmark results) as the 32 cpu HP SuperDome or the 32 cpu
Origin3800 or the 16 way HP AlphaServer GS series or the 16-way
Sun F15K (my estimate, based on their 72 cpu publication), and
that a single rack of p655 systems delivers more aggregate STREAM
bandwidth than an NEC SX-6 (according to the p655 "tuned" STREAM
benchmark result).

Not too shabby....
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Technical Staff Member IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Sander Vesik

Nov 15, 2002, 2:43:19 PM

Del Cecchi <cec...@signa.rchland.ibm.com> wrote:
> I just saw a press release for the new P655 which has up to 128 power4 processors
> in a single frame. Just the thing to power your home network :-)
>
> It gave some performance information also, even including streams.

Yeah, but it looks like it is in fact "a cluster in a rack" made up from
4-way p655 nodes and not a single address space machine?

--
Sander

+++ Out of cheese error +++

Robert Myers

Nov 15, 2002, 3:25:43 PM

Sander Vesik wrote:

> Del Cecchi wrote:
>
> >I just saw a press release for the new P655 which has up to 128
> power4 processors
> >in a single frame. Just the thing to power your home network :-)
> >
> >It gave some performance information also, even including streams.
>
>

> Yeah, but it looks like it is in fact "a cluster in a rack" made up from
> 4-way p655 nodes and not a single address space machine?
>

And the downside is...?

Unless you're using raw iron, the OS makes NUMA transparent, no?

How else do you get arbitrarily high memory bandwidths?

McCalpin

Nov 15, 2002, 3:53:39 PM

In article <10373894...@haldjas.folklore.ee>,

Yes, it is definitely a "cluster in rack", not a single address
space system.

IBM has sold billions of dollars of such systems under the "IBM
SP" branding, to both technical and commercial customers. The
last time I checked, such clusters made up something like 50% of
the global supercomputing market (by revenue), and made up
something like 30% of the (much larger) technical mid-range
market ($250k-$1M system price band).

Rob Young

Nov 15, 2002, 3:22:46 PM

In article <ar3hbm$hu6$1...@ausnews.austin.ibm.com>, mcca...@gmp246.austin.ibm.com (McCalpin) writes:

>
> Anyway, my good twin brother (I am the evil twin) published
> p655 STREAM numbers at the STREAM web site on Tuesday.
>
> It may be interesting to note that the 4-way p655/651A delivers
> almost as much sustained bandwidth (according to the "standard"
> STREAM benchmark results) as the 32 cpu HP SuperDome or the 32 cpu
> Origin3800 or the 16 way HP AlphaServer GS series or the 16-way
> Sun F15K (my estimate, based on their 72 cpu publication),

Snapshot in time, but impressive.

> and
> that a single rack of p655 systems delivers more aggregate STREAM
> bandwidth than an NEC SX-6 (according to the p655 "tuned" STREAM
> benchmark result).
>
> Not too shabby....

The p655A/651B (8-way) is doing 10400 MB/sec , not bad.

EV7 8-way will be doing over 40000 MB/sec , not bad either.

See slide 31:

http://www.eecs.umich.edu/vlsi_seminar/f01/slides/bannon.pdf

(Note the scale is off, that should be MB not GB).

However, I don't believe you can order EV7 until January.

Rob

Del Cecchi

Nov 15, 2002, 4:13:12 PM

In article <ar3hbm$hu6$1...@ausnews.austin.ibm.com>,
mcca...@gmp246.austin.ibm.com (McCalpin) writes:

It was a press release out of London....

Here is the whole thing. I'm too lazy today to edit out the marketeze. Sorry.

And you may have to register to get the article from freerealtime.com, but they
have good stock quotes.

from
http://quotes.freerealtime.com/rt/frt/N?symbol=IBM&art=C2002111500319u5831&SA=Latest%20News

IBM: New IBM supercomputer packs 128 POWER4 processors in a single
frame Smaller, denser, faster eServer to transform supercomputing
industry

London, Nov 15, 2002 (M2 PRESSWIRE via COMTEX) -- IBM today announced an ultra
dense UNIX(r) server targeted at the High Performance Computing market. Capable
of reaching half a trillion operations per second in peak processing power, the
new eServer packs up to 128 POWER4 processors in a single rack and is available
in four or eight processor building blocks.

The eServer p655 is designed to meet the stringent demands of scientific and
technical computing and Business Intelligence, as well as those applications
that require very large databases or massively parallel processing including
digital media and life sciences.

The eServer p655 is the next generation of the IBM supercomputer made famous by
Deep Blue, the system that defeated chess champion Garry Kasparov. Since Deep
Blue, IBM has been a leader in supercomputing, dominating the popular Top500 List
of Supercomputers. IBM supercomputers are used in life sciences to explore the
genomic research, in automobile design to make cars safer, and in financial
markets to optimise investment strategies.

The eServer p655 continues to leverage IBM breakthrough microprocessor
technology to deliver density and price/performance that make it a superior
alternative for customers thinking of deploying Itanium(r)2 server clusters.

A single eServer p655 rack with 128 POWER4 processors occupies as little as
one-sixth the floor space of an HP rx5670 Itanium2 system with the same number
of processors (1). Additionally, a 4-way eServer p655 with 1.3 GHz POWER4
processors has a SPECfp_rate2000 of 51.7, offering 15 percent better throughput
than a HP rx5670 with four processors. (1)

In measurement of sustained memory bandwidth, the eServer p655 is almost 2.5
times the peak theoretical memory bandwidth of the HP Itanium 2 systems (6.4
GB/sec (2)) in a tuned version of the STREAM benchmark
(http://www.cs.virginia.edu/stream/). (3)

"Tomorrow's innovations are rooted in the bedrock of supercomputing
technologies, where IBM brings decades of experience to the table," said Adalio
Sanchez, General manager of IBM eServer pSeries.

"The eServer p655 is a significant breakthrough for customers who need massive
performance power and scalability without massive computing space and costs. The
flexibility and compact design of the POWER4-based p655 make it an easy winner
over competing systems, and sets a new standard by which all newcomers will be
measured."

The eServer p655 system can be clustered using eServer cluster 1600, combined
using a high-performance switch. For greater flexibility, systems can be defined
using logical partitioning. Cluster systems administration from a single control
workstation is provided by IBM's proven cluster management software offering.

Customers who operate large server clusters demand uncompromising availability
-- and the p655 delivers. From the integrated service processor to the Chipkill
and bit-steering memory technologies, the eServer p655 offers enterprise-class
autonomic computing capability. From a performance, packaging, flexibility or
availability point of view, the p655 is ideal for customers whose workloads are
best managed with a clustered server solution.

The starting price for the eServer p655 is GBP57,359 (4) and is expected to ship
in volume later this year.

The p655 will run the AIX 5L(tm) operating system, including Version 5.1, and
Linux. (5)

About IBM

IBM is the world's largest information technology company, with 80 years of
leadership in helping businesses innovate. Drawing on resources from across IBM
and key Business Partners, IBM offers a wide range of services, solutions and
technologies that enable customers, large and small, to take full advantage of
the new era of e-business.

For more information about IBM, visit www.ibm.com.

The IBM eServer brand consists of the established IBM e-business logo with the
following descriptive term "server'' following it. IBM, the e-business logo,
pSeries are trademarks of IBM Corporation in the United States and/or other
countries. Intel and Itanium are trademarks or registered trademarks of Intel
Corporation. Linux is a registered trademark of Linus Torvalds. All other
company/product names and service marks may be trademarks or registered
trademarks of their respective companies.

Footnotes:

1 Based on SPECfp_rate2000 result of 51.7 for the IBM p655 4-WAY 1.3 GHz with
POWER4 processor and 43.7 for the HP rx5670 4-WAY 1.0 GHz with Itanium(r) 2
processor. HP result posted on www.spec.org/cpu2000 as of November 14, 2002. IBM
result submitted to SPEC on 11/11/02.

3 STREAM Benchmark Data for p655 and Leading Competitive Systems (Source:
http://www.cs.virginia.edu/stream/results filed November 12, 2002) Itanium2 data
from "HP server and workstation performance for technical applications: hp
servers and workstations with Intel Itanium processors", An Executive White
Paper From HP, July 2002, p7

4 Based on p655 4-way with 1.3 GHz POWER4 processors, 4GB memory and 2 18.2GB
disk. Prices subject to change without notice. Reseller prices may vary.

5 IBM anticipates that one or more Linux distributors will support 64-bit Linux
in the first half of 2003.

M2 Communications Ltd disclaims all liability for information provided within M2
PressWIRE. Data supplied by named party/parties. Further information on M2
PressWIRE can be obtained at http://www.presswire.net on the world wide web.
Inquiries to in...@m2.com.

(C)1994-2002 M2 COMMUNICATIONS LTD


Alexis Cousein

Nov 15, 2002, 4:48:51 PM

to cec...@us.ibm.com

Del Cecchi wrote:

> I just saw a press release for the new P655 which has up to 128 power4 processors
> in a single frame.

But not a single system ;).

Nick Maclaren

Nov 15, 2002, 4:50:01 PM

In article <ar3hbm$hu6$1...@ausnews.austin.ibm.com>,

McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>
>Anyway, my good twin brother (I am the evil twin) published
>p655 STREAM numbers at the STREAM web site on Tuesday.
>
>It may be interesting to note that the 4-way p655/651A delivers
>almost as much sustained bandwidth (according to the "standard"
>STREAM benchmark results) as the 32 cpu HP SuperDome or the 32 cpu
>Origin3800 or the 16 way HP AlphaServer GS series or the 16-way
>Sun F15K (my estimate, based on their 72 cpu publication), and
>that a single rack of p655 systems delivers more aggregate STREAM
>bandwidth than an NEC SX-6 (according to the p655 "tuned" STREAM
>benchmark result).

That's with 16 MB pages and the recently released compilers, isn't it?

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679

Alexis Cousein

Nov 15, 2002, 4:50:47 PM

to Robert Myers

Robert Myers wrote:

> Sander Vesik wrote:
>
> > Del Cecchi wrote:
> >

> > >I just saw a press release for the new P655 which has up to 128
> > power4 processors
> > >in a single frame. Just the thing to power your home network :-)
> > >
> > >It gave some performance information also, even including streams.
> >
> >

> > Yeah, but it looks like it is in fact "a cluster in a rack" made up from
> > 4-way p655 nodes and not a single address space machine?
> >
> And the downside is...?
>
> Unless you're using raw iron, the OS makes NUMA transparent, no?

No. Note John made sure he quoted the 4 cpu numbers. Oh, and the 128 cpu
per rack model actually doesn't have the same STREAM numbers per CPU, but
that's just a slight omission.

>
>
> How else do you get arbitrarily high memory bandwidths?

By just summing the STREAM numbers from the nodes...

>
>

Nick Maclaren

Nov 15, 2002, 4:53:53 PM

In article <bLcB9.42987$1O2.4473@sccrnsc04>,
Robert Myers <rmyer...@attbi.com> wrote:

>Sander Vesik wrote:
>>
>> Yeah, but it looks like it is in fact "a cluster in a rack" made up from
>> 4-way p655 nodes and not a single address space machine?
>>
>And the downside is...?
>
>Unless you're using raw iron, the OS makes NUMA transparent, no?

No. It is a distributed memory machine. Each unit of 4 CPUs is a
(probably slightly NUMA) system in its own right.

Robert Myers

Nov 15, 2002, 10:12:51 PM

Alexis Cousein wrote:

> Robert Myers wrote:
>
> > Sander Vesik wrote:
> >
> > > Del Cecchi wrote:
> > >

> > > >I just saw a press release for the new P655 which has up to 128
> > > power4 processors
> > > >in a single frame. Just the thing to power your home network :-)
> > > >
> > > >It gave some performance information also, even including streams.
> > >
> > >

> > > Yeah, but it looks like it is in fact "a cluster in a rack" made up
> > from
> > > 4-way p655 nodes and not a single address space machine?
> > >
> > And the downside is...?
> >
> > Unless you're using raw iron, the OS makes NUMA transparent, no?
>
> No. Note John made sure he quoted the 4 cpu numbers. Oh, and the 128 cpu
> per rack model actually doesn't have the same STREAM numbers per CPU, but
> that's just a slight omission.

Thank you for that clarification. I really wasn't sure of how to
interpret John's post, given the enormous apparent per-CPU performance
discrepancy as compared to machines from other manufacturers.

> >
> >
> > How else do you get arbitrarily high memory bandwidths?
>
>
>
> By just summing the STREAM numbers from the nodes...
>

That much was clear, anyway, if I had understood what the original
STREAM numbers really meant. What I was really asking is related to the
comment in Nick Maclaren's post. The high per-CPU bandwidth is probably
related to the fact that each of the quad-processor units is NUMA (and
if it's not, someone will surely point it out!).

I can do my own reading on IBM's website. Thanks to both of you for the
clarifications.

McCalpin

Nov 15, 2002, 10:34:58 PM

In article <ar3q69$37p$1...@pegasus.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <ar3hbm$hu6$1...@ausnews.austin.ibm.com>,
>McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
>>
>>Anyway, my good twin brother (I am the evil twin) published
>>p655 STREAM numbers at the STREAM web site on Tuesday.
>>
>>It may be interesting to note that the 4-way p655/651A delivers
>>almost as much sustained bandwidth (according to the "standard"
>>STREAM benchmark results) as the 32 cpu HP SuperDome or the 32 cpu
>>Origin3800 or the 16 way HP AlphaServer GS series or the 16-way
>>Sun F15K (my estimate, based on their 72 cpu publication), and
>>that a single rack of p655 systems delivers more aggregate STREAM
>>bandwidth than an NEC SX-6 (according to the p655 "tuned" STREAM
>>benchmark result).
>
>That's with 16 MB pages and the recently released compilers, isn't it?

The runs used large pages, but the compiler level does not make
a significant difference.

Large pages are very helpful in getting full bandwidth on the 4p
systems (almost a 50% boost on tuned STREAM Triad), but the
differences are smaller on the 8p box -- varying from 10%-20%
depending on the experiment.
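
For readers who have not met the Triad figure mentioned above, here is a minimal sketch of the kernel and of the usual STREAM byte accounting (two 8-byte reads plus one 8-byte write per element, reported in units of 10^6 bytes per second). The function names `triad` and `triad_mb_per_s` are made up for this illustration, and a pure-Python loop is interpreter-bound, so the rate it prints says nothing about any real machine, let alone a p655. It only shows how the number is computed.

```python
import time


def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_mb_per_s(n, q=3.0, repeats=3):
    """Run Triad on n-element arrays and return (a, best rate in MB/s).

    Assumption: each element is counted as an 8-byte double, as in the
    usual STREAM accounting, even though Python lists are not stored that way.
    """
    a = [0.0] * n
    b = [2.0] * n
    c = [1.0] * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    bytes_moved = 3 * 8 * n  # read b, read c, write a
    return a, bytes_moved / best / 1e6


if __name__ == "__main__":
    _, rate = triad_mb_per_s(200_000)
    print(f"Triad: {rate:.1f} MB/s (interpreter-bound, not a hardware figure)")
```

The best time over several repeats is used rather than the average, so that one-off disturbances do not drag the reported rate down.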

McCalpin

Nov 15, 2002, 10:36:38 PM

In article <3DD56C37...@brussels.sgi.com>,

Alexis Cousein <a...@brussels.sgi.com> wrote:
>
>No. Note John made sure he quoted the 4 cpu numbers. Oh, and the 128 cpu
>per rack model actually doesn't have the same STREAM numbers per CPU, but
>that's just a slight omission.

The 4 cpu node gives 64p per rack and the 8p node gives 128p per
rack. Either way you get about 250 GB/s in a fully configured rack.
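
The arithmetic behind that statement can be sketched as follows, taking only the figures given above (64 or 128 processors per rack, "about 250 GB/s" per rack) as input. The helper name `per_rack_breakdown` is hypothetical, and the per-node and per-CPU values are simply the rack figure divided out, not published measurements.

```python
RACK_BANDWIDTH_GB_S = 250.0  # "about 250 GB/s" for a fully configured rack


def per_rack_breakdown(cpus_per_node, cpus_per_rack,
                       rack_gb_s=RACK_BANDWIDTH_GB_S):
    """Divide an aggregate rack bandwidth evenly over nodes and CPUs."""
    nodes = cpus_per_rack // cpus_per_node
    return {
        "nodes": nodes,
        "gb_s_per_node": rack_gb_s / nodes,
        "gb_s_per_cpu": rack_gb_s / cpus_per_rack,
    }


if __name__ == "__main__":
    for cpus_per_node, cpus_per_rack in ((4, 64), (8, 128)):
        print(cpus_per_node, per_rack_breakdown(cpus_per_node, cpus_per_rack))
```

Both configurations come out at 16 nodes per rack and the same bandwidth per node, so the per-CPU share in the 8-way node is half that of the 4-way node, which is the per-CPU difference raised in the quoted post.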

McCalpin

Nov 15, 2002, 10:40:47 PM

In article <TIiB9.47694$1O2.4525@sccrnsc04>,

Robert Myers <rmyer...@attbi.com> wrote:
>The high per-CPU bandwidth is probably
>related to the fact that each of the quad-processor units is NUMA (and
>if it's not, someone will surely point it out!).

There is no NUMA here -- this is a cluster of (physically) independent
4-way and/or 8-way SMPs stuffed into the same rack. They will typically
be managed by IBM's cluster management software as a single logical
entity, but from the user point of view they are separate SMPs.

McCalpin

Nov 15, 2002, 10:43:12 PM

In article <ar3qdh$39g$1...@pegasus.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <bLcB9.42987$1O2.4473@sccrnsc04>,
>Robert Myers <rmyer...@attbi.com> wrote:
>>Sander Vesik wrote:
>>>
>>> Yeah, but it looks like it is in fact "a cluster in a rack" made up from
>>> 4-way p655 nodes and not a single address space machine?
>>>
>>And the downside is...?
>>
>>Unless you're using raw iron, the OS makes NUMA transparent, no?
>
>No. It is a distributed memory machine. Each unit of 4 CPUs is a
>(probably slightly NUMA) system in its own right.

Memory is interleaved at 512 Byte granularity around the four memory
controllers, so this is a "flat" SMP according to the most common
terminology. Every fourth 512 Byte region has a memory and L3 cache
latency that is a few cycles faster than the other three, but this
is not detectable in practice.
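
The interleaving described above can be sketched as a simple address-to-controller mapping. This assumes plain round-robin on the address with no hashing, which the post does not spell out; the real POWER4 mapping may differ, and the names `INTERLEAVE_BYTES`, `NUM_CONTROLLERS` and `controller_for` are invented for this illustration.

```python
from collections import Counter

INTERLEAVE_BYTES = 512  # interleave granularity stated in the post
NUM_CONTROLLERS = 4     # four memory controllers in the node


def controller_for(addr):
    """Memory controller owning a byte address (assumed round-robin)."""
    return (addr // INTERLEAVE_BYTES) % NUM_CONTROLLERS


def sweep_histogram(start, length, stride=8):
    """Count how many stride-sized accesses in a sweep hit each controller."""
    return Counter(controller_for(a) for a in range(start, start + length, stride))


if __name__ == "__main__":
    # A sequential sweep over 4 KB of 8-byte words touches all four controllers.
    print(sorted(sweep_histogram(0, 4096).items()))
```

Under this assumption any long sequential sweep spreads its accesses evenly over the four controllers, and every fourth 512-byte block belongs to the same controller, which matches the "every fourth 512 Byte region" remark.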

Sander Vesik

Nov 16, 2002, 3:42:05 PM

I don't think that there is such a version of AIX or Linux that
would do this to a rack full of 4p nodes.

>
> How else do you get arbitrarily high memory bandwidths?
>

This assumes that whatever you are doing works nicely in a cluster.

Sander Vesik

Nov 16, 2002, 4:00:06 PM

McCalpin <mcca...@gmp246.austin.ibm.com> wrote:
> In article <10373894...@haldjas.folklore.ee>,
> Sander Vesik <san...@haldjas.folklore.ee> wrote:
>>Del Cecchi <cec...@signa.rchland.ibm.com> wrote:
>>> I just saw a press release for the new P655 which has up to 128
>>> power4 processors in a single frame. Just the thing to power your
>>> home network :-)
>
>>> It gave some performance information also, even including streams.
>>
>>Yeah, but it looks like it is in fact "a cluster in a rack" made up from
>>4-way p655 nodes and not a single address space machine?
>
> Yes, it is definitely a "cluster in rack", not a single address
> space system.

Right, quite a few press-digested versions implied this was not the case.

>
> IBM has sold billions of dollars of such systems under the "IBM
> SP" branding, to both technical and commercial customers. The
> last time I checked, such clusters made up something like 50% of
> the global supercomputing market (by revenue), and made up
> something like 30% of the (much larger) technical mid-range
> market ($250k-$1M system price band).

I wasn't implying there was something wrong with clusters. Just several
interpretations (really also Del's "up to 128 processors in a frame")
seemed to imply it was a shared memory machine.

del cecchi

Nov 16, 2002, 4:33:07 PM

"Sander Vesik" <san...@haldjas.folklore.ee> wrote in message
news:10374804...@haldjas.folklore.ee...

It is up to 128 processors in a frame. But I wasn't trying to mislead
anyone.
The issue of whether in practice a large cluster is superior or inferior
when compared to a similarly large NUMA seems to be open.

del cecchi

Rupert Pigott

Nov 16, 2002, 5:11:21 PM

"del cecchi" <dce...@msn.com> wrote in message
news:gRyB9.140$%c2....@eagle.america.net...
[SNIP]

> It is up to 128 processors in a frame. But I wasn't trying to mislead
> anyone.
> The issue of whether in practice a large cluster is superior or inferior
> when compared to a similarly large NUMA seems to be open.

Depends on the decade you are in as far as I can tell. :)

Cheers,
Rupert

del cecchi

Nov 16, 2002, 7:48:39 PM

"Rupert Pigott" <dark...@hotmail.com> wrote in message
news:ar6fqp$3k9$1$8300...@news.demon.co.uk...

OK, I'll bite. At the current time, what is the concensus of cluster
vrs ccNUMA at the 128 processor level? What are the advantages and
disadvantages of each? (This is not a homework problem, so watch the
funny answers). Are NUMA machines really used much as one large machine
or are they mostly partitioned and look like a cluster anyway?

Curious circuit designers want to know.

del cecchi
>

Robert Myers

Nov 16, 2002, 7:52:49 PM

McCalpin wrote:

>
> Memory is interleaved at 512 Byte granularity around the four memory
> controllers, so this is a "flat" SMP according to the most common
> terminology. Every fourth 512 Byte region has a memory and L3 cache
> latency that is a few cycles faster than the other three, but this is
> not detectable in practice.

At the risk of being flamed, let me offer a quote from an IBM RedBook,
"Understanding IBM pSeries e-server Performance and Sizing":

"On a typical implementation, it will take one cycle to access data from
L1 if there is a cache hit in L1. It will take between seven to 10
cycles to access data from L2 in case of a cache miss in L1 and a cache
hit in L2. It will take between 20 to 50 cycles to get data from memory
in case of a cache miss in L2."

The same document offers the caveat that

"Detailed values are hardware dependent. These numbers should only be
used as guidelines."

I am assuming that the very broad 20-50 cycle delay on an L2 cache miss
depends on whether the data are to be found in L3 or not. Since a
four-way Power-4 uses four different L2 caches, I am assuming that the
difference between a seven cycle delay and a ten cycle delay on an L1
miss is whether or not the data are in the L2 cache that belongs to a
particular processor or not. That is, the delay on an L1 cache miss
depends on whether or not the processor has to use the on-chip bus that
connects the four CPUs to access the L2 of another processor. One
suspects that contention for that bus could become a serious problem in
some circumstances.

None of this, I suspect, is relevant to the STREAM benchmark you quoted
because I suspect that you turned off the L2 and L3 caches for purposes
of the benchmark. Or not?

The same document suggests that it is possible for a processor to bypass
the cache hierarchy altogether:

"Large IBM SMP systems are therefore designed in a different way. There
is still a mechanism for the snooping activity and the addressing, but
another component has been added for data transfers. That component is a
switch. The switch allows point-to-point connections between a
processor and another processor or between a processor and the memory.
It also allows several simultaneous transfers."

Figure 29 of that document, "Using a switch for data transfers," draws a
picture that suggests that for such transfers, each processor is equally
"close" to memory and thus that the machine is, as you said, definitely
not NUMA.

After this long setup, I have two questions:

1. Is the switch/cache-hierarchy mechanism transparent to the user? If
not, how is it controlled?

2. If IBM can design a switch to so transparently turn what would
otherwise be a NUMA machine into a UMA machine for four processors or
eight, why can't it do so for 16, 32, or whatever?

Robert Myers

Nov 16, 2002, 8:04:59 PM

del cecchi wrote:

> "Rupert Pigott" wrote in message
> news:ar6fqp$3k9$1$8300...@news.demon.co.uk...
>
> >"del cecchi" wrote in message

> >news:gRyB9.140$%c2....@eagle.america.net...
> >[SNIP]
> >
> >
> >>It is up to 128 processors in a frame. But I wasn't trying to
>
> mislead
>
> >>anyone.
> >>The issue of whether in practice a large cluster is superior or
>
> inferior
>
> >>when compared to a similarly large NUMA seems to be open.
> >
> >Depends on the decade you are in as far as I can tell. :)
> >
> >Cheers,
> >Rupert
> >
>
> OK, I'll bite. At the current time, what is the concensus of cluster
> vrs ccNUMA at the 128 processor level? What are the advantages and
> disadvantages of each? (This is not a homework problem, so watch the
> funny answers). Are NUMA machines really used much as one large machine
> or are they mostly partitioned and look like a cluster anyway?
>
> Curious circuit designers want to know.
>
> del cecchi
>

If, as Aaron Spink seemed to suggest in one of the recent threads on
Infiniband, you can include the memory accessed via RDMA over Infiniband
into the coherency domain of the processor making the request and of the
processor thus being accessed, the distinction shouldn't last long
anyway! No wonder IBM is an enthusiastic backer of Infiniband.

Greg Lindahl

Nov 16, 2002, 8:33:18 PM

In article <jIBB9.148$%c2....@eagle.america.net>,
del cecchi <dce...@msn.com> wrote:

>OK, I'll bite. At the current time, what is the concensus of cluster
>vrs ccNUMA at the 128 processor level?

We can't even agree on what the word "consensus" means, much less have
one.

> (This is not a homework problem, so watch the funny answers).

Hey, if we didn't laugh about it, we'd cry.

> Are NUMA machines really used much as one large machine
> or are they mostly partitioned and look like a cluster anyway?

Most places seem to buy capability machines and then use them as
capacity machines.

-- greg

Colin Andrew Percival

Nov 16, 2002, 8:46:14 PM

Except for universities, which tend to build capacity machines, run Linpack
on them, issue a press release stating that they are #xxx on the "Top 500"
list, and then watch the machines go completely unused because nobody has
the necessary combination of large problems and programming knowledge.

Colin Percival

Bill Todd

Nov 16, 2002, 9:21:43 PM

"del cecchi" <dce...@msn.com> wrote in message

news:jIBB9.148$%c2....@eagle.america.net...

...

> OK, I'll bite. At the current time, what is the concensus of cluster
> vrs ccNUMA at the 128 processor level? What are the advantages and
> disadvantages of each? (This is not a homework problem, so watch the
> funny answers). Are NUMA machines really used much as one large machine
> or are they mostly partitioned and look like a cluster anyway?

Cluster approach advantage: If it does the job, it's almost certainly
cheaper than an equivalently-partitioned NUMA box.

Great Honking ccNUMA Box advantage: It's a lot more flexible than an
equivalent number of processors broken up over separate nodes: you can
start with a smaller SMP-style application and have far more headroom before
you need to rewrite it to partition it across a cluster; you can consolidate
a bunch of such applications in partitions and then adjust the resources
associated with the partitions as relative loads change; you can even
accommodate varying loads dynamically (IIRC reassigning a processor from one
partition to another takes well under 100 microseconds on Alpha boxes) -
which may allow you to make better average use of the total resource and
thus reduce costs to a level more like those of clusters (one of the main
arguments for advanced storage virtualization that would seem equally
applicable here).

So while there may not be all that many single applications that require a
128-processor ccNUMA box to run on, they can be useful for many other
endeavors by offering partitioning flexibility that fixed-node-size clusters
do not.

(Yes, all that's pretty obvious, but what the hell: you did ask.)

- bill

Greg Lindahl

Nov 16, 2002, 9:46:56 PM

In article <ar6sd6$4kj$1...@morgoth.sfu.ca>,

Colin Andrew Percival <cper...@sfu.ca> wrote:

>> Most places seem to buy capability machines and then use them as
>> capacity machines.
>
>Except for universities, which tend to build capacity machines, run Linpack
>on them, issue a press release stating that they are #xxx on the "Top 500"
>list, and then watch the machines go completely unused because nobody has
>the necessary combination of large problems and programming knowledge.

You seem to have reversed capability and capacity. Most universities
have lots of researchers who want to run serial programs; these get
along quite well with a capacity machine, with no reprogramming, and
even for small problems.

greg

Colin Andrew Percival

Nov 16, 2002, 10:16:36 PM

Greg Lindahl <lin...@pbm.com> wrote:
> You seem to have reversed capability and capacity. Most universities
> have lots of researchers who want to run serial programs; these get
> along quite well with a capacity machine, with no reprogramming, and
> even for small problems.

In the cases I was thinking of, people were discouraged from using
individual processors by themselves; there was an attitude of "if you're
going to use the cluster at all, make sure you're doing something which you
can't do on your desktop PC" -- ie, non-serial programs.

Colin Percival

Nick Maclaren

Nov 17, 2002, 7:24:16 AM

In article <3dd6f17d$1...@news.meer.net>, Greg Lindahl <lin...@pbm.com> wrote:
>In article <jIBB9.148$%c2....@eagle.america.net>,
>del cecchi <dce...@msn.com> wrote:
>
>>OK, I'll bite. At the current time, what is the concensus of cluster
>>vrs ccNUMA at the 128 processor level?
>
>We can't even agree on what the word "consensus" means, much less have
>one.

Very true. Actually, there are three main camps, anyway: ccNUMA,
high-performance interconnect clusters and low-performance interconnect
ones. For example, we can use either of the first two effectively but
not the third. The people who can use the third are well advised to
do so, as you can get 70% of the bang for as little as 30% of the
buck.

While it is easier and cheaper to build a high-performance interconnect
cluster than a ccNUMA one for 128 CPUs, it is neither easy nor cheap.
It is dead easy to connect 128 fast desktops with cheap Ethernet.

Harking back to this new announcement, a P655 box is c. 30% heavier
than the P690 and probably considerably more power hungry, but
delivers 3-4 times as much CPU performance, (as I understand it) a
mind-blowing 7.9 times as much memory performance and 16 times as
much interconnect bandwidth (or 4 times as much as 8-CPU LPARs).
This makes it a MUCH better solution for applications that can use
either ccNUMA or a high-performance interconnect cluster.

Robert Myers

Nov 17, 2002, 8:53:37 AM

Nick Maclaren wrote:

> In article <3dd6f17d$1...@news.meer.net>, Greg Lindahl wrote:
>
> >In article ,

> >del cecchi wrote:
> >
> >
> >>OK, I'll bite. At the current time, what is the concensus of cluster
> >>vrs ccNUMA at the 128 processor level?
> >
> >We can't even agree on what the word "consensus" means, much less have
> >one.
>
>
> Very true. Actually, there are three main camps, anyway: ccNUMA,
> high-performance interconnect clusters and low-performance interconnect
> ones. For example, we can use either of the first two effectively but
> not the third.

Presumably you can't use the third because CPU's would sit around too
much waiting for something to arrive so they can do something with it.
The idle time is so high that it takes away any cost advantage of the
low-performance interconnect cluster, even if you can afford to be patient.

> The people who can use the third are well advised to
> do so, as you can get 70% of the bang for as little as 30% of the
> buck.

If the nodes are there anyway, with CPU usage hovering near zero because
that's the way it is these days, your 30% figure seems a little high
8^}. If you can afford to be patient, and the computers are sitting
there nearly idle otherwise, computing on a low performance interconnect
cluster is practically free.

Nick Maclaren

Nov 17, 2002, 10:33:42 AM

In article <BbNB9.57787$NH2.3552@sccrnsc01>,

Robert Myers <rmyer...@attbi.com> wrote:
>>
>> Very true. Actually, there are three main camps, anyway: ccNUMA,
>> high-performance interconnect clusters and low-performance interconnect
>> ones. For example, we can use either of the first two effectively but
>> not the third.
>
>Presumably you can't use the third because CPU's would sit around too
>much waiting for something to arrive so they can do something with it.
>The idle time is so high that it takes away any cost advantage of the
>low-performance interconnect cluster, even if you can afford to be patient.

Yes, precisely. People who run a large number of semi-independent
calculations (e.g. quite a lot of Monte-Carlo work) have no such
problem.
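
As a rough illustration of why such work tolerates a slow interconnect, here is a minimal sketch of a Monte-Carlo estimate split into independent chunks. The function names, seeds, and sample counts are all hypothetical; the point is only that each chunk needs no communication until it hands back a single integer.

```python
import random


def estimate_pi_chunk(seed: int, samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between chunks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def combine(hit_counts, samples_per_chunk: int) -> float:
    """Merge the per-chunk counts into one estimate of pi."""
    total_samples = samples_per_chunk * len(hit_counts)
    return 4.0 * sum(hit_counts) / total_samples


SAMPLES_PER_CHUNK = 20_000  # assumed chunk size, chosen only for illustration

# Each call below could run on a different node in any order;
# the only traffic is one integer per chunk at the very end.
counts = [estimate_pi_chunk(seed, SAMPLES_PER_CHUNK) for seed in range(8)]
pi_estimate = combine(counts, SAMPLES_PER_CHUNK)
print(pi_estimate)
```

Because the chunks never talk to each other, a lost or slow node only delays its own chunk, which can simply be rerun elsewhere.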

M. Ranjit Mathews

Nov 17, 2002, 10:47:20 AM

It is transparent to the same extent that the cache in a Dell PC is
transparent.

> 2. If IBM can design a switch to so transparently turn what would
> otherwise be a NUMA machine into a UMA machine for four processors or
> eight, why can't it do so for 16, 32, or whatever?

To go beyond 8 processors, IBM does that with a RIO switch - up to 32
processors in a Regatta (p690).

Peter Boyle

Nov 17, 2002, 1:01:15 PM

On Sun, 17 Nov 2002, Robert Myers wrote:

> > do so, as you can get 70% of the bang for as little as 30% of the
> > buck.
>
> If the nodes are there anyway, with CPU usage hovering near zero because
> that's the way it is these days, your 30% figure seems a little high
> 8^}. If you can afford to be patient, and the computers are sitting
> there nearly idle otherwise, computing on a low performance interconnect
> cluster is practically free.

You forget the total cost of non-ownership. ;)
SETI style takes some doing.
Peter Boyle

Robert Myers

Nov 17, 2002, 2:18:57 PM

Peter Boyle wrote:

>
> On Sun, 17 Nov 2002, Robert Myers wrote:
>
>
> >If you can afford to be patient, and the computers are sitting

> >there nearly idle otherwise, computing on a low performance cluster
> >interconnect is practically free.

>
> You forget the total cost of non-ownership. ;)
> SETI style takes some doing.
> Peter Boyle

I wasn't really thinking of SETI-style computing: I was thinking of the
very common case of a business that already owns a dozen or more
computers that are already connected by a LAN and that sit around taking
up floor space, drawing current, and depreciating merrily away whether
they do anything useful or not.

The more general case is enticing, but, as you imply, maybe not even
worth it.

What is more frustrating is the almost complete absence of tools for
even the easy case. Applications to exploit the situation are virtually
non-existent, and tools to create the applications are so primitive that
the hardware costs become negligible compared to the labor costs.

That's part of why I think Intel's hypethreading is a real contribution
to computing even if it turns out to be mostly just hyperhype. AFAICT,
there is no production compiler that would even *try* to multi-thread an
application, and I expect the appearance of hyperthreading to change that.

While such a compiler would be of no use for the low-cost interconnect
situation, it is but a short step from such a compiler to tools that would.

Nick Maclaren

Nov 17, 2002, 3:05:41 PM

In article <BYRB9.68950$1O2.4084@sccrnsc04>,
Robert Myers <rmyer...@attbi.com> wrote:

>Peter Boyle wrote:
>
>I wasn't really thinking of SETI-style computing: I was thinking of the
>very common case of a business that already owns a dozen or more
>computers that are already connected by a LAN and that sit around taking
>up floor space, drawing current, and depreciating merrily away whether
>they do anything useful or not.

It costs more to set them up as a cluster than you might think. You
have to coordinate their configuration, and handle the cases of when
they don't run a job, and so on. Even if they are all administered
by a central team (which, inter alia, means that they don't run any
system like Windows 2000), it is a pain.

del cecchi

Nov 17, 2002, 7:35:29 PM

"M. Ranjit Mathews" <ranjit_...@yahoo.com> wrote in message
news:3DD7ADC2...@yahoo.com...

Communication via RIO is not ccNUMA. There are 8 processors and 4
chips/MCM. I thought that a P690 could have 4 MCMs on a board for a 32
way. Is my memory failing me? I am not going looking over a 53K line.

del cecchi

del cecchi

Nov 17, 2002, 7:40:48 PM

"Robert Myers" <rmyer...@attbi.com> wrote in message
news:BYRB9.68950$1O2.4084@sccrnsc04...

> Peter Boyle wrote:
>
> >
> > On Sun, 17 Nov 2002, Robert Myers wrote:
> >
> >
> > >If you can afford to be patient, and the computers are sitting
> > >there nearly idle otherwise, computing on a low performance cluster
> > >interconnect is practically free.
> >
> > You forget the total cost of non-ownership. ;)
> > SETI style takes some doing.
> > Peter Boyle
>
> I wasn't really thinking of SETI-style computing: I was thinking of the
> very common case of a business that already owns a dozen or more
> computers that are already connected by a LAN and that sit around taking
> up floor space, drawing current, and depreciating merrily away whether
> they do anything useful or not.

In that case, several IBM labs have "computers" that might even make the
top 500 made up of all the workstations in the lab, connected by
ethernet/token ring, and used to run logic simulation test cases and
circuit simulation.

Wow. Publicity here we come. :-)

Colin Andrew Percival

Nov 17, 2002, 8:05:27 PM

del cecchi <dce...@msn.com> wrote:
> In that case, several IBM labs have "computers" that might even make the
> top 500 made up of all the workstations in the lab, connected by
> ethernet/token ring, and used to run logic simulation test cases and
> circuit simulation.

Yes, but can they solve dense systems of linear equations quickly? When you
throw lots of workstations together, it's easy to get a MTBF which is lower
than the time required to run the benchmark...
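
A back-of-the-envelope sketch of that concern, assuming node failures are independent and occur at a constant rate (so the rates simply add), and that any single failure kills the run. The node MTBF, node count, and run length below are hypothetical numbers picked for illustration, not measurements.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole cluster when any one node failure stops the job."""
    return node_mtbf_hours / nodes


def completion_probability(run_hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Chance that a run of the given length sees no failure at all."""
    return math.exp(-run_hours / cluster_mtbf_hours(node_mtbf_hours, nodes))


NODE_MTBF_HOURS = 2000.0  # assumed: a desk workstation interrupted every ~12 weeks
NODES = 500               # assumed lab size
RUN_HOURS = 8.0           # assumed benchmark duration

mtbf = cluster_mtbf_hours(NODE_MTBF_HOURS, NODES)
p_done = completion_probability(RUN_HOURS, NODE_MTBF_HOURS, NODES)
print(f"cluster MTBF: {mtbf:.1f} h, chance of finishing: {p_done:.3f}")
```

With these assumed figures the cluster MTBF is 4 hours, half the run length, and the chance of an uninterrupted run is exp(-2), or about 0.135.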

Colin Percival

Robert Myers

Nov 17, 2002, 9:42:45 PM

Colin Andrew Percival wrote:

There are embarrassingly parallel methods for obtaining arbitrarily
accurate approximations to the solutions of dense linear systems that
would be useful in many situations that arise in actual practice and
that would be very robust with respect to the failure of one or more nodes.

McCalpin

Nov 17, 2002, 9:48:43 PM

In article <3dd6f17d$1...@news.meer.net>, Greg Lindahl <lin...@pbm.com> wrote:
>Most places seem to buy capability machines and then use them as
>capacity machines.

At the risk of being flame-broiled by my fellow attendees at
SuperComputing'2002, I suggest that the most common scenario is
to buy a capacity system while pretending that it is a capability
system, then using it as a capacity system.

McCalpin

Nov 17, 2002, 10:08:10 PM

In article <ALBB9.60043$QZ.9877@sccrnsc02>,

Robert Myers <rmyer...@attbi.com> wrote:
>McCalpin wrote:
>> Memory is interleaved at 512 Byte granularity around the four memory
>> controllers, so this is a "flat" SMP according to the most common
>> terminology. Every fourth 512 Byte region has a memory and L3 cache
>> latency that is a few cycles faster than the other three, but this is
>> not detectable in practice.
>
>At the risk of being flamed, let me offer a quote from an IBM RedBook,
>"Understanding IBM pSeries e-server Performance and Sizing":
>
>"On a typical implementation, it will take one cycle to access data from
>L1 if there is a cache hit in L1. It will take between seven to 10
>cycles to access data from L2 in case of a cache miss in L1 and a cache
>hit in L2. It will take between 20 to 50 cycles to get data from memory
>in case of a cache miss in L2."

These numbers are not even close to the correct values for POWER4
systems....

>I am assuming that the very broad 20-50 cycle delay on an L2 cache miss
>depends on whether the data are to be found in L3 or not.

It is probably not wise to try to make even broad generalizations
from these numbers, since they correspond to older generations of
IBM pSeries systems.

It has been many years since IBM made a system with 20-50 cycle memory
latency....

>None of this, I suspect, is relevant to the STREAM benchmark you quoted
>because I suspect that you turned off the L2 and L3 caches for purposes
>of the benchmark. Or not?

Absolutely not. The published STREAM benchmark numbers are for
ordinary, supported configurations. The "tuned" numbers use a
PowerPC cache op to avoid reading the store target into the L3
or L2 caches, but the system is run in a normal, supported state.

>The same document suggests that it is possible for a processor to bypass
>the cache hierarchy altogether:
>
>"Large IBM SMP systems are therefore designed in a different way. There
>is still a mechanism for the snooping activity and the addressing, but
>another component has been added for data transfers. That component is a
>switch. The switch allows point-to-point connections between a
>processor and another processor or between a processor and the memory.
>It also allows several simultaneous transfers."

That is a very odd paragraph.

IBM's SP switch is I/O-based. It does not exactly bypass the cache
structure, though I/O does not interact with the caches in exactly the
same way as the cpus do.

Using the switch is more like accessing a disk than it is like accessing
memory -- it requires explicit communications calls (usually via the
MPI or LAPI libraries).

>1. Is the switch/cache-hierarchy mecanism transparent to the user? If
>not, how is it controlled?

In general, using the switch is visible to the user, and usually
requires explicit coding.

>2. If IBM can design a switch to so transparently turn what would
>otherwise be a NUMA machine into a UMA machine for four processors or
>eight, why can't it do so for 16, 32, or whatever?

You are welcome to port a software Distributed Shared Memory system
(e.g. TreadMarks) to the POWER4 platform (maybe even under Linux).
This would allow construction of a NUMA system with reasonable
transparency for user-mode applications -- as long as you did not
time them.....

Big IBM clusters are reasonably UMA between nodes for message-
passing, but this should not be confused with shared memory.

Robert Myers

Nov 17, 2002, 11:42:37 PM

McCalpin wrote:

> In article ,

> Robert Myers wrote:
>
> >McCalpin wrote:
> >
> >>Memory is interleaved at 512 Byte granularity around the four memory
> >>controllers, so this is a "flat" SMP according to the most common
> >>terminology. Every fourth 512 Byte region has a memory and L3 cache
> >>latency that is a few cycles faster than the other three, but this is
> >>not detectable in practice.
> >
> >At the risk of being flamed, let me offer a quote from an IBM RedBook,
> >"Understanding IBM pSeries e-server Performance and Sizing":
> >

<snip>

>
>
> >I am assuming that the very broad 20-50 cycle delay on an L2 cache miss
> >depends on whether the data are to be found in L3 or not.
>
>
> It is probably not wise to try to make even broad generalizations
> from these numbers, since they correspond to older generations of
> IBM pSeries systems.
>
> It has been many years since IBM made a system with 20-50 cycle memory
> latency....

I was always impressed by the volume of literature that IBM made
available to mere mortals like me, and now you tell me it is not to be
trusted? 8^}. I will be more careful in the future.

>
> >None of this, I suspect, is relevant to the STREAM benchmark you quoted
> >because I suspect that you turned off the L2 and L3 caches for purposes
> >of the benchmark. Or not?
>
>
> Absolutely not. The published STREAM benchmark numbers are for
> ordinary, supported configurations. The "tuned" numbers use a
> PowerPC cache op to avoid reading the store target into the L3
> or L2 caches, but the system is run in a normal, supported state.
>

If you read any kind of accusation into my question, there was none
intended. Your very precise answer provided me with the exact
information I was looking for, though: You can bypass the cache
hierarchy if you are willing to code in assembly language, while still
having the L2 and L3 caches available where they would be useful.

> >The same document suggests that it is possible for a processor to bypass
> >the cache hierarchy altogether:
> >
> >"Large IBM SMP systems are therefore designed in a different way. There
> >is still a mechanism for the snooping activity and the addressing, but
> >another component has been added for data transfers. That component is a
> >switch. The switch allows point-to-point connections between a
> >processor and another processor or between a processor and the memory.
> >It also allows several simultaneous transfers."
>
>
> That is a very odd paragraph.
>
> IBM's SP switch is I/O-based. It does not exactly bypass the cache
> structure, though I/O does not interact with the caches in exactly the
> same way as the cpus do.
>
> Using the switch is more like accessing a disk than it is like accessing
> memory -- it requires explicit communications calls (usually via the
> MPI or LAPI libraries).
>
>
> >1. Is the switch/cache-hierarchy mecanism transparent to the user? If
> >not, how is it controlled?
>
>
> In general, using the switch is visible to the user, and usually
> requires explicit coding.
>

Erf. Just as you feared when you pondered whether you should answer my
post at all, your answer only raises more questions, this time referring
to an IBM Technical White Paper: "IBM e-Server POWER4 System
Microarchitecture" by Joel M. Tendler, Steve Dodson, Steve Fields, Hung
Le, and Balaram Sinharoy from the IBM Server Group and dated October
2001, so it's at most one year out of date. Figure 1 of that document
shows a dual-core unit with a "CIU" switch between the two processor
cores and three L2 caches. The "Core Interface Unit" shown in that
diagram is pretty obviously not the SP switch you refer to. I am
guessing that "SP" stands for "Service Processor", and that its
functionality is accessed by the "SP Controller" that just floats on the
logical P4 die in the diagram, not really connected to anything else on
the figurative die, but that data accessed that way flow through memory
in the same way that normal I/O would. It should be fairly easy to
straighten this confusion out, if there is any.

>
>
> >2. If IBM can design a switch to so transparently turn what would
> >otherwise be a NUMA machine into a UMA machine for four processors or
> >eight, why can't it do so for 16, 32, or whatever?
>
>
> You are welcome to port a software Distributed Shared Memory system
> (e.g. TreadMarks) to the POWER4 platform (maybe even under Linux).
> This would allow construction of a NUMA system with reasonable
> transparency for user-mode applications -- as long as you did not
> time them.....
>
> Big IBM clusters are reasonably UMA between nodes for message-
> passing, but this should not be confused with shared memory.

The last confusion is one I was never in danger of. The confusion of my
post, which I thank you for straightening out, was the result of roughly
the equivalent of consulting a document on a P3-based Celeron to obtain
information on a P4-based Celeron, and I thank you for being gentle
about the confusion.

Alexis Cousein

Nov 18, 2002, 4:38:19 AM

to McCalpin

McCalpin wrote:

> In article <3DD56C37...@brussels.sgi.com>,
> Alexis Cousein wrote:
>
> >No. Note John made sure he quoted the 4 cpu numbers. Oh, and the 128 cpu
> >per rack model actually doesn't have the same STREAM numbers per CPU, but
> >that's just a slight omission.
>
>
> The 4 cpu node gives 64p per rack and the 8p node gives 128p per
> rack. Either way you get about 250 GB/s in a fully configured rack.

Precisely. But half the STREAM/CPU.
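
The per-CPU arithmetic behind that remark, as a small sketch. The 250 GB/s rack figure and the 64 and 128 CPU counts come from the quoted post; the rest is plain division, and the function name is made up for the example.

```python
RACK_BANDWIDTH_GBS = 250.0  # "about 250 GB/s in a fully configured rack"


def stream_per_cpu(rack_bandwidth_gbs: float, cpus_per_rack: int) -> float:
    """Aggregate STREAM bandwidth divided evenly over the CPUs in the rack."""
    return rack_bandwidth_gbs / cpus_per_rack


per_cpu_4way = stream_per_cpu(RACK_BANDWIDTH_GBS, 64)    # racks built from 4-CPU nodes
per_cpu_8way = stream_per_cpu(RACK_BANDWIDTH_GBS, 128)   # racks built from 8-CPU nodes

print(f"4-CPU nodes: {per_cpu_4way:.2f} GB/s per CPU")
print(f"8-CPU nodes: {per_cpu_8way:.2f} GB/s per CPU")
```

Same rack total, twice the CPUs, so each CPU in the 128-CPU rack sees half the bandwidth.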

Sander Vesik

Nov 18, 2002, 6:41:46 AM

Yes, but the Linpack benchmark (which is used for the top 500 list)
doesn't actually allow such code modifications.

--
Sander

+++ Out of cheese error +++

Nick Maclaren

Nov 18, 2002, 6:48:05 AM

In article <10376197...@haldjas.folklore.ee>,

Sander Vesik <san...@haldjas.folklore.ee> writes:
|>
|> Yes, but the Linpack benchmark (which is used for the top 500 list)
|> doesn't actually allow such code modifications.

Actually, it does. The 100x100 doesn't, but that is a joke for
such systems. The NxN merely requires you to solve the same
problem. Also, a lot of the figures used in the Top 500 list
are estimated and not calculated.

McCalpin

Nov 18, 2002, 8:36:41 AM

In article <1d_B9.71723$nB.5200@sccrnsc03>,

Robert Myers <rmyer...@attbi.com> wrote:
>McCalpin wrote:
>
>> In article ,
>> Robert Myers wrote:
>>
>> >I am assuming that the very broad 20-50 cycle delay on an L2 cache miss
>> >depends on whether the data are to be found in L3 or not.
>>
>> It is probably not wise to try to make even broad generalizations
>> from these numbers, since they correspond to older generations of
>> IBM pSeries systems.
>>
>> It has been many years since IBM made a system with 20-50 cycle memory
>> latency....
>
>I was always impressed by the volume of literature that IBM made
>available to mere mortals like me, and now you tell me it is not to be
>trusted? 8^}. I will be more careful in the future.

There is a strong trend in the industry for CPU frequency to increase
at a much faster rate than memory latency. Any point measurement will
become obsolete relatively quickly, so it is important to explain the
trends.
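
A small sketch of why a latency quoted in cycles ages so quickly. The 150 ns memory latency and the two clock rates are hypothetical values chosen for illustration; they are not figures for any particular IBM system.

```python
def latency_in_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a memory latency in nanoseconds into CPU cycles at a given clock."""
    return latency_ns * clock_mhz / 1000.0


MEMORY_LATENCY_NS = 150.0  # assumed, and held fixed to isolate the clock effect

old_cycles = latency_in_cycles(MEMORY_LATENCY_NS, 100.0)    # assumed older 100 MHz part
new_cycles = latency_in_cycles(MEMORY_LATENCY_NS, 1300.0)   # assumed 1.3 GHz part

print(f"100 MHz: {old_cycles:.0f} cycles, 1300 MHz: {new_cycles:.0f} cycles")
```

Even with the memory itself unchanged, the same access costs 15 cycles at the slower clock and 195 at the faster one, so a "20 to 50 cycles" figure only describes the generation it was measured on.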

>> >None of this, I suspect, is relevant to the STREAM benchmark you quoted
>> >because I suspect that you turned off the L2 and L3 caches for purposes
>> >of the benchmark. Or not?
>>
>> Absolutely not. The published STREAM benchmark numbers are for
>> ordinary, supported configurations. The "tuned" numbers use a
>> PowerPC cache op to avoid reading the store target into the L3
>> or L2 caches, but the system is run in a normal, supported state.
>>
>If you read any kind of accusation into my question, there was none
>intended. Your very precise answer provided me with the exact
>information I was looking for, though: You can bypass the cache
>hierarchy if you are willing to code in assembly language, while still
>having the L2 and L3 caches available where they would be useful.

This is mostly incorrect.

There are really only a few cases where data bypasses a cache in
POWER4 systems. The first is a store that misses the L1 --- it
gets gathered by the store buffers and dumped directly into the L2,
and is not allocated into the L1 cache. The second is a special
case that allows L2 castouts to bypass the L3 cache and go directly
to memory. This is not under direct user control.

For inbound data, caching depends on the type of access. Loads
come through the L3, into the L2, then into the L1, then into the
registers. Stores that miss in the caches bring data through the
L3 (but don't leave a copy there) and into the L2, bypass the L1,
and put the cache line into the store buffers so that the store
operation can update a coherent copy of the data. My "tuned"
STREAM benchmark uses the PowerPC DCBZ instruction (accessed by the
*!IBM CACHE_ZERO Fortran compiler directive) to allocate a line in
the L2 (and store buffers), filling it with zeroes and not reading
the original contents from memory.
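
To see why skipping the read of the store target helps the "tuned" numbers, the sketch below compares the bytes that STREAM counts per loop iteration with the bytes an idealized write-allocate memory system would actually move. The four kernels are those of the standard STREAM benchmark; the traffic model (exactly one extra read per stored word, and nothing else) is a simplifying assumption, not a description of POWER4 internals.

```python
WORD_BYTES = 8  # STREAM operates on 8-byte floating-point words

# (loads, stores) per loop iteration for the four standard STREAM kernels
KERNELS = {
    "Copy":  (1, 1),   # a[i] = b[i]
    "Scale": (1, 1),   # a[i] = q * b[i]
    "Add":   (2, 1),   # a[i] = b[i] + c[i]
    "Triad": (2, 1),   # a[i] = b[i] + q * c[i]
}


def bytes_counted(loads: int, stores: int) -> int:
    """Bytes per iteration that STREAM credits to the kernel."""
    return (loads + stores) * WORD_BYTES


def bytes_moved(loads: int, stores: int, read_store_target: bool) -> int:
    """Bytes per iteration crossing the memory interface in this simple model."""
    extra_reads = stores if read_store_target else 0
    return (loads + stores + extra_reads) * WORD_BYTES


ratios = {}
for name, (loads, stores) in KERNELS.items():
    counted = bytes_counted(loads, stores)
    ordinary = bytes_moved(loads, stores, read_store_target=True)
    zeroed = bytes_moved(loads, stores, read_store_target=False)
    ratios[name] = ordinary / counted
    print(f"{name:5s} counted={counted} ordinary={ordinary} with-zeroing={zeroed}")
```

In this model an ordinary store stream moves 1.5 times the counted bytes for Copy and Scale and 4/3 times for Add and Triad, while a line that is zero-allocated instead of read moves exactly the counted bytes.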

>> >The same document suggests that it is possible for a processor to bypass
>> >the cache hierarchy altogether:
>> >
>> >"Large IBM SMP systems are therefore designed in a different way. There
>> >is still a mechanism for the snooping activity and the addressing, but
>> >another component has been added for data transfers. That component is a
>> >switch. The switch allows point-to-point connections between a
>> >processor and another processor or between a processor and the memory.
>> >It also allows several simultaneous transfers."
>>
>> That is a very odd paragraph.

Ooops -- I finally understand what that paragraph is talking about!

The easiest way to describe it is to say that there is a global
broadcast (bus) mechanism for addresses and cache coherency operations,
but a separate, switch-based interconnect for the actual data transfers.
So, for example, on a p690, each of the four POWER4 chips in an MCM
can simultaneously be writing 16 Bytes every other cycle to their
"fabric busses" (with the four chips reading that written data at
the same time).
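
For a sense of scale, the rate "16 Bytes every other cycle" can be turned into a bandwidth once a clock is chosen. The 1.3 GHz figure below is an assumption used only for illustration (the post does not state a frequency, nor which clock the "cycle" refers to), so the resulting numbers are a sketch rather than a specification.

```python
BYTES_PER_TRANSFER = 16     # from the post
CYCLES_PER_TRANSFER = 2     # "every other cycle"
ASSUMED_CLOCK_GHZ = 1.3     # hypothetical clock, not taken from the post
CHIPS_PER_MCM = 4           # four POWER4 chips in an MCM, from the post


def write_bandwidth_gbs(clock_ghz: float) -> float:
    """Per-chip write bandwidth onto the fabric, in GB/s, at the given clock."""
    bytes_per_cycle = BYTES_PER_TRANSFER / CYCLES_PER_TRANSFER
    return bytes_per_cycle * clock_ghz


per_chip = write_bandwidth_gbs(ASSUMED_CLOCK_GHZ)
per_mcm = per_chip * CHIPS_PER_MCM  # all four chips writing at once

print(f"per chip: {per_chip:.1f} GB/s, per MCM: {per_mcm:.1f} GB/s")
```

A single shared data bus would have to serialize those four writers; point-to-point links let them proceed at the same time.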

>> >1. Is the switch/cache-hierarchy mecanism transparent to the user? If
>> >not, how is it controlled?

This "fabric switch" is absolutely transparent to the user.
It is not a way to bypass the caches -- it is a way to handle
data transfers on (electrically) point-to-point links.

Each chip has two sets of ports coming out of its fabric switch:
one set for intra-MCM communication and one for inter-module
communication.

A single-MCM machine like the p655 uses only the intra-MCM fabric.
A machine made of single-chip modules like the p650 uses only the
inter-module fabric.
A multi-MCM machine like the p670/p690 uses both.

The inter-MCM ring used in the p690 could be extended to more
MCMs, providing a larger SMP, but the current product is limited
to four MCMs. (Rings are not very efficient when they get big.)

All of this is transparent to the user, and the existence of these
multiple fabrics can only be deduced from careful low-level performance
measurements.

From the user point of view, these are simply cache-coherent SMPs.

Robert Myers

Nov 18, 2002, 9:14:30 AM

Sander Vesik wrote:

Very odd thing for comp.arch: I was thinking about solving actual
problems rather than about showing off [ as a corporate entity, top
500 and all that, of course 8^} ].

Colin Andrew Percival

Nov 18, 2002, 10:44:12 AM

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> Sander Vesik <san...@haldjas.folklore.ee> writes:
> |>
> |> Yes, but the Linpack benchmark (which is used for the top 500 list)
> |> doesn't actually allow such code modifications.

> Actually, it does. The 100x100 doesn't, but that is a joke for
> such systems. The NxN merely requires you to solve the same
> problem. Also, a lot of the figures used in the Top 500 list
> are estimated and not calculated.

I believe the rules are
1. You can't use an algorithm with running time less than 2/3 n^3 + O(n^2).
2. You must end up with a residue not much larger than partial pivoting
would produce, and
3. You must count the FLOPS as 2/3 n^3, regardless of the number actually
performed.

My numerical analysis background is rather limited, but I think iterative
methods fail under (2); however, something like QR should be accepted
(albeit at a 50% penalty due to the increased operation count).
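
The 50% figure follows from rule 3, as this sketch shows. It uses the leading-order operation counts only: 2/3 n^3 for LU with partial pivoting and 4/3 n^3 for Householder QR on a square matrix. The sustained machine rate and the problem size are hypothetical.

```python
def lu_flops(n: int) -> float:
    """Leading-order flop count for LU with partial pivoting."""
    return (2.0 / 3.0) * n ** 3


def householder_qr_flops(n: int) -> float:
    """Leading-order flop count for Householder QR of an n x n matrix."""
    return (4.0 / 3.0) * n ** 3


def reported_gflops(n: int, seconds: float) -> float:
    """Rate credited by the benchmark: always 2/3 n^3 flops, whatever was executed."""
    return lu_flops(n) / seconds / 1e9


N = 10_000               # assumed problem size
SUSTAINED_FLOPS = 50e9   # assumed rate the machine sustains on either algorithm

lu_seconds = lu_flops(N) / SUSTAINED_FLOPS
qr_seconds = householder_qr_flops(N) / SUSTAINED_FLOPS

lu_reported = reported_gflops(N, lu_seconds)
qr_reported = reported_gflops(N, qr_seconds)
print(f"LU reported: {lu_reported:.1f} Gflop/s, QR reported: {qr_reported:.1f} Gflop/s")
```

QR does twice the work but is credited with the LU count, so the reported rate is half of what the hardware actually sustained.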

Colin Percival

Sander Vesik

Nov 18, 2002, 11:06:13 AM

Robert Myers <rmyer...@attbi.com> wrote:

I wrote:
>> Yes, but the Linpack benchmark (which is used for the top 500 list)
>> doesn't actually allow such code modifications.
>>
> Very odd thing for comp.arch: I was thinking about solving actual
> problems rather than about showing off [ as a corporate entity, top
> 500 and all that, of course 8^} ].
>

And as Nick pointed out, I was wrong about NxN linpack anyways.

Nick Maclaren

Nov 18, 2002, 11:16:45 AM

In article <arb1sc$b90$1...@morgoth.sfu.ca>,

Colin Andrew Percival <cper...@sfu.ca> writes:
|>
|> I believe the rules are
|> 1. You can't use an algorithm with running time less than 2/3 n^3 + O(n^2).
|> 2. You must end up with a residue not much larger than partial pivoting
|> would produce, and
|> 3. You must count the FLOPS as 2/3 n^3, regardless of the number actually
|> performed.

That sounds about right.

|> My numerical analysis background is rather limited, but I think iterative
|> methods fail under (2); however, something like QR should be accepted
|> (albeit at a 50% penalty due to the increased operation count).

There are iterative methods that will deliver that accuracy, but
the more correction you do, the longer it takes. It is a LONG
time since I had anything to do with that area, so my memory is
rusty.

However, provided that your machines have enough memory, the
standard parallel forms of solution can be converted to be
almost arbitrarily CPU-bound by winding up N hard enough and
using very careful blocking to maximise cache use.

Iain Bason - Forte Tools

Nov 18, 2002, 12:28:44 PM

In article <BYRB9.68950$1O2.4084@sccrnsc04>,
Robert Myers <rmyer...@attbi.com> wrote:

>Peter Boyle wrote:
>
>That's part of why I think Intel's hypethreading is a real contribution
>to computing even if it turns out to be mostly just hyperhype. AFAICT,
>there is no production compiler that would even *try* to multi-thread an
>application, and I expect the appearance of hyperthreading to change that.

I'm not really familiar with the Intel world, but in general automatically
parallelizing compilers have been around for a long time, as have compilers
that parallelize based on pragmas inserted into the code. Neither of
these would be of much use in a GUI application, which really ought to
be multi-threaded anyway. Parallelizing compilers are pretty much
designed to improve the performance on number-crunching code.

>While such a compiler would be of no use for the low-cost interconnect
>situation, it is but a short step from such a compiler to tools that would.

On the contrary, parallelizing compilers generate code that tends not to
scale to a large number of processors, and that requires high bandwidth
and low latency communications. I'm not aware of even research compilers
that would do a good job of automatically parallelizing an application
for a distributed memory system with low bandwidth high latency communications.
That is a much harder problem to solve.

Iain

Robert Myers

Nov 18, 2002, 1:16:56 PM

Iain Bason - Forte Tools wrote:

> In article ,

> Robert Myers wrote:
>
> >Peter Boyle wrote:
> >
> >That's part of why I think Intel's hypethreading is a real contribution
> >to computing even if it turns out to be mostly just hyperhype. AFAICT,
> >there is no production compiler that would even *try* to multi-thread an
> >application, and I expect the appearance of hyperthreading to change
> that.
>
>
> I'm not really familiar with the Intel world, but in general automatically
> parallelizing compilers have been around for a long time, as have
> compilers
> that parallelize based on pragmas inserted into the code. Neither of
> these would be of much use in a GUI application, which really ought to
> be multi-threaded anyway. Parallelizing compilers are pretty much
> designed to improve the performance on number-crunching code.

You use the word parallelizing interchangeably with multi-threading, and
if that's the accepted usage, I'll accept it, too, but it runs the risk
of confusing a vectorizing compiler with a multi-threading compiler.

I am aware that some compilers allow the use of pragmas to achieve
multi-threading, but what I was imagining was a compiler that would
attempt to multi-thread code (as opposed to vectorizing it) without
intervention from the programmer. Such compilers may well exist, but I
am not aware of them.

>
> >While such a compiler would be of no use for the low-cost interconnect
> >situation, it is but a short step from such a compiler to tools that
> would.
>
>
> On the contrary, parallelizing compilers generate code that tends not to
> scale to a large number of processors, and that requires high bandwidth
> and low latency communications. I'm not aware of even research compilers
> that would do a good job of automatically parallelizing an application
> for a distributed memory system with low bandwidth high latency
> communications.
> That is a much harder problem to solve.

Way in over my head here, but analyzing code and automatically
identifying what could be usefully forked as a separate thread would
seem to be a task of fundamental importance for developing a compiler
that could even begin to utilize the capabilities of a distributed
memory system without a lot of hand coding.

I suspect that dynamic optimization could go a long way toward solving
the more general problem: Compile and simulate on a shared-memory
system, watch the system at work (something like DynamoRIO), then
attempt to reorganize the calculation to be suitable for distributed
memory. There must be a few dozen Ph.D. theses there.

Nick Maclaren

Nov 18, 2002, 1:29:27 PM

In article <s8aC9.43969$WL3.17058@rwcrnsc54>,
Robert Myers <rmyer...@attbi.com> writes:

|> I am aware that some compilers allow the use of pragmas to achieve
|> multi-threading, but what I was imagining was a compiler that would
|> attempt to multi-thread code (as opposed to vectorizing it) without
|> intervention from the programmer. Such compilers may well exist, but I
|> am not aware of them.

They exist, and have existed for over a decade. Iain Bason's
reservations are fully justified, and related to why Intel's
Hypethreading will NOT help much.

It isn't uncommon for such compilers to introduce a significant
or even large overhead - say 40% for two threads. That is still
a gain with two, genuinely parallel processors, but turns Intel's
much-flaunted Hypethreading into a loss.
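
The arithmetic can be sketched as follows. The 40% overhead is the figure from the post; the 1.25x combined throughput for two hardware threads sharing one core is a hypothetical number chosen for illustration, not a measurement of any Intel part.

```python
def speedup(overhead_fraction: float, throughput_gain: float) -> float:
    """Speedup over the serial program.

    overhead_fraction: extra work added by the threaded code (0.4 means 40%).
    throughput_gain:   total throughput relative to one serial processor.
    """
    parallel_work = 1.0 + overhead_fraction
    return throughput_gain / parallel_work


OVERHEAD = 0.40               # from the post
TWO_REAL_PROCESSORS = 2.0     # two genuinely parallel processors
ASSUMED_SMT_THROUGHPUT = 1.25 # hypothetical gain from two threads on one core

real = speedup(OVERHEAD, TWO_REAL_PROCESSORS)
smt = speedup(OVERHEAD, ASSUMED_SMT_THROUGHPUT)
print(f"two processors: {real:.2f}x, two hardware threads: {smt:.2f}x")
```

With 40% overhead, anything short of a 1.4x throughput gain leaves the threaded program slower than the serial one.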

David Gay

Nov 18, 2002, 2:07:14 PM

Robert Myers <rmyer...@attbi.com> writes:
> Iain Bason - Forte Tools wrote:
> > On the contrary, parallelizing compilers generate code that tends not to
> > scale to a large number of processors, and that requires high bandwidth
> > and low latency communications. I'm not aware of even research compilers
> > that would do a good job of automatically parallelizing an application
> > for a distributed memory system with low bandwidth high latency
> > communications.
> > That is a much harder problem to solve.
>
> Way in over my head here, but analyzing code and automatically identifying
> what could be usefully forked as a separate thread would seem to be a task
> of fundamental importance for developing a compiler that could even begin
> to utilize the capabilities of a distributed memory system without a lot of
> hand coding.

Well yes. But it's *hard*. It's a bit like saying that a compiler that
automatically analysed your code and identified bugs is of fundamental
importance to producing correct software (i.e., true but that doesn't
really help - admittedly the "finding bugs" goal is harder than the
"automatically parallelise code" goal).

The problems start with alias analysis, i.e., getting a good understanding
of what bits of data are accessed by what bits of code (in fact problem #1
is coming up with a useful way of distinguishing the various bits of data:
what do you use? variable names? malloc-and-friends call sites? access
paths from variables? None of these is particularly guaranteed to be
useful...) If you can't get a good answer here, then it's going to be
hard to split the computation in a way that keeps communication bandwidth
low.
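
A small sketch of why the aliasing question matters (Python used purely for illustration; the function names are made up). The same loop is run once in program order and once as if every iteration ran at the same time, which is modelled here by reading from a snapshot. If the two arrays are distinct the results agree; if they are the same object they do not, so a compiler that cannot prove the arrays distinct cannot safely split the loop.

```python
def shift_add_serial(dst, src):
    """Serial loop: dst[i] = src[i - 1] + 1, iterations run in order."""
    for i in range(1, len(dst)):
        dst[i] = src[i - 1] + 1

def shift_add_parallel(dst, src):
    """Same loop as if all iterations ran at once: every read happens
    before any write (modelled by snapshotting src first)."""
    snapshot = list(src)
    for i in range(1, len(dst)):
        dst[i] = snapshot[i - 1] + 1

# Distinct arrays: both orders agree, so splitting the loop is safe.
b = [10, 20, 30, 40, 50]
a1, a2 = [0] * 5, [0] * 5
shift_add_serial(a1, b)
shift_add_parallel(a2, b)
distinct_agree = (a1 == a2)

# Aliased arrays (dst is src): each iteration now reads the previous
# iteration's write, and the two orders disagree.
x = [10, 20, 30, 40, 50]
y = [10, 20, 30, 40, 50]
shift_add_serial(x, x)
shift_add_parallel(y, y)
aliased_agree = (x == y)

print(distinct_agree, aliased_agree)
```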

People have also tried changing the language, e.g., to a purely functional
language (i.e., functions are actually mathematical functions with no side
effects). The lack of side effects makes parallelisation easy, but my
understanding is that the speedups were not spectacular. Lack of side
effects makes efficient compilation of array-update-based computations
hard... This kind of approach (switch the language to make the
analysis/compilation problem tractable) also has the problem that you have
to get people to switch to a new language (never easy).

> I suspect that dynamic optimization could go a long way toward solving the
> more general problem: Compile and simulate on a shared-memory system, watch
the system at work (something like DynamoRIO), then attempt to reorganize
> the calculation to be suitable for distributed memory. There must be a few
> dozen Ph.D. theses there.

There's been a lot of (good) research, and theses, on parallel/distributed/etc
computing. It's not as simple as you seem to think (e.g., what you just said
has the naming problem I mentioned above...)

--
David Gay
dg...@acm.org

Greg Lindahl

unread,

Nov 18, 2002, 2:35:04 PM11/18/02

to

In article <BYRB9.68950$1O2.4084@sccrnsc04>,
Robert Myers <rmyer...@attbi.com> wrote:

> AFAICT,
> there is no production compiler that would even *try* to multi-thread an
> application, and I expect the appearance of hyperthreading to change that.

Tera's architecture depended quite heavily on an automatic
multi-threading compiler. It was very similar to the analysis required
to do a vectorizing compiler.

greg

Toon Moene

unread,

Nov 18, 2002, 4:26:08 PM11/18/02

to

Greg Lindahl wrote:

And it helps to do it for a language that has no aliasing problems
beyond the ones the programmer tells the compiler about directly (i.e.,
EQUIVALENCE).

Sorry, couldn't resist, because - as usual - people start to rummage
about how it is so hard to do alias analysis on <substitute your
favourite C dialect here>.

--
Toon Moene - mailto:to...@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
Join GNU Fortran 95: http://g95.sourceforge.net/ (under construction)

Robert Myers

unread,

Nov 18, 2002, 4:41:30 PM11/18/02

to

David Gay wrote:

> Robert Myers writes:
>
> > I suspect that dynamic optimization could go a long way toward solving
> > the more general problem: Compile and simulate on a shared-memory
> > system, watch the system at work (something like DynamoRIO), then
> > attempt to reorganize the calculation to be suitable for distributed
> > memory. There must be a few dozen Ph.D. theses there.
>
> There's been a lot of (good) research, and theses, on
> parallel/distributed/etc computing. It's not as simple as you seem to
> think (e.g., what you just said has the naming problem I mentioned
> above...)

This is a case of the wail tagging the dog. Despite a career-long and
poorly-concealed skepticism of well-funded hotbeds of research where
research is largely directed by fashionability, I don't have the
hubris/narcissism to imagine that I could so easily offer areas of
research that haven't been thought of before and maybe already beaten to
death.

What I am doing in thinking about how hyperthreading might transform the
world is revealing another career-long bias of mine: give somebody a
chance to make some real money from what had hitherto been a labor of
love/desire-for-fame/desperation-to-get-tenure, and miracles will happen.

Greg Lindahl

unread,

Nov 18, 2002, 4:47:29 PM11/18/02

to

In article <3DD95AF0...@moene.indiv.nluug.nl>,
Toon Moene <to...@moene.indiv.nluug.nl> wrote:

>Sorry, couldn't resist, because - as usual - people start to rummage
>about how it is so hard to do alias analysis on <substitute your
>favourite C dialect here>.

Hey, if they want to hurt themselves, should we really try to stop them?

-- greg

Alex Colvin

unread,

Nov 18, 2002, 5:29:22 PM11/18/02

to

>> Tera's architecture depended quite heavily on an automatic
>> multi-threading compiler. It was very similar to the analysis required
>> to do a vectorizing compiler.

>And it helps to do it for a language that has no aliasing problems
>beyond the ones the programmer tells the compiler about directly (i.e.,
>EQUIVALENCE).

It also helps a lot to have a full/empty bit so you can synchronize on any
word even in the presence of incomplete dataflow and aliasing information.
And what's the cost of an extra bit of memory these days?
--
mac the naïf

David Gay

unread,

Nov 18, 2002, 6:07:59 PM11/18/02

to

Toon Moene <to...@moene.indiv.nluug.nl> writes:
> Greg Lindahl wrote:
>
> > In article <BYRB9.68950$1O2.4084@sccrnsc04>,
> > Robert Myers <rmyer...@attbi.com> wrote:
>
> >> AFAICT, there is no production compiler that would even *try* to
> >> multi-thread an application, and I expect the appearance of
> >> hyperthreading to change that.
>
> > Tera's architecture depended quite heavily on an automatic
> > multi-threading compiler. It was very similar to the analysis required
> > to do a vectorizing compiler.
>
> And it helps to do it for a language that has no aliasing problems beyond
> the ones the programmer tells the compiler about directly (i.e.,
> EQUIVALENCE).
>
> Sorry, couldn't resist, because - as usual - people start to rummage about
> how it is so hard to do alias analysis on <substitute your favourite C
> dialect here>.

Though I at least wouldn't want to write a compiler in Fortran 77 ;-) Do you?
I read the subject as being "can we auto-parallelise everyday apps", not
just array-based scientific code.

I also believe that alias analysis remains hard in pretty much all languages
where you can build arbitrary pointer-based data structures (i.e., pretty
much all object-oriented languages, lisp, C, Ada, etc, etc) C just has a few
features which make it extra painful (can access anything with char *,
can cast pointers to integers and back).

--
David Gay
dg...@acm.org

Robert A Duff

unread,

Nov 18, 2002, 7:40:57 PM11/18/02

to

David Gay <dg...@lagaffe.CS.Berkeley.EDU> writes:

> Though I at least wouldn't want to write a compiler in Fortran 77 ;-) Do you?

Nor C. ;-)

> I read the subject as being "can we auto-parallelise everyday apps", not
> just array-based scientific code.
>
> I also believe that alias analysis remains hard in pretty much all languages
> where you can build arbitrary pointer-based data structures (i.e., pretty
> much all object-oriented languages, lisp, C, Ada, etc, etc) C just has a few
> features which make it extra painful (can access anything with char *,
> can cast pointers to integers and back).

The main problem with C in this regard is not that it allows pointers,
nor that it allows the char* and casting nonsense you mentioned, but
that you *have* to use pointers all over the place. For example, in Ada
you would use an 'in out' parameter, where in C you would use a
"something*". For another example, most languages distinguish arrays
from pointers-to-arrays; C confuses them, so when you have a "thing*"
parameter, the compiler doesn't know if it's a pointer to a thing or a
pointer to an array of things. I suppose the problem is similar in the
"everything's a reference" languages (Java, Lisp, Eiffel, ...), but not
in Fortran, Ada, Pascal, Modula-2, ....

By the way, Ada has similar rules to Fortran regarding aliasing of
(non-pointer) formal parameters (dummy arguments, in Fortran).
This makes life easier for an optimizer (compared to C) when doing
alias analysis.

- Bob

Greg Lindahl

unread,

Nov 18, 2002, 8:38:37 PM11/18/02

to

In article <s71y97q...@lagaffe.CS.Berkeley.EDU>,
David Gay <dg...@lagaffe.CS.Berkeley.EDU> wrote:

>Though I at least wouldn't want to write a compiler in Fortran 77 ;-) Do you?

You'd be advised to at least use Fortran 2000 ;-)

>I read the subject as being "can we auto-parallelise everyday apps", not
>just array-based scientific code.

As a by the way, vectorized loop analysis is very useful in some
non-scientific integer applications: unrolling, jamming, splitting,
generation of SSE2 instructions, etc. It doesn't do much in, say, gcc,
but bzip2 seems to have some nice loops with unrolling...
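
For anyone who hasn't seen the transformation, a minimal sketch of unrolling by four, written out by hand. Python is used only to show the shape of what the compiler produces (it will not run faster here); the function names are made up for illustration.

```python
def sum_rolled(values):
    """The loop as the programmer wrote it."""
    total = 0
    for v in values:
        total += v
    return total

def sum_unrolled4(values):
    """The same loop unrolled by 4 with separate accumulators,
    plus a remainder loop for lengths that are not a multiple of 4."""
    n = len(values)
    main = n - (n % 4)              # largest multiple of 4 not exceeding n
    s0 = s1 = s2 = s3 = 0
    for i in range(0, main, 4):
        s0 += values[i]
        s1 += values[i + 1]
        s2 += values[i + 2]
        s3 += values[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(main, n):        # leftover 0..3 elements
        total += values[i]
    return total

data = list(range(1, 11))
print(sum_rolled(data), sum_unrolled4(data))
```

The four independent accumulators are what give a vector unit (or a superscalar core) something to do in parallel. With integers the reassociation is exact; with floating point it can change the rounding.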

greg

ma...@sandbridgetech.com

unread,

Nov 18, 2002, 9:59:13 PM11/18/02

to

Iain Bason - Forte Tools wrote:
>

Two problems:
- automatically distributing loops across multiple processors/multiple
threads; fairly well solved in at least the FORTRAN world for a large
class of loops (see concepts like SPMD, owner-computes etc.)
- automatically extracting threads from an imperative language for NON
loopy code. Almost impossible from a procedural language. A fair amount
of work done on data-flow languages (Id for one). Goes quite the other
direction - there is an embarrassment of threads. Unfortunately, it's
pretty hard to get any good performance out of them. (See monsoon).

Actually, the work I have looked at suggests that if you try to
"locally" thread a program in a procedural language (i.e. the compiler
tries to analyze a block/SESE/function, and extract multiple threads out
of it), it may be able to keep two threads busy. However, the IPC per
thread will go down by a factor of 2 as well. The amount of parallelism
available locally is pretty constant (2) and that parallelism can either
be exploited through threading or through OOO, but not through both :)

Another point: a large subset of the applications that benefit from loop
multithreading fall into the "embarrassingly parallel" class - i.e. you
can use just about any technique, including vectorization, to exploit
that parallelism. In the context of an Intel Pentium IV, it should be
completely possible to saturate the resources of the chip with *one*
thread.

Mayan

Ketil Malde

unread,

Nov 19, 2002, 2:20:41 AM11/19/02

to

David Gay <dg...@lagaffe.CS.Berkeley.EDU> writes:

> The lack of side effects makes parallelisation easy, but my
> understanding is that the speedups were not spectacular. Lack of side
> effects makes efficient compilation of array-update-based computations
> hard...

Well, of course, you can't have it both ways. Updateable arrays are
inherently serial. This is, I think, also illustrated by
e.g. aliasing problems in C.

(BTW, I think most functional languages provide updateable arrays, and
some means to ensure a deterministic sequence of evaluation (monads in
Haskell, uniqueness types in Clean, for instance))

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Aaron Spink

unread,

Nov 19, 2002, 2:57:14 AM11/19/02

to

"Robert Myers" <rmyer...@attbi.com> wrote in message
news:ALBB9.60043$QZ.9877@sccrnsc02...

> "On a typical implementation, it will take one cycle to access data from
> L1 if there is a cache hit in L1. It will take between seven to 10
> cycles to access data from L2 in case of a cache miss in L1 and a cache
> hit in L2. It will take between 20 to 50 cycles to get data from memory
> in case of a cache miss in L2."
>

Christ on toast. 50 cycles at 1.3 GHz... 38.4 nS including logic overheads
in the memory controller and physical transport to the memory and back.
Where O Where can I get that ~10nS main memory from and how much does it
cost. :)

I'm assuming that either the document is wrong or cycles at IBM mean
something other than cycles in the rest of the world.

Aaron Spink
speaking for myself inc

Nick Maclaren

unread,

Nov 19, 2002, 4:10:32 AM11/19/02

to

In article <wcc1y5i...@shell01.TheWorld.com>,

Robert A Duff <bob...@shell01.TheWorld.com> writes:
|>
|> The main problem with C in this regard is not that it allows pointers,
|> nor that it allows the char* and casting nonsense you mentioned, but
|> that you *have* to use pointers all over the place. ...

A related one is that you HAVE to use unchecked casts, in order to
call any generic function (including much of the standard library).
C++ is a bit better, but not much.

I have never liked casts, and much prefer conversion operators,
but the latter lost out to the former in Algol 68. And this is
a place where C has inherited from Algol 68.

Christoph Breitkopf

unread,

Nov 19, 2002, 5:36:47 AM11/19/02

to

lin...@pbm.com (Greg Lindahl) writes:

> As a by the way, vectorized loop analysis is very useful in some
> non-scientific integer applications: unrolling, jamming, splitting,
> generation of SSE2 instructions, etc. It doesn't do much in, say, gcc,
> but bzip2 seems to have some nice loops with unrolling...

bzip2 happens to be one of my in-house mini-benchmarks, and
the Intel C compiler does not perform better than gcc on it.
I have never looked at the source - are those 'nice loops'
candidates for SSE2 only? 'cause I have only a Pentium III
available for testing.

gcc's -funroll-loops did not make any difference, and
neither did profiling information (icc's -prof_gen/prof_use,
gcc's -fprofile-arcs/-fbranch-probabilities).

Regards,
Chris

Bernd Paysan

unread,

Nov 19, 2002, 6:33:14 AM11/19/02

to

Christoph Breitkopf wrote:
> bzip2 happens to be one of my in-house mini-benchmarks, and
> the Intel C compiler does not perform better than gcc on it.

My impression is that a lot of the compiler art involved in icc (and other
SPEC monster compilers) is targeted at "bad programs". On the other side,
programs in wide use typically are improved at the source level, and tuned
to run well on GCC.

GCC recently acquired features to transform code. For my programs (especially
Gforth), the effect is disastrous. I even had to add a flag to disable one
of the code transformations (cross jump) to get it to compile to something
reasonable. Since they do run SPEC on GCC, I suppose this sort of code
transformation does improve SPEC. Please not. Bad programs should run slow,
to force programmers to learn and understand why they run slow, and what to
do to make them run fast. Stunts like what Sun did on 179.art are really
dangerous. People then think that memory layout doesn't matter, and that
bunches of malloc()s are as fast as statically allocated stuff.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Robert Myers

unread,

Nov 19, 2002, 11:02:29 AM11/19/02

to

Bernd Paysan wrote:

> Christoph Breitkopf wrote:
>
> > bzip2 happens to be one of my in-house mini-benchmarks, and
> > the Intel C compiler does not perform better than gcc on it.
>
> My impression is that a lot of the compiler art involved in icc (and
> other SPEC monster compilers) is targeted at "bad programs". On the
> other side, programs in wide use typically are improved at the source
> level, and tuned to run well on GCC.

On what platform, though? Even restricting the universe of target
platforms to x86, the set of possible optimizations as well as what
constitutes a "good" program is platform specific.

>
>
> GCC recently acquired features to transform code. For my programs
> (especially Gforth), the effect is disastrous. I even had to add a flag
> to disable one of the code transformations (cross jump) to get it to
> compile to something reasonable. Since they do run SPEC on GCC, I
> suppose this sort of code transformation does improve SPEC. Please not.
> Bad programs should run slow, to force programmers to learn and
> understand why they run slow, and what to do to make them run fast.

With all due respect to your very sensible view of the world, writing
programs that are "good" programs independent of the target platform is
a needle that is well-nigh unthreadable, especially if you include the
anomalous but very important case of the P4 with its unusually deep
pipeline and SSE2 instructions.

Andy Isaacson

unread,

Nov 19, 2002, 1:00:08 PM11/19/02

to

In article <u9mC9.5702$fY3.6...@newsread2.prod.itd.earthlink.net>,

As John M pointed out in another post, that document probably dates from
the 150 MHz days. I think it's
<URL:http://publib-b.boulder.ibm.com/Redbooks.nsf/9445fa5b416f6e32852569ae006bb65f/203cc8a81cbc374d8525688e00707ac3?OpenDocument>
(wheee, mile-long links) which claims to be an update of a document
originally published in 1997.
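
The arithmetic behind both readings, as a quick sketch. The 50-cycle figure is the upper bound from the quoted RedBook; the two clock rates are the 1.3 GHz assumed in the complaint and the 150 MHz era suggested above.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9

ns_at_1300mhz = cycles_to_ns(50, 1.3e9)   # POWER4-class clock
ns_at_150mhz = cycles_to_ns(50, 150e6)    # mid-1990s clock

print(f"50 cycles at 1.3 GHz: {ns_at_1300mhz:.1f} ns")
print(f"50 cycles at 150 MHz: {ns_at_150mhz:.1f} ns")
```

At 150 MHz the same 50 cycles is roughly a third of a microsecond, which is an unremarkable main-memory latency.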

-andy

Niels Jørgen Kruse

unread,

Nov 19, 2002, 1:23:34 PM11/19/02

to

I artiklen <araqd9$7lk$1...@ausnews.austin.ibm.com> ,
mcca...@gmp246.austin.ibm.com (McCalpin) skrev:

> There are really only a few cases where data bypasses a cache in
> POWER4 systems. The first is a store that misses the L1 --- it
> gets gathered by the store buffers and dumped directly into the L2,
> and is not allocated into the L1 cache. The second is a special
> case that allows L2 castouts to bypass the L3 cache and go directly
> to memory. This is not under direct user control.

Does the second case apply to the STREAM benchmark (tuned or not)?

I wonder how blocked access would be separated from streamed.

> For inbound data, cacheing depends on the type of access. Loads
> come through the L3, into the L2, then into the L1, then into the
> registers. Stores that miss in the caches bring data through the
> L3 (but don't leave a copy there) and into the L2, bypass the L1,
> and put the cache line into the store buffers so that the store
> operation can update a coherent copy of the data. My "tuned"
> STREAM benchmark uses the PowerPC DCBZ instruction (accessed by the
> *!IBM CACHE_ZERO Fortran compiler directive) to allocate a line in
> the L2 (and store buffers), filling it with zeroes and not reading
> the original contents from memory.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Del Cecchi

unread,

Nov 19, 2002, 3:21:26 PM11/19/02

to

In article <u9mC9.5702$fY3.6...@newsread2.prod.itd.earthlink.net>,

I believe that may have been from a document referring to some earlier processor
implementation.

--

Del Cecchi
cec...@us.ibm.com
Personal Opinions Only

Robert Myers

unread,

Nov 19, 2002, 4:50:23 PM11/19/02

to

Del Cecchi wrote:

> In article ,
> "Aaron Spink" writes:
> |>
> |> "Robert Myers" wrote in message

> |> news:ALBB9.60043$QZ.9877@sccrnsc02...
> |> > "On a typical implementation, it will take one cycle to access data
> |> > from L1 if there is a cache hit in L1. It will take between seven to
> |> > 10 cycles to access data from L2 in case of a cache miss in L1 and a
> |> > cache hit in L2. It will take between 20 to 50 cycles to get data
> |> > from memory in case of a cache miss in L2."
> |> >
> |> Christ on toast. 50 cycles at 1.3 GHz... 38.4 nS including logic
> |> overheads in the memory controller and physical transport to the
> |> memory and back. Where O Where can I get that ~10nS main memory from
> |> and how much does it cost. :)
> |>
> |> I'm assuming that either the document is wrong or cycles at IBM mean
> |> something other than cycles in the rest of the world.
> |>
> |> Aaron Spink
> |> speaking for myself inc
> |>
>
> I believe that may have been from a document referring to some earlier
> processor implementation.
>
>

It was, a fact I might have inferred from any number of clues in the
document, had I been concerned about its relevance. The IBM document
number was SG244810, an IBM RedBook. I downloaded it from the IBM web
site the very day I made the original post, November 16, 2002.

I have finally gotten to the point where I can pick through Intel's
duplication of names (Celeron, Celeron, and Celeron--the only thing you
know for sure is that it's a crippled Pentium, but knowing which Pentium
and how requires almost daily attention), but IBM's marketing department
is a challenge I hadn't had to deal with in over a decade.

Greg Lindahl

unread,

Nov 19, 2002, 6:21:40 PM11/19/02

to

>I have never looked at the source - are those 'nice loops'
>candidates for SSE2 only? 'cause I have only a Pentium III
>available for testing.

I referred to loop unrolling, splitting, and jamming, as well as SSE2.
By looking at only two compilers not exactly known for optimizations
of these types, you probably won't learn much.

greg

Colin Andrew Percival

unread,

Nov 19, 2002, 8:04:29 PM11/19/02

to

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> Colin Andrew Percival <cper...@sfu.ca> writes:
> |> My numerical analysis background is rather limited, but I think iterative
> |> methods fail under (2); however, something like QR should be accepted
> |> (albeit at a 50% penalty due to the increased operation count).

> provided that your machines have enough memory, the
> standard parallel forms of solution can be converted to be
> almost arbitrarily CPU-bound by winding up N hard enough and
> using very careful blocking to maximise cache use.

Only if your network latency is low enough -- partial pivoting requires at
least O(N) times the interconnect latency. Can you get a lab of
workstations with application-to-application interconnect latency of less
than a millisecond?
QR has the advantage of being wonderfully numerically stable; as a result,
it can be fully blocked, reducing the latency-critical path to O(sqrt(p)).
(There are other alternatives, but QR is the best known.)

Colin Percival

del cecchi

unread,

Nov 18, 2002, 9:19:28 PM11/18/02

to

"Robert Myers" <rmyer...@attbi.com> wrote in message

news:1d_B9.71723$nB.5200@sccrnsc03...

> McCalpin wrote:
>
> > In article ,
> > Robert Myers wrote:
> >

> > >McCalpin wrote:
> > >
> > >>Memory is interleaved at 512 Byte granularity around the four memory
> > >>controllers, so this is a "flat" SMP according to the most common
> > >>terminology. Every fourth 512 Byte region has a memory and L3 cache
> > >>latency that is a few cycles faster than the other three, but this is
> > >>not detectable in practice.
> > >
> > >At the risk of being flamed, let me offer a quote from an IBM RedBook,
> > >"Understanding IBM pSeries e-server Performance and Sizing":
> > >
>
> <snip>
>
> >
> >
> > >I am assuming that the very broad 20-50 cycle delay on an L2 cache miss
> > >depends on whether the data are to be found in L3 or not.
> >
> >
> > It is probably not wise to try to make even broad generalizations
> > from these numbers, since they correspond to older generations of
> > IBM pSeries systems.
> >
> > It has been many years since IBM made a system with 20-50 cycle memory
> > latency....
>
> I was always impressed by the volume of literature that IBM made
> available to mere mortals like me, and now you tell me it is not to be
> trusted? 8^}. I will be more careful in the future.
>
Just check the date. Since literature going back to 360 is available on
line what you download might not be current. :-)

del cecchi

unread,

Nov 18, 2002, 9:22:07 PM11/18/02

to

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:3dd94086$1...@news.meer.net...

Is there any significance to your use of the past tense when referring to
Tera MTA? Something I missed?

del cecchi

unread,

Nov 19, 2002, 9:24:36 PM11/19/02

to

"Robert Myers" <rmyer...@attbi.com> wrote in message

news:zmyC9.85731$1O2.6149@sccrnsc04...

IBM apparently can afford a bunch of disks and servers :-) because there
is really a lot of documentation on line for free. Some of it is for
stuff that isn't particularly new. The System/360 Principles of
Operation are on there. Good thing someone isn't wolfing about 1 cycle
access to memory. :-)

Redbooks sometimes get updated to add new models without the old stuff
being removed.

del cecchi.
I save everything too. I still have my SLT manual.

Nick Maclaren

unread,

Nov 20, 2002, 3:05:58 AM11/20/02

to

In article <aren2t$6q3$1...@morgoth.sfu.ca>,

Colin Andrew Percival <cper...@sfu.ca> writes:

|> Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
|>
|> > provided that your machines have enough memory, the
|> > standard parallel forms of solution can be converted to be
|> > almost arbitrarily CPU-bound by winding up N hard enough and
|> > using very careful blocking to maximise cache use.
|>
|> Only if your network latency is low enough -- partial pivoting requires at
|> least O(N) times the interconnect latency. Can you get a lab of
|> workstations with application-to-application interconnect latency of less
|> than a millisecond?

That was my point about enough memory. The number of operations
goes up as N^3 and the memory by N^2, so that even a large factor
of N can be reduced to insignificance by increasing N far enough.
That is what many of the vendors that quote high figures on machines
with poor interconnects do.
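
A rough sketch of that scaling argument, with made-up machine parameters: an aggregate rate of 10 GFLOP/s and a 1 ms latency are both assumptions, as is the simplification that partial pivoting costs exactly one latency-bound exchange per column.

```python
def lu_times(n, flops_per_sec, latency_sec):
    """Return (compute_seconds, latency_seconds) for an n x n solve."""
    compute = (2.0 / 3.0) * n**3 / flops_per_sec   # ~2/3 N^3 operations
    latency = n * latency_sec                      # one exchange per column
    return compute, latency

FLOPS = 10e9      # assumed aggregate rate of the whole lab (hypothetical)
LATENCY = 1e-3    # assumed 1 ms application-to-application latency

ratios = {}
for n in (1_000, 10_000, 100_000):
    compute, latency = lu_times(n, FLOPS, LATENCY)
    ratios[n] = compute / latency
    memory_gb = 8 * n * n / 1e9                    # one double per element
    print(f"N={n:>7}: compute {compute:10.1f} s, latency {latency:6.1f} s, "
          f"ratio {ratios[n]:8.2f}, matrix {memory_gb:8.3f} GB")
```

The compute-to-latency ratio grows as N^2, so each factor of 10 in N buys a factor of 100. The catch is the last column: the matrix grows as N^2 too, which is the "enough memory" proviso.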

A reply to your question is "easily". Even TCP/IP on Ethernet is
now (just) below a millisecond on many systems, though often only
when they are being incestuous. There is no problem in doing much
better if you bypass TCP/IP.

If you had said "less than 10 microseconds", the answer would have
been different.

Colin Andrew Percival

unread,

Nov 20, 2002, 3:48:56 AM11/20/02

to

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> Colin Andrew Percival <cper...@sfu.ca> writes:
> |> least O(N) times the interconnect latency.
> |> Can you get a lab of
> |> workstations with application-to-application interconnect latency of less
> |> than a millisecond?

> A reply to your question is "easily". Even TCP/IP on Ethernet is
> now (just) below a millisecond on many systems, though often only
> when they are being incestuous. There is no problem in doing much
> better if you bypass TCP/IP.

Sorry, I should have clarified my question somewhat further: Can you get a
lab of workstations with application-to-application interconnect latency of
less than a millisecond, while said workstations are being used for other
purposes, and users will get annoyed with you if you interfere with their
work?
In such a scenario, you can't even rely upon receiving a time slice
promptly, so you run into problems even before you get started.

Colin Percival

Nick Maclaren

unread,

Nov 20, 2002, 4:03:06 AM11/20/02

to

In article <arfi9o$pou$1...@morgoth.sfu.ca>,

Colin Andrew Percival <cper...@sfu.ca> writes:
|>

|> Sorry, I should have clarified my question somewhat further: Can you get a
|> lab of workstations with application-to-application interconnect latency of
|> less than a millisecond, while said workstations are being used for other
|> purposes, and users will get annoyed with you if you interfere with their
|> work?

You can by running at a higher priority and ignoring the screams of
complaint (do I hear cries of "BOFH"?) :-)

|> In such a scenario, you can't even rely upon receiving a time slice
|> promptly, so you run into problems even before you get started.

Yes, I agree. Such facilities are fairly usable for things like
a lot of Monte Carlo work and some search techniques, where you
have a near-infinite pool of tasks, virtually no synchronisation
requirements and can simply repeat a delayed task on some other
agent. But Linpack is not one such.
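
A minimal sketch of that kind of workload, assuming a toy Monte Carlo estimate of pi. The task farm here is simulated in one process, and `run_task` and `farm` are made-up names. Each task is seeded by its own id, so a task that gets stuck on a busy workstation can simply be run again somewhere else with the same result.

```python
import random

def run_task(task_id, samples=20_000):
    """One independent task: count random points inside the unit
    quarter-circle. Seeding by task_id makes the task repeatable."""
    rng = random.Random(task_id)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def farm(task_ids, delayed, samples=20_000):
    """Hand out tasks; any task whose first agent is 'delayed' is
    abandoned and reissued later, with no synchronisation needed."""
    results = {}
    reissued = []
    for tid in task_ids:
        if tid in delayed:
            reissued.append(tid)      # first agent never answered
            continue
        results[tid] = run_task(tid, samples)
    for tid in reissued:              # repeat on some other agent
        results[tid] = run_task(tid, samples)
    return results

results = farm(range(20), delayed={3, 7})
pi_estimate = 4.0 * sum(results.values()) / (20 * 20_000)
print(f"pi is roughly {pi_estimate:.3f}")
```

Nothing in the result depends on which agent ran which task or in what order, which is exactly what a pivoting step in Linpack cannot offer.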

Colin Andrew Percival

unread,

Nov 20, 2002, 4:36:15 AM11/20/02

to

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
> You can by running at a higher priority and ignoring the screams of
> complaint (do I hear cries of "BOFH"?) :-)

Unfortunately, that leads to a rapid fall in MTBF, as users reboot or turn
off their machines. :-)

> |> In such a scenario, you can't even rely upon receiving a time slice
> |> promptly, so you run into problems even before you get started.

> Yes, I agree. Such facilities are fairly usable for things like
> a lot of Monte Carlo work and some search techniques, where you
> have a near-infinite pool of tasks, virtually no synchronisation
> requirements and can simply repeat a delayed task on some other
> agent. But Linpack is not one such.

I think that, given an algorithm other than partial pivoting, and massive
redundancy, the Linpack Benchmark could be run on such systems; but there
are certainly major issues involved.

Colin Percival

Christoph Breitkopf

unread,

Nov 20, 2002, 6:19:54 AM11/20/02

to

lin...@pbm.com (Greg Lindahl) writes:

If the Intel compiler is 'not exactly known for optimizations
of these types', which compiler do you recommend? And why does
everyone use the Intel compiler for SPEC CPU? (Which includes
an older version of bzip2, after all).

Regards,
Chris

Robert Myers

unread,

Nov 20, 2002, 9:21:17 AM11/20/02

to

Colin Andrew Percival wrote:

Nobody ever said it would be easy; just that there's a lot of cycles
there begging to be used. The challenge, and it is not necessarily a
trivial one, is finding a way to put them to work.

Colin Andrew Percival

unread,

Nov 20, 2002, 12:28:12 PM11/20/02

to

Robert Myers <rmyer...@attbi.com> wrote:
> Nobody ever said it would be easy; just that there's a lot of cycles
> there begging to be used. The challenge, and it is not necessarily a
> trivial one, is finding a way to put them to work.

s/The challenge/My DPhil research project at present/ :-)

Colin Percival

Andy Isaacson

unread,

Nov 20, 2002, 5:01:04 PM11/20/02

to

In article <m3k7j8i...@eddie.mignet.mragrathea.de>,

Greg's presumably referring to compilers for other platforms like Alpha,
Cray, NEC SX, SGI. None of them generate code for x86. The Intel
compiler is probably the best commercially-available code generator for
the x86 platform, but that's like saying "I'm the strongest patient in
the cancer ward."

-andy

del cecchi

unread,

Nov 20, 2002, 10:39:38 PM11/20/02

to

"Colin Andrew Percival" <cper...@sfu.ca> wrote in message
news:arfl2f$roq$1...@morgoth.sfu.ca...

The load leveler jobs are niced and you hardly notice them, unless you
ignore the machine for half an hour and all your processes get paged
out. :-(

Works great for simulation test cases or large statistical circuit
simulations.

del cecchi

Christoph Breitkopf

unread,

Nov 21, 2002, 4:48:58 AM11/21/02

to

a...@pirx.hexapodia.org (Andy Isaacson) writes:

> Greg's presumably referring to compilers for other platforms like Alpha,
> Cray, NEC SX, SGI. None of them generate code for x86. The Intel

He explicitly mentioned SSE2 instructions, i.e. x86.

Regards,
Chris

Jan C. Vorbrüggen

unread,

Nov 21, 2002, 6:55:22 AM11/21/02

to

> Though I at least wouldn't want to write a compiler in Fortran 77 ;-) Do you?

IBM did - the F77 compiler for VM compiled itself.

Jan

Robert Myers

unread,

Nov 21, 2002, 1:24:06 PM11/21/02

to

Jan C. Vorbrüggen wrote:

The only downside that I can see to using Fortran for just about
anything is that it does practically no checking of any kind (bounds and
type-checking on subroutine calls come immediately to mind).

Once you realize that the complete absence of any kind of checking
allows a disciplined programmer to do things that Fortran was not really
intended to do, you can use arrays to create structures of just about
any kind you like, refer to the same memory locations in different (but
consistent) ways, and use variable names to point into and manipulate
the structures you have created.
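
For readers who never saw the technique, a minimal sketch of the idea: a linked list held in two parallel arrays, with integer indices standing in for pointers. It is written in Python rather than Fortran for brevity, and all names (`value`, `nxt`, `push_front`, and so on) are made up for illustration.

```python
NIL = 0                          # index 0 reserved as the "null pointer"
CAPACITY = 8
value = [0] * (CAPACITY + 1)     # payload array, used 1-based as in Fortran
nxt = [NIL] * (CAPACITY + 1)     # "pointer" array: index of the next node
free_top = 1                     # next unused slot

def push_front(head, v):
    """Allocate a slot, store v, link it in front of head, return new head."""
    global free_top
    if free_top > CAPACITY:
        raise MemoryError("node arrays are full")
    node = free_top
    free_top += 1
    value[node] = v
    nxt[node] = head
    return node

def to_list(head):
    """Walk the chain of indices and collect the payloads."""
    out = []
    while head != NIL:
        out.append(value[head])
        head = nxt[head]
    return out

head = NIL
for v in (30, 20, 10):
    head = push_front(head, v)
items = to_list(head)
print(items)
```

The same two arrays would be a pair of dimensioned variables in Fortran 77. The discipline lies in never using an index that did not come from the allocator, which is precisely what no compiler of the day would check for you.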

I don't know that I would recommend this now, but some scientific
programmers were using very non-Fortran coding techniques long before it
was possible for a scientific programmer even to consider using c.

Nick Maclaren

unread,

Nov 21, 2002, 4:13:07 PM11/21/02

to

In article <ax9D9.72814$P31.37562@rwcrnsc53>,

Robert Myers <rmyer...@attbi.com> wrote:
>
>The only downside that I can see to using Fortran for just about
>anything is that it does practically no checking of any kind (bounds and
>type-checking on subroutine calls come immediately to mind).
>
>Once you realize that the complete absence of any kind of checking
>allows a disciplined programmer to do things that Fortran was not really
>intended to do, you can use arrays to create structures of just about
>any kind you like, refer to the same memory locations in different (but
>consistent) ways, and use variable names to point into and manipulate
>the structures you have created.

This is so mistaken and misleading as to be effectively false.

As with most languages, the Fortran specification (i.e. architecture)
neither requires nor forbids checking. There have been some almost
watertight checking Fortran compilers, from Fortran II up to 95. ICL,
Waterloo, Fujitsu and NAG all have produced them - and the last is
current. The Fortran standards body has always been determined that
the design should neither require nor prevent thorough checking.

A few languages have required checking, and a VERY few have made it
effectively impossible ('forbid' is not the right word). C belongs to the
latter category.

>I don't know that I would recommend this now, but some scientific
>programmers were using very non-Fortran coding techniques long before it
>was possible for a scientific programmer even to consider using C.

That is certainly true.

Robert Myers

Nov 21, 2002, 5:14:41 PM

Nick Maclaren wrote:

> In article ,

> Robert Myers wrote:
>
> >The only downside that I can see to using Fortran for just about
> >anything is that it does practically no checking of any kind (bounds and
> >type-checking on subroutine calls come immediately to mind).
> >
> >Once you realize that the complete absence of any kind of checking
> >allows a disciplined programmer to do things that Fortran was not really
> >intended to do, you can use arrays to create structures of just about
> >any kind you like, refer to the same memory locations in different (but
> >consistent) ways, and use variable names to point into and manipulate
> >the structures you have created.
>
>
> This is so mistaken and misleading as to be effectively false.

Ouch. If you wanted to use bounds checking on Fortran on the Cray
machines I have worked with, you could do it, but you had to expect that
your code would run at least 10x slower because nothing would vectorize
and because of the extra overhead. I believe it was possible to use
bounds checking on the CDC 7600, but you had to run it with opt=0. I
have never used a Fortran compiler that would even permit explicit
type-checking, but it *has* been over a decade.

It would have been more accurate if I had said that the way Fortran was
generally being used at the time the Fortran compiler under discussion
was written, it was not easy or common practice to use bounds checking,
and the Fortran compilers I used didn't even permit type checking, since
object modules could be compiled separately with nothing like a header
file to provide the information required to do type checking. Even with
no header files or function declarations to rely on, you could embed
information in the object file to let the linker do type checking, but
if any compiler/linker I ever used actually did that, I was unaware of it.

A particularly dangerous (but therefore very powerful) construct was the
Fortran Common block. A scientific programmer who was supposed to be
helping me unintentionally introduced me to the power of this construct
by changing the common block in the main routine so that it pointed to
LCM on the CDC 7600, but didn't bother to change the declarations in any
of the subroutines. Result: the memory for the common block was
allocated in Large Core (Slow) Memory, and the subroutines were trying
to access those variables in Small Core (Fast) Memory. The one test
case the programmer ran worked, so the code was handed over to me in
that fashion. Once I had unscrambled the mess the programmer had made,
I realized that what was a very dangerous weakness of Fortran was also a
license to invent, so I started turning Fortran into something I
regarded as more modern by declaring a large common block in memory and
then managing the memory allocated to the common block myself with pointers.
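
The post gives no code for this technique, so the following is only a minimal sketch of the general idea, written in C rather than Fortran: one large static array stands in for the common block, and plain integer indexes play the role of "pointers" into it. All names (pool, pool_alloc, pool_reset, pool_demo) and the pool size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_WORDS 1024

/* One big block of storage, standing in for the large COMMON block. */
static double pool[POOL_WORDS];
static size_t pool_next = 0;          /* index of the first unused word */

/* Reserve n words and return the index of the first one,
 * or (size_t)-1 if the pool does not have n words left. */
size_t pool_alloc(size_t n)
{
    if (n > POOL_WORDS - pool_next)
        return (size_t)-1;
    size_t base = pool_next;
    pool_next += n;
    return base;
}

/* Release everything at once; there is no per-object free. */
void pool_reset(void)
{
    pool_next = 0;
}

/* Carve two 3-element vectors out of the pool and return their
 * dot product, addressing them only through integer offsets. */
double pool_demo(void)
{
    pool_reset();
    size_t x = pool_alloc(3);
    size_t y = pool_alloc(3);
    assert(x != (size_t)-1 && y != (size_t)-1);

    for (size_t i = 0; i < 3; i++) {
        pool[x + i] = (double)(i + 1);    /* 1, 2, 3 */
        pool[y + i] = 2.0;
    }

    double dot = 0.0;
    for (size_t i = 0; i < 3; i++)
        dot += pool[x + i] * pool[y + i];
    return dot;
}
```

Nothing stops an offset from running past the region it was given, which is exactly the lack of checking the post describes as both the danger and the freedom of the approach.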

A vast hole in my experience is that, while most of the rest of the
world was using Fortran, I was using PL/I whenever possible, and
therefore I have practically no experience with the IBM Fortran
compilers that practically everyone else was using, even when I *was*
using IBM machines.

If I had compiled my Fortran code with type and bounds checking, I would
have been much more constrained in the manner in which I coded, and I
would have been forced to code in a style that I regarded as ancient,
since I had been weaned on PL/I.

>
> A few languages have required checking, and a VERY few have made it
> effectively impossible ('forbid' is not the right word). C belongs to the
> latter category.

Are you really claiming that it is easier to do type-checking in Fortran
(ancient or modern) than in C?

Nick Maclaren

Nov 21, 2002, 5:59:58 PM

In article <lVcD9.74417$%m4.3...@rwcrnsc52.ops.asp.att.net>,

Robert Myers <rmyer...@attbi.com> wrote:
>
>> A few languages have required checking, and a VERY few have made it
>> effectively impossible ('forbid' is not the right word). C belongs to the
>> latter category.
>
>Are you really claiming that it is easier to do type-checking in Fortran
>(ancient or modern) than in C?

Yes. To a level that most people simply cannot believe. Would you
like one of my documents that touches on this issue? Be warned: it
shouldn't be read late at night ....

David Gay

Nov 21, 2002, 6:06:53 PM

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
> A few languages have required checking, and a VERY few have made it
> effectively impossible ('forbid' is not the right word). C belongs to the
> latter category.

A case could be made that most languages with explicit deallocation
make checking rather hard. A dangling pointer is a type-safety violation
waiting to explode... I believe Ada places deallocation in the "unsafe"
category for precisely this reason?

--
David Gay
dg...@acm.org

Terje Mathisen

Nov 22, 2002, 2:39:05 AM

David Gay wrote:

When free'ing stuff in C, I try to always set the pointer to NULL
afterwards. That way I have a lot better chance to catch any
unintended/late derefs.

This doesn't help so much if said pointer has been aliased of course,
so if I need to do that (i.e. to step a pointer along a malloc'ed
array), I'll use local variables to do so. These pointers will at least
go away as soon as I leave the current function.
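
A minimal sketch of the habit described above, assuming a hypothetical FREE_AND_NULL macro and a hypothetical sum_and_release function (neither appears in the post): the owning pointer is reset to NULL right after free(), and the only alias is a loop-local pointer that is gone before the release happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Free p and immediately reset it, so a late dereference through
 * this variable hits NULL instead of recycled memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Fill a malloc'ed array with 0..n-1, sum it through a local alias,
 * release it, and store the sum in *out.
 * Returns 1 if the owning pointer is NULL after the release,
 * or -1 if the allocation failed. */
int sum_and_release(size_t n, long *out)
{
    assert(out != NULL);

    int *data = malloc(n * sizeof *data);
    if (data == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        data[i] = (int)i;

    long total = 0;
    /* The alias p exists only inside this loop, so it cannot
     * outlive the array it steps along. */
    for (const int *p = data; p < data + n; p++)
        total += *p;
    *out = total;

    FREE_AND_NULL(data);
    return data == NULL;
}
```

The macro only protects the one variable it is applied to; any other copy of the pointer is still dangling, which is the aliasing caveat made above.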

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

Nov 22, 2002, 4:42:30 AM

In article <s713cpu...@lagaffe.CS.Berkeley.EDU>,

While that has a degree of truth in it, it is not so. What automatic
deallocation does is to replace one sort of error with another (i.e.
a bad pointer use by a logic error). It does not eliminate a cause
of error, nor does it make checking for the error any easier, though
it is possible to claim that it reduces (note reduces, not removes)
the chances of the error causing complete application collapse and
the failure of all application diagnostic mechanisms.

The compromise that I would favour is to have an explicit checking
model, and specify that the automatic deallocator should diagnose
any objects that have been deallocated but are not deallocatable.
Even just an explicit call to check consistency would be a great
help.
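
One possible reading of that compromise, as a rough C sketch and not a design taken from the post: explicit frees are recorded rather than trusted, and a separate consistency call reports every object that was freed while references to it still exist. The handle table, the hand-maintained reference counts and all function names here are assumptions; a real implementation would have to discover the references itself, much as a garbage collector does.

```c
#include <assert.h>

#define MAX_OBJS 16

struct obj {
    int in_use;   /* slot has been handed out */
    int freed;    /* the program explicitly deallocated it */
    int refs;     /* references the program still holds */
};

static struct obj table[MAX_OBJS];

static void check_handle(int h)
{
    assert(h >= 0 && h < MAX_OBJS && table[h].in_use);
}

/* Allocate an object; returns a handle holding one reference, or -1. */
int obj_new(void)
{
    for (int h = 0; h < MAX_OBJS; h++) {
        if (!table[h].in_use) {
            table[h].in_use = 1;
            table[h].freed = 0;
            table[h].refs = 1;
            return h;
        }
    }
    return -1;
}

void obj_ref(int h)   { check_handle(h); table[h].refs++; }
void obj_unref(int h) { check_handle(h); table[h].refs--; }

/* Explicit deallocation: only recorded here, verified later. */
void obj_free(int h)  { check_handle(h); table[h].freed = 1; }

/* Consistency check: reclaim objects that were freed and are no
 * longer referenced, and return the number that were freed while
 * still referenced (deallocated but not deallocatable). */
int obj_check(void)
{
    int bad = 0;
    for (int h = 0; h < MAX_OBJS; h++) {
        if (!table[h].in_use || !table[h].freed)
            continue;
        if (table[h].refs > 0)
            bad++;
        else
            table[h].in_use = 0;
    }
    return bad;
}
```

A nonzero result from obj_check() is the diagnostic: the programmer said the object was finished with, and the bookkeeping disagrees.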

While the detailed point is rather off-group, the generic point is
very relevant. Modern architectures (both hardware and software)
have neglected error detection to the point that locating complex
errors is now MUCH harder than it was 25 years ago. At all levels,
at least from the ISA and device interfaces upwards, the approach
has been to reduce trapping because getting reliable trapping right
makes it very hard to run at extremely high speeds.

But defining errors out of existence does not eliminate them; in
general, it just changes them into logic errors, which are Someone
Else's Problem.

I am increasingly seeing this at the hardware/software interface.
If there is an intermittent transient error on high-speed equipment,
it is very rare for there to be any diagnostic tools between running
the hardware stress tests (which show clean) and the applications
themselves (which detect an impossible state). Somewhere, somehow,
something has made a logic error - and, if you can tell a firmware
from a (binary distribution) device driver problem, you are a wiser
man than me :-(

Tom Gardner

Nov 22, 2002, 7:08:15 AM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in
news:arku66$mk3$1...@pegasus.csx.cam.ac.uk:

> In article <s713cpu...@lagaffe.CS.Berkeley.EDU>,
> David Gay <dg...@lagaffe.CS.Berkeley.EDU> wrote:
>>
>>nm...@cus.cam.ac.uk (Nick Maclaren) writes:
>>> A few languages have required checking, and a VERY few have made it
>>> effectively impossible ('forbid' is not the right word). C belongs to the
>>> latter category.
>>
>>A case could be made that most languages with explicit deallocation
>>make checking rather hard. A dangling pointer is a type-safety violation
>>waiting to explode... I believe Ada places deallocation in the "unsafe"
>>category for precisely this reason?
>
> While that has a degree of truth in it, it is not so. What automatic
> deallocation does is to replace one sort of error with another (i.e.
> a bad pointer use by a logic error). It does not eliminate a cause
> of error, nor does it make checking for the error any easier, though
> it is possible to claim that it reduces (note reduces, not removes)
> the chances of the error causing complete application collapse and
> the failure of all application diagnostic mechanisms.
>
> The compromise that I would favour is to have an explicit checking
> model, and specify that the automatic deallocator should diagnose
> any objects that have been deallocated but are not deallocatable.
> Even just an explicit call to check consistency would be a great
> help.

That sounds as if it is of equivalent complexity to a "proper"
garbage collector, but without having the benefits of a GC.

Naturally a GC doesn't solve all problems, notably undesired
data-retention problems (aka data cancer), but -- presuming the
GC is correct -- at least you can reliably determine what's
holding onto the data.

Nick Maclaren

Nov 22, 2002, 7:41:39 AM

In article <Xns92CE7B782454A...@158.234.29.254>,

Tom Gardner <gard...@logica.com> writes:
|> >
|> > The compromise that I would favour is to have an explicit checking
|> > model, and specify that the automatic deallocator should diagnose
|> > any objects that have been deallocated but are not deallocatable.
|> > Even just an explicit call to check consistency would be a great
|> > help.
|>
|> That sounds as if it is of equivalent complexity to a "proper"
|> garbage collector, but without having the benefits of a GC.

It is of equivalent complexity, yes, but it provides advantages
that a 'standard' garbage collector doesn't.

|> Naturally a GC doesn't solve all problems, notably undesired
|> data-retention problems (aka data cancer), but -- presuming the
|> GC is correct -- at least you can reliably determine what's
|> holding onto the data.

The approach that I favour DOES go a very long way to solving that
one, by matching what the programmer wrote (and hence, presumably,
intended) to what is actually permitted by the language.

It can also detect the other, more subtle, issue where a subprogram
can be spun off as a disconnected vortex, and that a significant
proportion of the calculation takes external input, runs normally,
can be traced and debugged, has visible external effects - but has
no internal effect!

The fundamental issue here is that I regard reliable and efficient
error detection and handling as a first-class aspect of an
architecture, and feel that as much effort should be put into it
as to handling the 'working' cases, but that most current designs
don't.

Hardware may be better than software, but is following the same
path to hell :-(

Tom Gardner

Nov 22, 2002, 8:05:29 AM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in

news:arl8m3$2mt$1...@pegasus.csx.cam.ac.uk:

>
> In article <Xns92CE7B782454A...@158.234.29.254>,
> Tom Gardner <gard...@logica.com> writes:
>|> >
>|> > The compromise that I would favour is to have an explicit checking
>|> > model, and specify that the automatic deallocator should diagnose
>|> > any objects that have been deallocated but are not deallocatable.
>|> > Even just an explicit call to check consistency would be a great
>|> > help.
>|>
>|> That sounds as if it is of equivalent complexity to a "proper"
>|> garbage collector, but without having the benefits of a GC.
>
> It is of equivalent complexity, yes, but it provides advantages
> that a 'standard' garbage collector doesn't.
>
>|> Naturally a GC doesn't solve all problems, notably undesired
>|> data-retention problems (aka data cancer), but -- presuming the
>|> GC is correct -- at least you can reliably determine what's
>|> holding onto the data.
>
> The approach that I favour DOES go a very long way to solving that
> one, by matching what the programmer wrote (and hence, presumably,
> intended) to what is actually permitted by the language.

Hmm. I come from a background in which one programmer writes their
bit of code in ignorance of what another programmer will want/need
to do in a couple of years time. (Analogy: someone hacking the kernel
doesn't know what the user-level processes will do).

I don't think "your approach" is applicable to "my case". For "my
case" I think GC is the more easily justified technique.

> It can also detect the other, more subtle, issue where a subprogram
> can be spun off as a disconnected vortex, and that a significant
> proportion of the calculation takes external input, runs normally,
> can be traced and debugged, has visible external effects - but has
> no internal effect!

I don't understand what you mean by that. Is there a two line
example or description?

> The fundamental issue here is that I regard reliable and efficient
> error detection and handling as a first-class aspect of an
> architecture, and feel that as much effort should be put into it
> as to handling the 'working' cases, but that most current designs
> don't.

We're in violent agreement on that one. Error handling is one
of those cases in which the tail should wag the dog.

> Hardware may be better than software, but is following the same
> path to hell :-(

A standard game (when in the pub) is to get someone to try to
provide a decent statement of the boundary between hardware
and software; they fail :)

At least RF/optical engineers distrust their measurements until
they've verified that they are measuring what they think they are
measuring. That used to be the case for digital engineers, but is
becoming less so with time, regrettably.

Nick Maclaren

Nov 22, 2002, 8:26:31 AM

In article <Xns92CE852CC55A2...@158.234.29.254>,

Tom Gardner <gard...@logica.com> writes:
|> nm...@cus.cam.ac.uk (Nick Maclaren) wrote in
|> news:arl8m3$2mt$1...@pegasus.csx.cam.ac.uk:
|>

|> Hmm. I come from a background in which one programmer writes their
|> bit of code in ignorance of what another programmer will want/need
|> to do in a couple of years time. (Analogy: someone hacking the kernel
|> doesn't know what the user-level processes will do).

Well, so do I. Some code I wrote 30 years ago is in daily use
worldwide.

|> I don't think "your approach" is applicable to "my case". For "my
|> case" I think GC is the more easily justified technique.

Actually, it is, but it would take some explaining, and is probably
inappropriate for here. I never said that it was without any
disadvantages.

|> > It can also detect the other, more subtle, issue where a subprogram
|> > can be spun off as a disconnected vortex, and that a significant
|> > proportion of the calculation takes external input, runs normally,
|> > can be traced and debugged, has visible external effects - but has
|> > no internal effect!
|>
|> I don't understand what you mean by that. Is there a two line
|> example or description?

Not really. But consider an interactive application. A user makes
a complex request, which starts a subprogram to handle it. Due
to a bug, the pointer to the result location is freed and the
subprogram then becomes disconnected. It continues to run, and
can ask for input, produce output and confirmation, update files
and so on. But the results will get lost.

Yes, I have seen that - and in the case you were referring to
above!