> http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
If you keep the same die size, you get 2X the "area" in going
from the 90nm process to a 65nm process.
Assume that you can double the number of execution units, then
you can get to 256 Gflops.
If you further assume that you can get 1.5X the frequency from
the processor shrink, you can get to 384 Gflops.
Given these assumptions, you're still a bit shy of 1 Tflop.
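The same arithmetic as a back-of-envelope script, for anyone who wants to
twiddle the factors (the 128 Gflops baseline is the paper's 90nm number
implied by the 256 figure above; the 2X and 1.5X factors are just the
assumptions stated here, not anything taken from the paper):

# Rough scaling of the 90nm figure to 65nm.
# Assumptions (from the post above, not from the paper):
#   - 128 Gflops per chip at 90nm (so that 2x units -> 256)
#   - same die size at 65nm gives ~(90/65)**2 ~= 1.9x, call it 2x the units
#   - ~1.5x frequency from the shrink
base_gflops = 128.0
gflops_65nm = base_gflops * 2.0 * 1.5
print(gflops_65nm)   # 384.0 -- still well short of 1 Tflop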
--
davewang202(at)yahoo(dot)com
Don't forget the difference between "paper shows can be designed" and "is
practical to design..." The device has to be manufacturable (die size),
coolable, powerable, and of wide enough interest to be worth dumping many
millions of dollars into design and fab. I wonder what the NRE will be at
65 nm?
del cecchi
The authors assumed a pretty conservative 500MHz. Your 1.5 scaling
would put it at 750MHz. That still leaves plausible headroom to reach
1 Teraflop.
When I tried the above link, it didn't work, but you can find the
paper by searching the site on "Merrimac," which is the name the
authors have given to their architecture.
RM
More seriously, "is programmable by the unadventurous clots who
dominate the software industry". There have been many practical
designs for very high-speed computers that fell because of that
one.
For example, can anyone estimate what the ICL DAP would deliver if
implemented in 65 nm technology?
Regards,
Nick Maclaren.
<snip>
>
>Don't forget the difference between "paper shows can be designed" and "is
>practical to design..." The device has to be manufacturable (die size),
>coolable, powerable, and of wide enough interest to be worth dumping many
>millions of dollars into design and fab. I wonder what the NRE will be at
>65 nm?
>
The Imagine processor for image processing has a similar architecture
and has already been prototyped at 15micron to produce 11.8 Gflop
(32bit) operating at 288MHz. This work is cited by the SC2003 paper.
The L-cubed scaling of the SC2003 paper would put that chip at 55
Gflop on 90nm. Not quite 128Gflop, but not bad for a start.
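A quick sketch of that scaling, using 0.15 micron for the prototype (per
the correction a couple of posts down); "L-cubed" here is the usual
assumption that flops grow as unit density (1/L^2) times frequency (1/L):

# L-cubed scaling of the Imagine prototype number to 90nm.
imagine_gflops = 11.8            # 32-bit, 0.15 micron, 288 MHz
L_old, L_new = 0.15, 0.090       # feature sizes in microns
print(imagine_gflops * (L_old / L_new) ** 3)   # ~54.6, i.e. the ~55 Gflop above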
NRE would probably keep this all science fiction were it not for
streaming media. Whether a version of it will happen that is
appropriate to science is another story.
RM
>On Tue, 18 Nov 2003 14:29:45 -0600, "Del Cecchi" <cec...@us.ibm.com>
>wrote:
>
><snip>
>>
>>Don't forget the difference between "paper shows can be designed" and "is
>>practical to design..." The device has to be manufacturable (die size),
>>coolable, powerable, and of wide enough interest to be worth dumping many
>>millions of dollars into design and fab. I wonder what the NRE will be at
>>65 nm?
>>
>The Imagine processor for image processing has a similar architecture
>and has already been prototyped at 15micron to produce 11.8 Gflop
Arrgh! ...has already been prototyped at 0.15micron.
RM
It's easy to design something just to meet a "peak" specification.
I don't know how you can design a memory system and code the algorithm
fast enough to take advantage of anywhere near 1 Tflop on the 65nm
node, but if all you want is the 1 Tflop number, here's what you
need.
Start with ISSCC 2003 paper 19.1, Intel's paper on
"A 5 GHz Floating point Multiply-Accumulator in 90nm Dual Vt CMOS"
1 FMAC should get you the 10 SP Gflops @ 5 GHz on 90nm.
The testchip fabbed has a die area of about 2 mm^2, consumes 1.2W @
1.2V. The FMAC looks to be about 1/4 of the test chip, so about
0.5mm^2 for the FMAC @ 90nm. Assume that 1/2 of the power is consumed
by the FMAC, that's about 0.6W per FMAC
Now, if you can pack 100 of those on a 90nm chip, that's only
50mm^2 and 60W. No problem to get to 1 Tflop peak at 90nm.
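In script form, for anyone who wants to play with the numbers (the 1/4-area
and 1/2-power splits are the rough guesses made above, not figures from the
ISSCC paper):

# Peak-only arithmetic for a hypothetical 100-FMAC chip at 90nm,
# using the ISSCC 19.1 testchip figures quoted above.
fmac_gflops = 2 * 5.0    # fused multiply-add = 2 flops/cycle at 5 GHz
fmac_area_mm2 = 0.5      # guess: ~1/4 of the ~2 mm^2 testchip
fmac_power_w = 0.6       # guess: ~1/2 of the 1.2 W testchip

n_fmacs = 100
print(n_fmacs * fmac_gflops)     # 1000 Gflops = 1 Tflop peak
print(n_fmacs * fmac_area_mm2)   # 50 mm^2 of FMACs alone
print(n_fmacs * fmac_power_w)    # 60 W for the FMACs alone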
Now, how do you design a register file and cache subsystem to keep up
with 100 of these creatures @ 90nm (or 65nm)? How do you design a
memory system to support the required on/off chip data movement? How
do you design the control flow and data flow? I think the data and
control flow logic and the RF + caches + offchip I/O would make this
chip a monster at 90nm, and would dwarf the area of the FMACs themselves.
Even when we get to 65nm, such a chip would still be a rather large
creature.
Feasible? Sure. It's only money. Pay someone (in this case Intel,
but could be anyone) enough money, and I'm sure they can put together
a design for you. But is it practical? Is this what Cell would
look like? (64 FMAC's at ~8 GHz each, per chip)
My feeling is that no, this isn't a practical alternative to
suggest that this is what "Cell" is.
It would be interesting to design such a chip and put the 1 Tflop
(single chip) number out there for people to stare at though.
--
davewang202(at)yahoo(dot)com
<snip>
>
>Feasible? Sure. It's only money. Pay someone (in this case Intel,
>but could be anyone) enough money, and I'm sure they can put together
>a design for you. But is it practical? Is this what Cell would
>look like? (64 FMAC's at ~8 GHz each, per chip)
>
>My feeling is that no, this isn't a practical alternative to
>suggest that this is what "Cell" is.
>
>It would be interesting to design such a chip and put the 1 Tflop
>(single chip) number out there for people to stare at though.
There are a couple of questions at play here, and they're not all
equally interesting.
Q. Is the Stanford team pursuing an achievable objective or is it
science fiction?
A. Investigators at Stanford have fabbed a prototype that makes the
numbers of the SC2003 paper look at least reasonable.
Q. Is 1 Teraflop achievable on a 65nm process?
A. Probably yes, but who cares really. One teraflop is just a bogey.
Q. Will the PS3 implementation of Cell use streaming architecture?
A. Who cares. We'll find out when PS3 comes out.
Q. Will PS3 achieve 1 teraflop?
A. Double who cares. Somebody in marketing will have an explanation
for whatever they produce.
Q. Is it possible to obtain much higher throughput with lower power
with a streaming architecture than with a conventional microprocessor
architecture?
A. Yes.
Q. Will streaming architectures be used other than in GPU's?
A. That's the question I'd like to know the answer to.
RM
snip
> Q. Will streaming architectures be used other than in GPU's?
>
> A. That's the question I'd like to know the answer to.
I think the answer is all about software. The team showed substantial
speedup on some selected, recoded programs. The next questions, to which I
have no answers but people far more familiar with these kinds of
applications might, are:
1. Are there a substantial number of important applications that are
amenable to implementation in a stream friendly way?
2. How difficult is it to modify existing applications to make them
stream friendly?
--
- Stephen Fuld
e-mail address disguised to prevent spam
I agree strongly.
|> The next questions, to which I
|> have no answers but people far more familiar with these kinds of
|> applications might, are:
I have answers, but they don't help :-(
|> 1. Are there a substantial number of important applications that are
|> amenable to implementation in a stream friendly way?
Yes, but probably a minority.
|> 2. How difficult is it to modify existing applications to make them
|> stream friendly?
In some cases, easy. In others, foul.
However, the question isn't about applications so much as the
mathematical models used and the programming paradigms used to
implement those models. I.e. if you want to solve significantly
more problems that way, you have to rethink your approach.
Regards,
Nick Maclaren.
> Q. Will streaming architectures be used other than in GPU's?
Yes. As we make computers out of smaller and smaller devices, the cost
of communication (in terms of time and energy) grows too much to
continue building completely centralized architectures. These trends
will only accelerate as we get into funky sub-lithographic nano-wire/
tube technologies. In fact, I believe that soon we will see the fastest
single-threaded, strictly maintain the illusion of one instruction
executing after another-type processor that will ever be built.
How's that for a stake in the ground from a nobody on the internet?
Benjamin
<snip>
>In fact, I believe that soon we will see the fastest
>single-threaded, strictly maintain the illusion of one instruction
>executing after another-type processor that will ever be built.
>
The working assumption of many people (including me) is that, for the
most part, people will continue to think and code in a serial
imperative style for a long time, no matter what kind of processor is
actually doing the computing.
Even though we are pressing hard up against some limits imposed by
physics, I think that CPU designers and compiler writers are going to
have to figure out a way to maintain that illusion, at least at the
level of languages that everyday programmers use, for a long time.
I am optimistic in the thought that we have barely scratched the
surface of what is possible in terms of hiding parallelism, including
streaming parallelism, from users.
RM
For what it's worth: it's on par with other big truths like "in 30 years
there will be no more oil, so we'll have to use new energy forms", and
"we have enough weapons of mass destruction to blow the whole planet up,
we have to get rid of them". Obvious, true, and almost nobody seems to
give a damn, as long as the old ways appear to be working economically,
and as long as the economically successful are those who make the money.
Apologies to comp.arch readers for straying from the newsgroup topic.
I'll shut up again, now.
best regards
Patrick
I came across an exciting, and very troubling 1999 document from the
IBM web site about Blue Gene:
www.research.ibm.com/journal/sj/402/allen.pdf
As with most documents that come out of IBM research, the document is
candid and thorough. It lays out a program of research concerning
human proteins using the Blue Gene architecture. It is hard to
imagine a scientific undertaking of greater potential importance to
humanity. It gives a thorough summary of the background of the
problem and the approaches people have used to solve it.
These are smart people. They have thought hard about what they are
trying to do. But, in the end, they have decided to go ahead with a
project using computational resources that are inadequate to the task
they have laid out.
The problems they face are too numerous to discuss in a single web
post, but the bottom line is that they will have to use up months, if
not years, of computing time to get results using models that are at
best educated guesses.
If there is a parallel in the history of science, I am not aware of
it. The US wants to build the world's biggest computer, IBM wants to
build it for them, and both need a problem that justifies such an
enormous expenditure of money and talent. The conclusion that they
should reach, that the computational muscle available to them is not
up to the task they have proposed, is one they are unwilling to reach.
IBM is very proud, I am sure, of its refrigerator-sized furnace at
SC2003, and they should be. It just won't serve as a basis for
solving the problem they want to solve, and no computer built on the
old ways will.
RM
I agree with you, with emphasis on "for the most part". I think we can
get a whole lot more parallelism with modest changes to what programmers
and compilers have to do. I think a good analogy is the Cyclone
language (http://www.research.att.com/projects/cyclone/). Its designers
managed to make a language that looks a whole lot like C, but has much
stronger safety guarantees. I am currently working on new ISAs that
look a whole lot like more traditional ones, but allow a much higher
degree of parallel execution.
Benjamin
Like those other truths, "a third the population of the US will starve
by 1990", "we are heading into an ice age", "we can't do optical
lithography below the wavelength of light so we better build a cyclotron
for x-ray lithography". What would anyone want a computer in their
house for? There is a market for maybe five computers in the country.
I could throw in "what'll we do with all those mips" or some others.
We've been going to run out of oil in 30 years for 50 years.
del cecchi
The current rate of laying down new hydrocarbon stocks is quite low. The
current oil stocks were laid down in a much warmer climate. Oil extraction
and usage has not declined, so it's hard to see how we won't run out of oil
sometime, although it may be more than 30 years.
Peter
And why would anyone fool with risky streaming architectures when you
can build the world's biggest computer with Power4's running at
750MHz?
Obviously, you wouldn't, if you can get the powers that be to buy into
and/or sell the idea that said computer will be able to do the
impossible.
Sorry, Del, but Blue Gene makes about as much sense as the Space
Shuttle did.
RM
Actually the space shuttle made sense. It just didn't work out like
they envisioned. It was to be a prototype for a next generation of
reusable orbital vehicles, as I recall. Instead folks lost interest and
the prototype became the final. How like computer architecture is that?
As for blue gene, and recalling that I am a circuit designer and not an
architect, it is an initial member of what might be a family of
massively parallel supers. Why are you so convinced that some bizarre
streaming architecture is the answer, and massively parallel systems are
dinosaurs or kluges? At least that is what I interpret your reference
to the shuttle as meaning.
If the idea is to compute protein folding, then maybe using like a
million processors is in fact the way to go. And 128K processors is a
good start. Cray is gluing opterons together. IBM glued 690's
together, now they are gluing way more 440's or whatever.
I guess the proof of the pudding will be in the eating. We'll see if
useful science gets done. And the alternative proposals will get funded
if they can convince folks that they can do useful science. IBM would
even help them, subject to satisfactory terms and conditions.
del
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:tfbasvonv9lg42e7v...@4ax.com...
>> >
>>
>> And why would anyone fool with risky streaming architectures when you
>> can build the world's biggest computer with Power4's running at
>> 750MHz?
>>
>> Obviously, you wouldn't, if you can get the powers that be to buy into
>> and/or sell the idea that said computer will be able to do the
>> impossible.
>>
>> Sorry, Del, but Blue Gene makes about as much sense as the Space
>> Shuttle did.
>>
>
>Actually the space shuttle made sense. It just didn't work out like
>they envisioned. It was to be a prototype for a next generation of
>reusable orbital vehicles, as I recall. Instead folks lost interest and
>the prototype became the final. How like computer architecture is that?
>
Well, no it didn't, and it still doesn't. Trying to fly something
with wings to the ground from orbit and to get it to land like an
airplane is a dumb, and, as we have learned, a very dangerous idea.
The guys with blue uniforms and eagles and silver stars on their
epaulets got those eagles and silver stars, for the most part, flying
things with wings, and thus, wings it shall be. A parachute, you
know, is something you use when you've been shot down.
The space shuttle was an outgrowth of the DynaSoar lifting body
program proposed in the late 60's. There is just no way to get around
it: to dissipate the kinetic energy of orbiting around the earth, you
have to dissipate a lot of heat, and a lifting body just isn't the
right shape for doing it safely.
The space shuttle was sold to congress based on completely unrealistic
estimates of just about everything: cost, safety, schedule; and the
program as it has played out has borne no resemblance at all to what
was sold. Not mind you, because the rocket scientists didn't deliver,
but because the original plan was just plain stupid.
>As for blue gene, and recalling that I am a circuit designer and not an
>architect, it is an initial member of what might be a family of
>massively parallel supers. Why are you so convinced that some bizarre
>streaming architecture is the answer, and massively parallel systems are
>dinosaurs or kluges? At least that is what I interpret your reference
>to the shuttle as meaning.
>
I'm not a computer architect, either, but I can read, and I know at
least a little bit about the physics and the computational challenges
involved. The estimates of how unrealistic the enterprise is aren't
my own. They're taken straight out of an IBM document.
Were I a master of this subject (computer architecture), and I am not,
I would be able to make an eloquent and succinct statement of my own.
As it is, I can think of no more eloquent and succinct statement than
the one that has already been made: bandwidth is expensive, arithmetic
is free.
Not only am I convinced that the point the statement (which is a bit
of hyperbole) is trying to make is essentially true, but I believe
that the reality it is summarizing is going to dominate the future of
high-performance computation.
In a classical computer architecture, you fetch something from
somewhere (disk, main memory, cache, or even a register), you do
something with it, then you put it back. The balance in the cost of
"doing something with it" and fetching the data and putting it back
has changed dramatically and is going to keep changing in the same
direction until the hyperbolic statement of the Merrimac group is just
about literally true.
In a streaming architecture, you fetch data from somewhere, you do
something with it, you do something else with it, you do something
else with it, you do something else with it, ad nauseam, until you
have absolutely run out of operations you can stream together.
*Then*, and with great reluctance, you bear the cost of putting the
data where you will need to bear the cost of fetching it again.
And yes, I am implying that a classical architecture is a dumb plan
for problems like the protein-folding problem just like the shuttle is
a dumb way to get things into and out of orbit.
>If the idea is to compute protein folding, then maybe using like a
>million processors is in fact the way to go. And 128K processors is a
>good start. Cray is gluing opterons together. IBM glued 690's
>together, now they are gluing way more 440's or whatever.
>
Blue Gene and Red Storm make excellent sense for doing the best you
can with what you've got. The people who make the budget decisions in
Washington are trying to get supercomputing back on track by paying
people to build big machines. That's not an indefensible exercise. I
just wish they would put more money into moving us away from a
computer architecture that is showing its age and its limitations.
Too much money is going into sheer size and not enough into finding
new ways of doing business.
>I guess the proof of the pudding will be in the eating. We'll see if
>useful science gets done. And the alternative proposals will get funded
>if they can convince folks that they can do useful science. IBM would
>even help them, subject to satisfactory terms and conditions.
>
At one time, IBM was able to take great risks of its own in pursuing
fundamental research. It no longer can, and I don't expect it to.
RM
Uhhhh.... ok. You aren't an aerospace engineer or historian.
And it shows.
I don't know what you've been reading, but I'm going to suggest
that you get yourself to a library and/or Amazon and read
Dennis Jenkins _Space Shuttle: The History of the National
Space Transportation System: The First 100 Missions_
(ISBN 0963397451) as soon as you can.
Going backwards through your points...
The Space Shuttle program was originally sold assuming an R&D
program which was twice as expensive as the one which Congress
and Nixon's budgeteers eventually authorized. We will never
know whether the original concepts would have been as
safe, reliable, and operational as the original plan
intended. Congress said "Chop the budget in Half" and
NASA failed to rescope the deliverables used in PR speak,
but the vehicle changes that resulted were clearly known
to anyone looking at the analysis.
The wide variety (literally hundreds) of concept designs
before and after the rescope, going back into the early 1960s
in fact, are well documented in Jenkins' book (and other
places).
I am an aerospace engineer and have designed re-entry
vehicles (not that have flown, but done concept design
work) and I have *no* idea what you are talking about
regarding the lifting body shape not being 'safe'.
There are tradeoffs from ablative thermal protection
systems (heavier than tiles and metallic shingles and
the like, but can withstand higher peak loads) that
make them work better with capsules. But there is no
law of nature that reusable thermal protection is unsafe
or that lifting re-entry / lifting bodies are somehow
inherently unsafe. I say this as a capsule bigot and
someone who pushed very hard for OSP to consider capsules.
Capsules are cheaper. Safety can be done right or wrong
with either capsules or lifting bodies.
The comments regarding pilots and flying things are off
base as well. To reuse a large vehicle it has to be
flown to a relatively low velocity pinpoint landing.
At the time Shuttle was being designed, using propulsion
to land rather than wings was thought to be a borderline
crackpot idea. It took until DC-X for a flight demonstrator
to convince everyone otherwise. Even today, there are advocates
both for wings and for powered vertical landings, and there
are clear technical arguments to be made on either side.
If you would like to continue this thread,
the sci.space.policy and sci.space.shuttle newsgroups
might be more appropriate.
-george william herbert
gher...@retro.com
But that's not true.
Bandwidth x distance is expensive. Lots of bandwidth over
short distance is not very expensive. Lots of bandwidth across
small distances on a modern chip is very very very cheap.
For highly partitionable problems, the cost-effective
optimum partition size can be analyzed and modeled by
looking at the costs of transmitting partition cell
edge state to neighbors versus storing/calculating it
locally. For given problems and chip technologies there
are different optimizations.
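Something like the following toy model, say (the cost constants are made-up
placeholders, purely to show the shape of the tradeoff, not numbers for any
real chip or problem):

# Rough model of the partition-size tradeoff described above: for a cubic
# subdomain of side n, interior work grows as n**3 while the halo (edge
# state exchanged with neighbours) grows as 6*n**2.
def cost_per_cell(n, flops_per_cell=100.0, bytes_per_face_cell=8.0,
                  energy_per_flop=1.0, energy_per_byte_offchip=50.0):
    compute = n**3 * flops_per_cell * energy_per_flop
    comms   = 6 * n**2 * bytes_per_face_cell * energy_per_byte_offchip
    return (compute + comms) / n**3      # total energy per cell

for n in (4, 8, 16, 32, 64):
    print(n, round(cost_per_cell(n), 1))
# Communication per cell falls as 1/n, so beyond some partition size the
# cheap, local arithmetic dominates -- that is the knob being optimized.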
By putting a large number of tiny but moderately powerful
CPU/Custom FP units per ASIC chip, they are getting very
good average neighbor to neighbor bandwidth. Or at least
can do so and presumably did. Going off-chip to the
neighbors on a circuit board hurts, but again is subject
to cost / technology optimization, along with the
CPU capacity per unit and bandwidth internally...
This is not hard. This is actually computer architecture
childs play to rough sketch out. What's hard is finding
and formulating problems that play well with such architectures.
I have been told that proteins and 3-D FEM are very amenable
to such work, by people doing protein work commercially
and based on hints coming out of the places doing nuclear
weapons simulation (or so I think... they like being semi-
opaque...).
-george william herbert
gher...@retro.com
As a zeroth-order approximation, I would say it _is_ true. You added
the first-order and possibly second-order corrections.
I'm sure everybody has seen a cross-section of a mammalian - say, a human
- brain. You may remember that it consists of a lighter material - called
the white substance in anatomy - surrounded by a thin border of a darker
material, which is called the grey substance. This border is only about
2-3 millimeters thick.
Now, _all_ of the white substance is long-distance wiring - George's "going
off-chip to the neighbors on a circuit board". And this long-distance wiring
is a special invention of higher animals, basically a coax cable with
integrated interspersed amplification devices. This increases signal speed
from about 1 m/s to around 100 m/s, at the cost of losing density (I would
estimate about a factor of 30-100) and the cost of the amplifiers (energy,
homeostasis etc).
So most of that impressive organ is not the proverbial grey cells at all -
the bulk of it is just the dielectric of the coax cables!
The small border of grey substance does contain the arithmetic devices -
the neuron cell bodies (yes, even the nucleus is involved in that, e.g.,
in memory, by controlling synthesis of proteins) and the dendritic trees.
However, a large part of even that volume is wiring - the axonal trees,
which correspond somewhat to George's "getting very good average neighbor
to neighbor bandwidth". These are naked wires running at those slow 1 m/s,
but they are short, at most a few millimeters. (Remember this is an
asynchronous distributed machine running at an approximate cycle time of
the order of 100-200 Hz.) And because they are naked, you can - and the
brain does - pack a lot of them into a small volume. And it does so in
three dimensions.
In fact, the human brain is always on the border of equilibrium, due to
the dense packing in both white and grey substance, with regard to its
operating conditions (temperature, electrolytic balance, etc.), and can
destroy itself through too much activity (cf. epileptic seizures).
Physical damage to this system - e.g., through a stroke - occurs more to
the wiring than to the neurons. Due to the fact that the wiring is,
histologically, part of the neurons which don't like their axons being
cut, damage to the white substance also leads to some consequential damage
(usually quite distributed) of the grey substance.
So that impressive computing device, the mammalian brain, does consist
mainly of wiring, but the arithmetic is not quite for free - for one thing,
it must be carefully balanced locally and globally to avoid running out of
equilibrium. (There's a hypothesis that you can't have a brain much larger
than the human one because you cannot reliably cool it.)
Jan
Why "the future" - where, do you think, went the money in designing
and building a Cray-2, say?
> In a streaming architecture, you fetch data from somewhere, you do
> something with it, you do something else with it, you do something
> else with it, you do something else with it, ad nauseam, until you
> have absolutely run out of operations you can stream together.
> *Then*, and with great reluctance, you bear the cost of putting the
> data where you will need to bear the cost of fetching it again.
And I thought that was the whole point of optimizing compilers that do
blocking and interprocedural analysis...?
> The people who make the budget decisions in
> Washington are trying to get supercomputing back on track by paying
> people to build big machines. That's not an indefensible exercise. I
> just wish they would put more money into moving us away from a
> computer architecture that is showing its age and its limitations.
> Too much money is going into sheer size and not enough into finding
> new ways of doing business.
The problem will be that you will need both to be successful in this
endeavour: you need a solution scaleable to sufficient size. Building
small proof-of-principle demonstrators doesn't cut it.
> At one time, IBM was able to take great risks of its own in pursuing
> fundamental research. It no longer can, and I don't expect it to.
IBM no longer does - have they closed the Zurich lab, Almaden, San Jose,
...?
Jan
>Robert Myers <rmy...@rustuck.com> wrote:
>>>> Sorry, Del, but Blue Gene makes about as much sense as the Space
>>>> Shuttle did.
>>>
>>>Actually the space shuttle made sense. It just didn't work out like
>>>they envisioned. It was to be a prototype for a next generation of
>>>reusable orbital vehicles, as I recall. Instead folks lost interest and
>>>the prototype became the final. How like computer architecture is that?
>>
>>Well, no it didn't, and it still doesn't. Trying to fly something
>>with wings to the ground from orbit and to get it to land like an
>>airplane is a dumb, and, as we have learned, a very dangerous idea.
>>
<snip>
>
>Uhhhh.... ok. You aren't an aerospace engineer or historian.
>And it shows.
>
<snip>
>
>I am an aerospace engineer and have designed re-entry
>vehicles (not that have flown, but done concept design
>work) and I have *no* idea what you are talking about
>regarding the lifting body shape not being 'safe'.
I am an aerospace engineer. I am not a historian. I have a pretty
good idea of what the state of re-entry vehicle engineering was when the
shuttle first flew because I was involved in it at the time.
>There are tradeoffs from ablative thermal protection
>systems (heavier than tiles and metallic shingles and
>the like, but can withstand higher peak loads) that
>make them work better with capsules. But there is no
>law of nature that reusable thermal protection is unsafe
>or that lifting re-entry / lifting bodies are somehow
>inherently unsafe. I say this as a capsule bigot and
>someone who pushed very hard for OSP to consider capsules.
>Capsules are cheaper. Safety can be done right or wrong
>with either capsules or lifting bodies.
>
There may be no law of nature, but the engineering constraints are
pretty obvious.
One of the proposals for providing a lifeboat for present-day
astronauts is to pull one of the Apollo vehicles out of a museum,
modernize the electronics, put a new ablative heat shield on it, and
fly it again.
You can't put a new ablative heat shield on the shuttle because there
isn't one. If you allow a surface that you later intend to use like a
wing to cool iself by ablation, you are left with the aerodynamics of
whatever shape the erosion that occurred during reentry produced. You
can't even allow the surface to warp and ripple unpredictably, which
is why we have gaps in these tiles that are glued onto felt.
When the man on the street hears about the design details of shuttle
tiles, his common sense reaction is, "You've got to be kidding me."
It all comes from the impossible constraints produced by trying to put
an airplane wing through re-entry.
The aerodynamics of a capsule are pretty simple, and you can allow the
surface that takes the heat to ablate. It works, it is very
forgiving, and the aerodynamic surface you are counting on to get you
back to sea level at a reasonable speed is a parachute that was safely
stowed during reentry. By comparison, those damn tiles were
falling off faster than they could glue them back on at one point.
http://www.house.gov/science/hearings/space03/may08/myers.htm
Testimony to the House Subcommittee on Space and Aeronautics
On the Assessment of Apollo Hardware for CRV and CTV
by Dale Myers (no relation)
May 8, 2003
Although the team was not asked to compare the capsules to winged
vehicles, and we did not, I have some comments relative to wings vs.
capsules.
The Apollo Program never had a parachute failure in operation,
although we had failures during the test program. We had one
parachute fail due to N2O4 leaking onto the shrouds, but the vehicle
landed safely on two parachutes.
The Shuttle has had a wing failure, but the failure was apparently
caused by the foam insulation from the tank. Shuttle runway landings
have been 100% successful.
It appears to me that the robust launch escape system of Apollo, which
worked over a wide range from the launch pad to high altitude, will be
hard to beat in a winged vehicle.
This Apollo based system, without aerodynamic controls, wings, and
landing gear is clearly simpler.
The ablative replaceable heatshield is simpler to build and install
than the corresponding winged vehicle thermal protection system. We
already know the thermal distribution on the vehicle. With a land
landing, a reusable heatshield might apply to the Apollo system.
<snip>
If all things were equal, I’d choose winged vehicles. Unfortunately,
they are not known to be equal, and that’s why the team recommended a
thorough study of the Apollo CM/SM as a CRV/CTV.
<end quotation>
CRV/CTV=Crew rescue vehicle, crew transfer vehicle for the orbiting
space station.
As it is, Mr. Myers (again, no relation) is being very kind about
those fragile wings. You can put more engineering effort into the
foam insulation on the external fuel tank, but shuttle tiles have
fallen off in previous flights, and they will continue to fall off if
the current shuttle thermal management system continues to be used.
>
>The comments regarding pilots and flying things are off
>base as well. To reuse a large vehicle it has to be
>flown to a relatively low velocity pinpoint landing.
There is no reason that I know of that an Apollo-type system cannot be
used to meet all future needs of the US space program. Vehicles like
the shuttle never have and, in my opinion, never will get us beyond
LEO. The real reason capsules won't be given serious consideration is
the one I mentioned. The blue suits want an airplane. Have an
apoplectic fit if you like, but that's how it works.
<snip>
>
>If you would like to continue this thread,
>the sci.space.policy and sci.space.shuttle newsgroups
>might be more appropriate.
>
I have very little interest in pursuing this discussion. I answered
at length here because you, in effect, said that I did not know what I
was talking about here.
RM
>> Not only am I convinced that the point the statement (which is a bit
>> of hyperbole) is trying to make is essentially true, but I believe
>> that the reality it is summarizing is going to dominate the future of
>> high-performance computation.
>
>Why "the future" - where, do you think, went the money in designing
>and building a Cray-2, say?
>
>> In a streaming architecture, you fetch data from somewhere, you do
>> something with it, you do something else with it, you do something
>> else with it, you do something else with it, ad nauseam, until you
>> have absolutely run out of operations you can stream together.
>> *Then*, and with great reluctance, you bear the cost of putting the
>> data where you will need to bear the cost of fetching it again.
>
>And I thought that was the whole point of optimizing compilers that do
>blocking and interprocedural analysis...?
>
Current microarchitectures have the following paradigm:
Get something. Do something to it. Put it back.
There is not a thing a compiler can do about that. Between operations
(except in the case of an FMAC, which is an example of streaming),
operands have to sit in registers and they have to be ferried back and
forth.
The Merrimac authors were not referring to bandwidth to memory or even
to cache. They were talking about bandwidth within the CPU itself.
In a streaming architecture there is no getting and putting; the
output of one functional unit feeds right into the input of another.
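A toy way to see the difference (Python standing in as pseudocode here; f,
g, h are arbitrary per-element operations, nothing to do with any real ISA):

# Classical style: every intermediate result is "put back" somewhere
# (register file / cache / memory) and fetched again for the next op.
def classical(x, f, g, h):
    t1 = [f(v) for v in x]      # compute, store t1
    t2 = [g(v) for v in t1]     # fetch t1, compute, store t2
    return [h(v) for v in t2]   # fetch t2, compute, store result

# Streaming style: chain as many operations as possible per fetch, and
# only pay the store/fetch cost once the chain runs out.
def streaming(x, f, g, h):
    return [h(g(f(v))) for v in x]   # one fetch, three ops, one store per element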
>> The people who make the budget decisions in
>> Washington are trying to get supercomputing back on track by paying
>> people to build big machines. That's not an indefensible exercise. I
>> just wish they would put more money into moving us away from a
>> computer architecture that is showing its age and its limitations.
>> Too much money is going into sheer size and not enough into finding
>> new ways of doing business.
>
>The problem will be that you will need both to be successful in this
>endeavour: you need a solution scaleable to sufficient size. Building
>small proof-of-principle demonstrators doesn't cut it.
>
Erg. Japan has a world class vector processor, and the US is lashing
together thousands of Opterons and Power4's.
RM
>Robert Myers <rmy...@rustuck.com> wrote:
>>Were I a master of this subject (computer architecture), and I am not,
>>I would be able to make an eloquent and succinct statement of my own.
>>As it is, I can think of no more eloquent and succinct statement than
>>the one that has already been made: bandwidth is expensive, arithmetic
>>is free.
>
>But that's not true.
>
>Bandwidth x distance is expensive. Lots of bandwidth over
>short distance is not very expensive. Lots of bandwidth across
>small distances on a modern chip is very very very cheap.
>
That's why streaming architectures are attractive, and the point that
everyone seems to be missing, or at least that no one bothers to
acknowledge, is that it is movement of data *on the chip* that is
expensive--in terms of power consumption. I summarized the argument
in another post here, and I'm not going to summarize it again. It's
in the SC2003 Merrimac paper
http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
>For highly partitionable problems, the cost-effective
>optimum partition size can be analyzed and modeled by
>looking at the costs of transmitting partition cell
>edge state to neighbors versus storing/calculating it
>locally. For given problems and chip technologies there
>are different optimizations.
>
>By putting a large number of tiny but moderately powerful
>CPU/Custom FP units per ASIC chip, they are getting very
>good average neighbor to neighbor bandwidth. Or at least
>can do so and presumably did. Going off-chip to the
>neighbors on a circuit board hurts, but again is subject
>to cost / technology optimization, along with the
>CPU capacity per unit and bandwidth internally...
>
You really do need to read the paper.
>This is not hard. This is actually computer architecture
>childs play to rough sketch out. What's hard is finding
>and formulating problems that play well with such architectures.
>I have been told that proteins and 3-D FEM are very amenable
>to such work, by people doing protein work commercially
>and based on hints coming out of the places doing nuclear
>weapons simulation (or so I think... they like being semi-
>opaque...).
>
The protein folding stuff may all just be misdirection.
RM
Is there any indication that this is in any way a performance limiter!?
> The Merrimac authors were not referring to bandwidth to memory or even
> to cache. They were talking about bandwidth within the CPU itself.
> In a streaming architecture there is no getting and putting; the
> output of one functional unit feeds right into the input of another.
Ah blech, old wine in new skins - that's systolic architectures all over
again. No thanks.
> Erg. Japan has a world class vector processor, and the US is lashing
> together thousands of Opterons and Power4's.
So what? The "vector processor" part is irrelevant, in the sense that
Opterons (certainly) and Power4s (probably) already _are_ vector
processors: they just process short vectors per instruction. But
instruction issue, again, is not the limiting factor. See the optimization
paper for the Athlon: basically, you make the code (whether hand-written
in C et al. or compiler-generated) mimic a Cray, with L1 cache taking the
place of the vector registers.
The question is, what's a cost-effective way to get all those functional
units to work together on one problem? Or, to put it into a more operational
setting: given some objective goals such as providing a defined service
level of weather prediction (accuracy, latency, turn-around, number of local
predictions made), what is the setup that needs the least amount of money
to provide this service - with "money" including investment (hardware) costs
_and_ all costs required to get the solution to work (most importantly,
personnel).
I'd say _that_ question isn't settled one way or the other.
Jan
>> There is not a thing a compiler can do about that. Between operations
>> (except in the case of an FMAC, which is an example of streaming),
>> operands have to sit in registers and they have to be ferried back and
>> forth.
>
>Is there any indication that this is in any way a performance limiter!?
>
If you believe that joules per flop is the ultimate limiting factor,
then, yes, movement of data on the chip is the performance limiter.
Movement from a functional unit to a register accomplishes nothing,
and, as feature sizes shrink, the energy cost of moving the data will
exceed the cost of performing the arithmetic.
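A crude way to put numbers on that claim (all three energy figures below
are made-up placeholders in the spirit of the Merrimac argument, not
measurements from the paper or from any real process):

# Illustrative joules-per-flop accounting: the arithmetic itself versus
# hauling the operands around the chip.
E_FLOP      = 1.0    # one floating-point op (arbitrary units)
E_REG_MOVE  = 0.5    # operand to/from a nearby register file
E_CHIP_MOVE = 20.0   # operand shipped across the die to a far cache

def energy_per_flop(moves_local, moves_far):
    return E_FLOP + moves_local * E_REG_MOVE + moves_far * E_CHIP_MOVE

print(energy_per_flop(3, 0))   # 2.5  -- operands stay local (forwarding/streaming)
print(energy_per_flop(0, 3))   # 61.0 -- every operand crosses the die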
>> The Merrimac authors were not referring to bandwidth to memory or even
>> to cache. They were talking about bandwidth within the CPU itself.
>> In a streaming architecture there is no getting and putting; the
>> output of one functional unit feeds right into the input of another.
>
>Ah blech, old wine in new skins - that's systolic architectures all over
>again. No thanks.
>
No, more like a streaming architecture. Exactly like what is done in
modern GPU's.
As to the history of systolic architectures, please leave me out of
it. I ignored them when they were a hot research topic, and you'll
have to do better than name-calling to get me interested in learning
about them.
>> Erg. Japan has a world class vector processor, and the US is lashing
>> together thousands of Opterons and Power4's.
>
>So what? The "vector processor" part is irrelevant, in the sense that
>Opterons (certainly) and Power4s (probably) already _are_ vector
>processors: they just process short vectors per instruction. But
>instruction issue, again, is not the limiting factor. See the optimization
>paper for the Athlon: basically, you make the code (whether hand-written
>in C et al. or compiler-generated) mimic a Cray, with L1 cache taking the
>place of the vector registers.
>
Sure, you can get a modern CPU to do lots of things that are (what's
the right word?) homologous to more specialized architectures. You
just can't get them to do it nearly as fast.
I don't really think all that much of the Earth Simulator or of the
NEC vector processor because I believe that vector processors are too
specialized. I think something more like the Merrimac architecture is
the way to go.
>The question is, what's a cost-effective way to get all those functional
>units to work together on one problem? Or, to put it into a more operational
>setting: given some objective goals such as providing a defined service
>level of weather prediction (accuracy, latency, turn-around, number of local
>predictions made), what is the setup that needs the least amount of money
>to provide this service - with "money" including investment (hardware) costs
>_and_ all costs required to get the solution to work (most importantly,
>personnel).
>
>I'd say _that_ question isn't settled one way or the other.
>
No, I don't think it is, either. What bothers me is what I perceive
to be an enormous misallocation of resources. Too much money going
into building big machines with yesterday's architecture, not enough
going into more basic research on both hardware and software to find
the best way to address problems in a massively parallel fashion.
Yes, finding ways of programming something like Merrimac or a GPU to
do real problems is going to take a massive amount of work and it will
cost a lot of money. That money spent is an investment. Once you gain
the knowledge, you've got it forever. Blue Gene will be scrap in a
few years.
And one thing seems certain: as feature sizes shrink, the energy cost
of moving data around, even over very short distances, will dominate
all other energy costs. Even if energy cost nothing (and it
doesn't!), getting the energy out of the CPU will be the limiting
factor in how fast something like a protein-folding calculation can
proceed.
RM
> Actually the space shuttle made sense.
I have to disagree. Even if it didn't need an enormous amount of
maintenance between flights, and didn't have the rather inelegant
disposable fuel tank and pick-up-from-the-ocean side rockets (which are
an implementation issue) one remains with extra weight to lift that is
not strictly necessary.
With the multipliers involved (look how big a first stage of a rocket is
compared to the next stage, and so on - but if it was just one stage it
would be even worse) lifting extra weight is a lot of cost and effort.
And getting something down takes approximately (minus air resistance)
the same energy to decelerate it as to accelerate.
So if one needed 100 tons of rocket to lift a 1 ton useful load, one
would need a 10,000 ton rocket to bring the 100 ton rocket up that could
bring the 1 ton load back safely. This is the reason things are left
behind during the flight: no need to lug them along further.
Now if these multipliers weren't that large one would have the situation
of an airliner. The features to make it reusable probably help safety too.
These days a robot probably could do a lot of the work people are sent
up for.
Thomas
...
> Uhhhh.... ok. You aren't an aerospace engineer or historian.
> And it shows.
Hmmm. While I am neither an aerospace engineer nor an historian, I have
certainly heard people who *are* voice similar opinions, though less
bluntly.
...
> The Space Shuttle program was originally sold assuming an R&D
> program which was twice as expensive as the one which Congress
> and Nixon's budgeteers eventually authorized. We will never
> know whether the original concepts would have been as
> safe, reliable, and operational as the original plan
> intended. Congress said "Chop the budget in Half" and
> NASA failed to rescope the deliverables used in PR speak,
> but the vehicle changes that resulted were clearly known
> to anyone looking at the analysis.
>
> The wide variety (literally hundreds) of concept designs
> before and after the rescope, going back into the early 1960s
> in fact, are well documented in Jenkins' book (and other
> places).
That's all well and good, but does nothing to address the criticisms that
were offered. To do so, you'd have to demonstrate that the areas that were
changed materially affected the viability of the result, rather than simply
scaled back its scope.
>
>
> I am an aerospace engineer and have designed re-entry
> vehicles (not that have flown, but done concept design
> work) and I have *no* idea what you are talking about
> regarding the lifting body shape not being 'safe'.
> There are tradeoffs from ablative thermal protection
> systems (heavier than tiles and metallic shingles and
> the like, but can withstand higher peak loads) that
> make them work better with capsules. But there is no
> law of nature that reusable thermal protection is unsafe
> or that lifting re-entry / lifting bodies are somehow
> inherently unsafe.
To say that there's no known such law of nature avoids addressing the
reality that with today's technology (let alone the 30-year-old technology
used in the Shuttle) we seem to know how to create considerably safer
ablative shielding than reusable thermal protection. Until that situation
changes, the conclusion stands.
I say this as a capsule bigot and
> someone who pushed very hard for OSP to consider capsules.
> Capsules are cheaper.
Exactly. So the primary argument for the shuttle - that it would
dramatically *lower* the costs of placing and maintaining objects in orbit -
evaporates. The Russian concept of 'big dumb boosters' may not have pushed
the frontiers of several areas of technology that the Shuttle pioneered, but
as a means to reach and work in space cost-effectively it seems to have won
hands-down.
Safety can be done right or wrong
> with either capsules or lifting bodies.
While that is certainly true in a relative sense, when compared on an
absolute basis capsules seem considerably safer given current technologies.
>
>
> The comments regarding pilots and flying things are off
> base as well.
Really? I could imagine that the rationale owed as much to the potential
for developing related military technology (where the benefits of an ability
to operate in both atmosphere and vacuum are obvious - the same is also true
of single-stage-to-orbit vehicles) as to a simple 'fly-boy' mentality, but
that's about it: there is no obvious rationale whatsoever for the
atmospheric capabilities of the Shuttle (or SSTO) for NASA use, and several
good cost and safety arguments against it.
To reuse a large vehicle it has to be
> flown to a relatively low velocity pinpoint landing.
Reuse may be over-rated (at least it certainly never operated to reduce
Shuttle costs below that of its non-reusable competition), or the definition
biased. The only part of the system that usually needs to return to Earth
is the crew compartment, and reuse of that (minus its ablative shield)
should be relatively easy - especially if the landing uses water to cushion
the impact on the structure.
Given the difficulty of putting mass into even low Earth orbit, putting any
more up there than is necessary seems silly. The Shuttle structure not only
contains a large cargo bay (which usually returns to Earth empty) but also
its heavy and powerful main engines (and it must be strong enough to survive
their thrust) - neither of which is of much use once LEO has been attained.
As long as propulsion systems and fuel dominate cargo capacity in launch
vehicles (exactly the reverse of the situation with general-purpose
aircraft), using purpose-built vehicles built up from standard (and, where
feasible, reusable) components rather than attempting to create
general-purpose, reusable vehicles at coarser grain seems to make much more
sense.
- bill
> No, I don't think it is, either. What bothers me is what I perceive
> to be an enormous misallocation of resources. Too much money going
> into building big machines with yesterday's architecture, not enough
> going into more basic research on both hardware and software to find
> the best way to address problems in a massively parallel fashion.
On the other hand look at IA-64... Huge sums of money in
"tomorrows" architecture and yet those clunky old ones
seem to stay within touch of it on a tiny fraction of the
IA-64 budget.
The advantage of using stuff you already have to hand
when you're working on something new is that it's a known
quantity, it's almost certainly cheaper in the short-term
(important if you have limited funding), and it allows you
to prototype more, erm, accurately...
Not to mention the rigged-demo potential for the less
scrupulous out there...
> Yes, finding ways of programming something like Merrimac or a GPU to
> do real problems is going to take a massive amount of work and it will
> cost a lot of money. That money spent is an investment. Once you gain
> the knowledge, you've got it forever. Blue Gene will be scrap in a
> few years.
>
> And one thing seems certain: as feature sizes shrink, the energy cost
> of moving data around, even over very short distances, will dominate
> all other energy costs. Even if energy cost nothing (and it
> doesn't!), getting the energy out of the CPU will be the limiting
> factor in how fast something like a protein-folding calculation can
> proceed.
Which is precisely where densely packed MPPs running low-
power cores can help, if your app fits. :)
I think that it will be very tough to beat MPPs over the
next few years if you value high throughput and low power
consumption. Yes moving data around off chip is bad, but
there are certainly are apps that can live with that
penalty.
Super-fast cores are just going to eat PSUs and air-con
for breakfast. Furthermore to do a given amount of work
they will *still* need to move data on and off chip anyway
(perhaps not as much as MPPs in many applications YMMV).
Ultimately it's up to the customer to decide if their app
is a good fit. I don't think there will ever be a one-size
fits all solution for this particular pissing-contest.
Don't forget that MPPs have been used to crack open new
problem spaces for many years now. They've come in all
shapes and sizes too - quite often to fit a particular
problem space. Therefore I would argue that you would be
misguided to assert some other approach as being more
original than a MPP design... :)
You should take a look at the MPP machines that have
been and gone and compare the variety in their designs
to vector machines, or boring old Minis, Mainframes
and Micros. I think you will be surprised. Of course
having lots of kinks doesn't mean something is
inherently good, just more interesting at least. :)
Cheers,
Rupert
That is true, but it is only a part of the story.
>The advantage of using stuff you already have to hand
>when you're working on something new is that it's a known
>quantity, it's almost certainly cheaper in the short-term
>(important if you have limited funding), and it allows you
>to prototype more, erm, accurately...
This is NOT true! It is very often cheaper to start from scratch,
and often makes prototyping easier. It is common for a project
that is constrained to use what already exists to spend 75% of its
resources bypassing problems that simply would not exist if it
could have started from scratch.
The point is that you should always start off VERY, VERY simply,
because (a) you may get it roughly right, (b) you will have to
add complications to fix up nasty cases and improve performance
and (c) the second version will be bigger than the first. The
IA-64 project broke this rule, and the transputer one didn't.
Regards,
Nick Maclaren.
That's a pretty strong assertion Nick. Let's take a look
at the options open to the guys looking for a core for
their spiffy new MPP :
1) Design a new core & bring up a new simulator from
scratch.
2) Utilise a standard core that has been stuffed into
many ASICs that has a bunch of simulators and libraries
already tested and implemented.
Option #2 definitely looks easier for your common or
garden prototyper. This was what INMOS saw when they
embarked upon the RMC before SGS bought them. A friend
of mine whom I lived with for a few months was one of
the guys behind that particular re-usable core.
> that is constrained to use what already exists to spend 75% of its
> resources bypassing problems that simply would not exist if it
> could have started from scratch.
I do accept your point about unnecessary baggage being
a strong risk, but that really should be taken into
account when selecting your core for the job. This is
why there are lots of different shapes and sizes of
core out there.
> The point is that you should always start off VERY, VERY simply,
> because (a) you may get it roughly right, (b) you will have to
> add complications to fix up nasty cases and improve performance
> and (c) the second version will be bigger than the first. The
> IA-64 project broke this rule, and the transputer one didn't.
The Transputer appears to have been designed by a small
group of people having fun and who did it either because
they cared passionately about it or for the sheer hell
of it.
As a remotely related aside, and hopefully one that won't
lead to any horse-heads in my bed... The in-house CAD system
was called "Fat Freddy" (it's a comic that revolves around
a malignant tom-cat, some hippies and cannabis), it was
written in BCPL (Colin Whitby-Strevens worked for INMOS).
The CAD group did a rather dubious photo of one of them sat
at one of those Fat Freddy machines, stark bollock naked
grinning insanely. No prizes where they got that sick idea
from, I still have nightmares about it. Come to think of
it does anyone know if that infamous photo survived ? :)
Cheers,
Rupert
Yes and no. Firstly, "very often" does not mean "usually". What
I was disputing was that it is "almost certainly" cheaper to reuse.
I would be happy with "usually"!
I fully agree that you should always CONSIDER reusing, but that
doesn't mean that it should be an almost preordained conclusion.
Also, there are many other options, such as:
3) Reuse the ALUs, FPUs, etc. but rethink the way that they are
plugged together.
For example, starting from the position that either HP or Intel
was with the IA-64, there is no reason not to consider a totally
interrupt-free design. Did they? If not, why not? It certainly
could be done - subject to the (mild) constraints I have described
before.
Regards,
Nick Maclaren.
Well you should have said that you felt I was over-stating
my case. I certainly did *not* say it was a pre-ordained
conclusion either, the "almost" gives me a get-out clause. :)
> 3) Reuse the ALUs, FPUs, etc. but rethink the way that they are
> plugged together.
Over the past few years the impression I've formed about
core design is that the ALU/FPU design is relatively
trivial compared to the instruction scheduling logic.
Therefore reusing these things might not save you much
overall (that's not a reason to eschew re-using them of
course).
> For example, starting from the position that either HP or Intel
> was with the IA-64, there is no reason not to consider a totally
> interrupt-free design. Did they? If not, why not? It certainly
> could be done - subject to the (mild) constraints I have described
> before.
Pfft who knows. Perhaps they foresaw some implementation
issues with an interrupt-free design, perhaps not with
Mk.1, but maybe later. I'd rather not try and out-guess
those guys, besides I suspect they had more political crap
than usual to handle as well. :/
Cheers,
Rupert
You have a point! Sorry.
Regards,
Nick Maclaren.
...
> >Bandwidth x distance is expensive. Lots of bandwidth over
> >short distance is not very expensive. Lots of bandwidth across
> >small distances on a modern chip is very very very cheap.
> >
> That's why streaming architectures are attractive, and the point that
> everyone seems to be missing, or at least that no one bothers to
> acknowledge, is that it is movement of data *on the chip* that is
> expensive--in terms of power consumption. I summarized the argument
> in another post here, and I'm not going to summarize it again. It's
> in the sc2003 merrimac paper
>
> http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
>
> >For highly partitionable problems, the cost-effective
> >optimum partition size can be analyzed and modeled by
> >looking at the costs of transmitting partition cell
> >edge state to neighbors versus storing/calculating it
> >locally. For given problems and chip technologies there
> >are different optimizations.
> >
> >By putting a large number of tiny but moderately powerful
> >CPU/Custom FP units per ASIC chip, they are getting very
> >good average neighbor to neighbor bandwidth. Or at least
> >can do so and presumably did. Going off-chip to the
> >neighbors on a circuit board hurts, but again is subject
> >to cost / technology optimization, along with the
> >CPU capacity per unit and bandwidth internally...
> >
> You really do need to read the paper.
All right, I took you at your word, and have the following observations:
1. Existing local per-core (L1, and sometimes L2) caches certainly go a
long way toward addressing the paper's observation that communication
(between storage and computational unit) can take far more energy than
computation. For their example where sending 3 64-bit data elements about
15 mm. (from a large chip-wide cache like Itanic's L3 or POWER4+'s L2) took
20x as much power as performing a single computation on them, a local L1
cache with a mere 95% hit rate (hardly unrealistic for the kind of tuned
code being considered in the paper) would reduce that ratio to much more
like 1:1 on average (or perhaps 2:1 if the L1 communication distance
contributes a non-negligible amount; a local L2 such as Itanic's should help
ensure decent local hit rates and energy consumption in architectures that
employ very small L1s) - and such local caches scale down with smaller
feature sizes to preserve this relationship.
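As a back-of-envelope check of the hit-rate arithmetic: the sketch below
assumes the 20x chip-wide-cache figure cited from the paper and tries two
assumed costs (0x and 1x of a computation) for an L1 hit.

#include <stdio.h>

int main(void)
{
    double hit_rate = 0.95;           /* assumed L1 hit rate for tuned code */
    double global_cost = 20.0;        /* chip-wide cache access, in units of
                                         one computation (figure cited above) */
    double l1_cost[2] = { 0.0, 1.0 }; /* two assumed energy costs for an L1 hit */

    for (int i = 0; i < 2; i++) {
        double expected = hit_rate * l1_cost[i]
                        + (1.0 - hit_rate) * global_cost;
        printf("L1 hit cost %.0fx -> expected cost %.2fx per operation\n",
               l1_cost[i], expected);
    }
    return 0;   /* prints 1.00x and 1.95x, i.e. the 1:1 to 2:1 range above */
}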
2. The paper makes the far-too-common error of comparing current
general-purpose technology with future special-purpose vaporware.
Multi-core designs like Sun's 8-core Niagara (expected in 2005-6) and
Intel's reportedly 8-core Tanglewood (expected in 2006-7) are a major step
toward large per-chip numbers of computational units - especially to the
degree that each core has multiple such units and some form of
multi-threading to exploit them when significant workload parallelism is
present. These chips will almost certainly retain the local per-core caches
described above and hence offer potentially good
computation-vs.-communication power (and performance) characteristics; if
continued decreases in feature sizes start to make the local cache hit-rate
insufficient to mask chip-wide-cache latency (and communication power
consumption), the obvious solution is to break up the chip-wide cache into
separate, closer shared caches for subsets of the cores, and communicate
among these subsets just as multi-core chips communicate off-chip. The Sun
and Intel designs will also presumably employ sequential prefetch
optimizations at least between chip-wide cache and main memory, so the kind
of streaming access to main memory described in the paper should be possible
for the kind of specialized code it describes (and at least for
architectures that support prefetch hints this presumably would, or at least
potentially could, carry the data right through to the core-local caches).
I'm not going to spend enough time studying this to understand the detailed
nature of the specialized support in the proposed chip for pipe-lining
computations from one core to the next (rather than within a single core, as
is currently common) without the latency inherent in having the data pass
through a chip-wide cache: I do see some hand-waving about
better-than-chip-wide SRF locality, but exactly how this would be exploited
in such a computation (which at a minimum would appear to require knowledge
of which cores were near neighbors and which were not) is not clear at first
(quick) glance.
3. And of course the most important point is that the Sun and Intel designs
address not only specialized HPTC needs but significant commercial needs as
well - so they will be both relatively inexpensive due to highish volume and
well-maintained over time, neither of which is likely with a far more
specialized chip such as the one the paper proposes. The paper's estimate
of a $200 cost is laughable to the point of being absurd: just blindly
migrating the design to each new process generation as it came along, with
no enhancements to take advantage of the additional opportunities that opened
up, would likely make it far more expensive than that for the volumes that
it would likely generate, leaving aside both initial development costs and
marginal production costs. No company in their right mind would be likely
to touch this without the expectation of a market price approaching 5
figures to justify the effort and development risk.
IOW, I'd suggest that for the problem-space described people might well be
better off figuring out how best to use the commercial processors already in
the development pipeline to best effect - and that these designs might well
at least approach the capabilities of the special-purpose hardware that was
described (don't forget, they'll be clocked many times as fast as the target
described for that hardware). Unless I greatly over-estimate the expense of
creating large, complex, high-performance chips (and the recent experiences
with Itanic would seem to suggest otherwise), doing so for a small,
special-purpose market such as the one described seems to make no sense at
all.
- bill
What makes you think that Blue Gene *isn't* a streaming
architecture? Or a systolic one, or whatever? As long as the
"do something with it" part is sufficiently general-purpose, it
comes down to how you program the thing.
Sure, any of these cluster-in-a-box systems, particularly the
cache-coherent ones, can also be programmed as (approximations
of) conventional SMP systems (in the queuing theory sense), but
we all know where the bottlenecks are, and how best to push the
processors towards peak throughput: minimise communications.
The trick is formulating the algorithm for arbitrary problem X
to suit. I don't see that it's a failure of any sort to try to
get to "there" from "here" in a comfortable, gradual way.
Legacy code and legacy coding practices exist.
--
Andrew
Not heard of forwarding? One reason why the heavily
out-of-order CPUs of today get away with as few register ports
as they do. Values flit from one FU to the next just fine.
--
Andrew
<snip>
>All right, I took you at your word, and have the following observations:
>
>1. Existing local per-core (L1, and sometimes L2) caches certainly go a
>long way toward addressing the paper's observation that communication
>(between storage and computational unit) can take far more energy than
>computation. For their example where sending 3 64-bit data elements about
>15 mm. (from a large chip-wide cache like Itanic's L3 or POWER4+'s L2) took
>20x as much power as performing a single computation on them, a local L1
>cache with a mere 95% hit rate (hardly unrealistic for the kind of tuned
>code being considered in the paper) would reduce that ratio to much more
>like 1:1 on average (or perhaps 2:1 if the L1 communication distance
>contributes a non-negligible amount; a local L2 such as Itanic's should help
>ensure decent local hit rates and energy consumption in architectures that
>employ very small L1s) - and such local caches scale down with smaller
>feature sizes to preserve this relationship.
>
Computation costs are decreasing much faster than communication
costs. Whatever the relationship is today, 1:1 or 2:1, it will have
changed by a factor of four in five years, if you take communication
costs to scale as L and computation costs to scale as L-cubed and
assume (as the authors do) that L is halved every five years.
That still might not change the conclusion. As a peg in the ground,
I'd arbitrarily choose a factor of 10 improvement in energy
consumption as the point where it's worth considering a highly
specialized architecture. 32 kB of L1 cache is 4K double precision
words, or 64 sixty-four word vector registers. If I can actually
manage the L1 cache that way and I have a 1-cycle access latency to
L1, then, sure, I can use Itanium like a streaming processor. Can I
manage L1 cache that way?
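Both the scaling claim and the register-file arithmetic are easy to check
mechanically; the sketch below simply encodes the assumptions stated above
(communication energy scales as L, computation as L cubed, and L halves
every five years).

#include <stdio.h>

int main(void)
{
    double L_now = 1.0, L_then = 0.5;            /* L halves in five years */
    double ratio_now  = L_now  / (L_now  * L_now  * L_now);
    double ratio_then = L_then / (L_then * L_then * L_then);
    printf("communication/computation ratio grows by %.0fx in five years\n",
           ratio_then / ratio_now);              /* prints 4x */

    int words = 32 * 1024 / 8;                   /* 32 kB as 8-byte doubles */
    printf("32 kB = %d doubles = %d vector registers of 64 words\n",
           words, words / 64);                   /* 4096 doubles, 64 registers */
    return 0;
}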
>2. The paper makes the far-too-common error of comparing current
>general-purpose technology with future special-purpose vaporware.
That's an error? I thought it was a standard marketing technique.
>Multi-core designs like Sun's 8-core Niagara (expected in 2005-6) and
>Intel's reportedly 8-core Tanglewood (expected in 2006-7) are a major step
>toward large per-chip numbers of computational units - especially to the
>degree that each core has multiple such units and some form of
>multi-threading to exploit them when significant workload parallelism is
>present. These chips will almost certainly retain the local per-core caches
>described above and hence offer potentially good
>computation-vs.-communication power (and performance) characteristics; if
>continued decreases in feature sizes start to make the local cache hit-rate
>insufficient to mask chip-wide-cache latency (and communication power
>consumption), the obvious solution is to break up the chip-wide cache into
>separate, closer shared caches for subsets of the cores, and communicate
>among these subsets just as multi-core chips communicate off-chip.
We, of course, don't know how those CPUs will communicate. If it's
through a global L-3 cache, then we are back at the 20x energy cost
ratio today or 80x cost in 5 years. All kinds of strategies are
possible, but I doubt if core to core streaming is one that's in the
works. This is the point at which your defense of conventional
architectures seems to fall apart, but perhaps I am just being dense,
and you can enlighten me.
>The Sun
>and Intel designs will also presumably employ sequential prefetch
>optimizations at least between chip-wide cache and main memory, so the kind
>of streaming access to main memory described in the paper should be possible
>for the kind of specialized code it describes (and at least for
>architectures that support prefetch hints this presumably would, or at least
>potentially could, carry the data right through to the core-local caches).
For once I'm not going to obsess over access to main memory. In this
particular discussion, that's a detail, as far as I'm concerned.
>I'm not going to spend enough time studying this to understand the detailed
>nature of the specialized support in the proposed chip for pipe-lining
>computations from one core to the next (rather than within a single core, as
>is currently common) without the latency inherent in having the data pass
>through a chip-wide cache: I do see some hand-waving about
>better-than-chip-wide SRF locality, but exactly how this would be exploited
>in such a computation (which at a minimum would appear to require knowledge
>of which cores were near neighbors and which were not) is not clear at first
>(quick) glance.
But you just hand-waved away the whole point of a streaming
architecture.
>3. And of course the most important point is that the Sun and Intel designs
>address not only specialized HPTC needs but significant commercial needs as
>well - so they will be both relatively inexpensive due to highish volume and
>well-maintained over time, neither of which is likely with a far more
>specialized chip such as the one the paper proposes. The paper's estimate
>of a $200 cost is laughable to the point of being absurd: just blindly
>migrating the design to each new process generation as it came along, with
>no enhancements to take advantage of the additional opportunities that opened
>up, would likely make it far more expensive than that for the volumes that
>it would likely generate, leaving aside both initial development costs and
>marginal production costs. No company in their right mind would be likely
>to touch this without the expectation of a market price approaching 5
>figures to justify the effort and development risk.
>
Absolutely. A subtext to this discussion is that processors that are
up to the kinds of problems they claim Blue Gene is intended to
address aren't going to become available as a side product of
commercial processor development. That's the DoD's COTS fantasy, and
it's time to give it a decent burial.
How many billions of dollars would it be worth, though, to be able to
do molecular biology reliably on computers?
>IOW, I'd suggest that for the problem-space described people might well be
>better off figuring out how best to use the commercial processors already in
>the development pipeline to best effect - and that these designs might well
>at least approach the capabilities of the special-purpose hardware that was
>described (don't forget, they'll be clocked many times as fast as the target
>described for that hardware). Unless I greatly over-estimate the expense of
>creating large, complex, high-performance chips (and the recent experiences
>with Itanic would seem to suggest otherwise), doing so for a small,
>special-purpose market such as the one described seems to make no sense at
>all.
I certainly don't think you are, but I still don't believe we should
be building supercomputers with COTS processors.
RM
>On Wed, 26 Nov 2003 22:55:34 -0500, Robert Myers wrote:
>> In a streaming architecture, you fetch data from somewhere, you do
>> something with it, you do something else with it, you do something
>> else with it, you do something else with it, ad nauseum, until you
>> have absolutely run out of operations you can stream together.
>> *Then*, and with great reluctance, you bear the cost of putting the
>> data where you will need to bear the cost of fetching it again.
>
>What makes you think that Blue Gene *isn't* a streaming
>architecture? Or a systolic one, or whatever? As long as the
>"do something with it" part is sufficiently general-purpose, it
>comes down to how you program the thing.
>
Well, no it doesn't. It comes down to what kinds of facilities for
interprocessor communication exist, and I've heard nothing about Blue
Gene to indicate that processor to processor streaming is possible.
If it is, perhaps someone will provide a link or a clue.
>Sure, any of these cluster-in-a-box systems, particularly the
>cache-coherent ones, can also be programmed as (approximations
>of) conventional SMP systems (in the queuing theory sense), but
>we all know where the bottlenecks are, and how best to push the
>processors towards peak throughput: minimise communications.
>The trick is formulating the algorithm for arbitrary problem X
>to suit. I don't see that it's a failure of any sort to try to
>get to "there" from "here" in a comfortable, gradual way.
>
If you need to haul data a significant portion of the chip width to
use it once or twice, you're going to get killed on energy costs
relative to a true streaming architecture. You can't program that
away.
>Legacy code and legacy coding practices exist.
Yes, and if some people had their way, apparently we'd still be
writing x86 code well into the next century.
I don't know in what area you work or have worked, but HPC codes are
constantly being rewritten to accommodate new architectures.
RM
>For example, starting from the position that either HP or Intel
>was with the IA-64, there is no reason not to consider a totally
>interrupt-free design.
How would you do I/O and pre-emptive multitasking without interrupts?
--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace
...
> >Unless I greatly over-estimate the expense of
> >creating large, complex, high-performance chips (and the recent experiences
> >with Itanic would seem to suggest otherwise), doing so for a small,
> >special-purpose market such as the one described seems to make no sense at
> >all.
>
> I certainly don't think you are, but I still don't believe we should
> be building supercomputers with COTS processors.
How can you possibly justify such a statement without performing the kind of
cost/performance analysis I made at least a very rough stab at? It would
seem obvious that there is at least a *possible* range of costs implicit in
creating new architectures to support supercomputing for which using COTS
processors instead is eminently more sensible, because while they may offer
only a small percentage of the efficiency of a custom implementation they
may do so at sufficiently low cost to make them the better choice.
The only areas that come to mind where it might be worth, say, a 50x
increase or more in cost to obtain a 10x improvement in performance are
time-critical computations such as weather forecasting (and my vague
impression is that this particular activity may be at least somewhat
amenable to simply increasing the number of processors you throw at it) and
real-time automated battle analysis. In other areas, I suspect (with very
little acquaintance with HPTC to go on) that very often either you can throw
more (inefficient) general-purpose processors at the problem or you can be
just a bit more patient (because your time just isn't worth as much as the
hardware would cost - at least if you manage that time sensibly and divert
yourself to other useful work while waiting for the result).
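With the (purely hypothetical) numbers above - 10x the performance for 50x
the cost - the performance-per-dollar comparison is immediate:

#include <stdio.h>

int main(void)
{
    double perf_gain = 10.0, cost_gain = 50.0;   /* hypothetical figures */
    printf("custom perf per dollar, relative to COTS: %.2fx\n",
           perf_gain / cost_gain);               /* 0.20x */
    /* Per dollar the COTS machine does five times the work, so the custom
       box only pays off when elapsed time itself is the scarce resource. */
    return 0;
}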
As far as global use of limited resources, I tend to sympathize a lot more
with scientists who can't obtain the hardware required to get new
information at all than with those who are merely anxious to get it faster.
And, for that matter, with many other non-scientific funding needs. So it
seems just fine to me to let the free market decide whether you should get
the hardware you're asking for.
- bill
You *have* to program it away. If your program only knows how
to do one thing to each piece of data, then it doesn't matter
how "streaming" the architecture is.
>>Legacy code and legacy coding practices exist.
>
> I don't know in what area you work or have worked, but HPC codes are
> constantly being rewritten to accommodate new architectures.
If HPC codes could be written to accommodate streaming
architectures, then you'd already be seeing 100% FU utilisation
on the existing commodity-processor-based MPP HPC systems. You
can get (close to) that for some benchmarks (linpack), but that
doesn't seem to be the general case. The problem is a software
one (at least to a first order of approximation, and on todays
hardware).
My point is that it's all very well to extrapolate hardware
costs out five years and reach certain conclusions, as these IBM
guys have done, but if you still don't know how to get hardware
in that shape to do what you want to do, then its not very
useful, however energy efficient it is.
Perhaps we just have to pay the cost of data movement, until we
figure out how to avoid it.
People hate paying for things that they don't need to, so if
there's an alternative, it will be found.
--
Andrew
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:bt4dsvkj4eodnq3as...@4ax.com...
>> On Thu, 27 Nov 2003 17:02:07 -0500, "Bill Todd"
>> <bill...@metrocast.net> wrote:
>
>...
>
>> >Unless I greatly over-estimate the expense of
>> >creating large, complex, high-performance chips (and the recent experiences
>> >with Itanic would seem to suggest otherwise), doing so for a small,
>> >special-purpose market such as the one described seems to make no sense at
>> >all.
>>
>> I certainly don't think you are, but I still don't believe we should
>> be building supercomputers with COTS processors.
>
>How can you possibly justify such a statement without performing the kind of
>cost/performance analysis I made at least a very rough stab at? It would
>seem obvious that there is at least a *possible* range of costs implicit in
>creating new architectures to support supercomputing for which using COTS
>processors instead is eminently more sensible, because while they may offer
>only a small percentage of the efficiency of a custom implementation they
>may do so at sufficiently low cost to make them the better choice.
>
C'mon, Bill, if I have the nerve to call a program that got by high
level review panels, the Congress, and the President, and that the
American people stood by and waved their flags for "just plain
stupid," I'm obviously the sort of person who thinks he can get away
with saying anything. ;-).
Some sort of by-play has been going on between the powers that be and
companies like IBM:
Powers-that-be: "We need faster processors."
Companies like IBM: "We can build them. Just tell us how to pay for
them."
Powers-that-be: "You know. The same way you pay for everything else
you design and build."
Companies like IBM: "We build processors for which we can find lots of
buyers. The kinds of processors you want us to build won't have lots
of buyers."
Powers-that-be: "Well we need those processors. How do we get them
built?"
Companies like IBM: "Get out your checkbook."
>The only areas that come to mind where it might be worth, say, a 50x
>increase or more in cost to obtain a 10x improvement in performance are
>time-critical computations such as weather forecasting (and my vague
>impression is that this particular activity may be at least somewhat
>amenable to simply increasing the number of processors you throw at it) and
>real-time automated battle analysis. In other areas, I suspect (with very
>little acquaintance with HPTC to go on) that very often either you can throw
>more (inefficient) general-purpose processors at the problem or you can be
>just a bit more patient (because your time just isn't worth as much as the
>hardware would cost - at least if you manage that time sensibly and divert
>yourself to other useful work while waiting for the result).
>
I was just about to say, "You really need to read...," but I already
pulled that stunt in this thread.
There are forty million people worldwide infected with HIV. Five
million became infected in the last year according to the Washington
Post.
Computers won't necessarily find a cure or a vaccine for AIDS, but
computers could revolutionize molecular biology the way they
revolutionized aerodynamics. We are just at the point of being able
to contemplate the undertaking, but we need much more powerful
computers. We need to start trying harder, and we need to stop being
cheap about it.
>As far as global use of limited resources, I tend to sympathize a lot more
>with scientists who can't obtain the hardware required to get new
>information at all than with those who are merely anxious to get it faster.
>And, for that matter, with many other non-scientific funding needs. So it
>seems just fine to me to let the free market decide whether you should get
>the hardware you're asking for.
>
Having brought up the worldwide AIDS epidemic, I'm not going to make
the obvious retort, but it begins with "Well, then, why don't we just
let the free market decide...", and I think you know how it would end.
RM
>On Fri, 28 Nov 2003 00:03:01 -0500, Robert Myers wrote:
>> If you need to haul data a significant portion of the chip width to
>> use it once or twice, you're going to get killed on energy costs
>> relative to a true streaming architecture. You can't program that
>> away.
>
>You *have* to program it away. If your program only knows how
>to do one thing to each piece of data, then it doesn't matter
>how "streaming" the architecture is.
>
I have no idea what fraction of codes could be successfully rewritten
to exploit a streaming architecture, but if the architecture forces
you into the get it, do something, put it back mode, you can't program
that reality away. If you *do* have a streaming architecture
available, you still have to figure out how to take advantage of it.
If that's your point, of course I agree.
>>>Legacy code and legacy coding practices exist.
>>
>> I don't know in what area you work or have worked, but HPC codes are
>> constantly being rewritten to accommodate new architectures.
>
>If HPC codes could be written to accommodate streaming
>architectures, then you'd already be seeing 100% FU utilisation
>on the existing commodity-processor-based MPP HPC systems. You
>can get (close to) that for some benchmarks (linpack), but that
>doesn't seem to be the general case. The problem is a software
>one (at least to a first order of approximation, and on todays
>hardware).
>
The Burger King in my hometown has a sign near the place where they
keep the french fries: "It's okay to waste french fries." They'd
rather throw out french fries that are not at the peak of perfection
than have customers get turned off to one of their most profitable
menu items.
We need to start learning, "It's okay to leave execution units idle,"
in the sense that maximum exploitation of execution units is no longer
the desiteratum of computing. Moving data around with the goal of
maximizing use of execution units is the computational equivalent of
selling stale french fries.
>My point is that it's all very well to extrapolate hardware
>costs out five years and reach certain conclusions, as these IBM
>guys have done, but if you still don't know how to get hardware
>in that shape to do what you want to do, then its not very
>useful, however energy efficient it is.
>
The Merrimac team is from Stanford, not IBM.
A journey of 1000 miles begins with one step. As long as all our
effort is aimed at processors built on processors that maximize the
wrong thing, that's where we will remain stuck.
RM
A very simple front-end processor, like most mainframes used to employ?
Most of communication would be polled.
I'd still like a hw method to do task switching though, and timer
interrupts seems to work.
Likewise, it is sometimes nice to be able to say: "Hey, stop whatever
you're doing, this stuff over here is _really_ critical."
If you offload all of this as well, your main cpu ends up as a fancy
vector coprocessor, right?
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
You still don't. Your arguments against winged reusability are
not well informed.
>gher...@gw.retro.com (George William Herbert) wrote:
>>[...]
>
>There may be no law of nature, but the engineering constraints are
>pretty obvious.
>
>One of the proposals for providing a lifeboat for present-day
>astronauts is to pull one of the Apollo vehicles out of a museum,
>modernize the electronics, put a new ablative heat shield on it, and
>fly it again.
>
>You can't put a new ablative heat shield on the shuttle because there
>isn't one. If you allow a surface that you later intend to use like a
>wing to cool iself by ablation, you are left with the aerodynamics of
>whatever shape the erosion that occurred during reentry produced.
Yes, but wings which aren't perfectly even and smooth can fly
safely within certain parameters; wing icing is not immediately
lethal to aircraft, for example, though it is not friendly.
The Shuttle's detailed design and its wing loading / planform
etc are poorly optimized for the results of such a surface
roughening but generalizing to all winged RLV designs is
not reasonable.
If the surface ablation on the leading edge leads to unacceptable
roughness given your ablator and flight profile then an ejectable
ablator covering the leading edge is quite easy to engineer.
>You can't even allow the surface to warp and ripple unpredictably, which
>is why we have gaps in these foam tiles that are glued onto felt.
The tiles are stiff and expand noticeably when heated, while
the underlying structure is stiff and does not expand much because
it is not heated; the differential expansion would cause cracks
at peak re-entry heating if the tiles were a single solid sheet.
Surface warping is not critical. Note that shuttle missions
have suffered hundreds of tile losses or damaged tiles before
and not had any negative effects of note prior to the thermal
damage done when a whole leading edge RCC panel failed on Columbia
(we think).
>When the man on the street hears about the design details of shuttle
>tiles, his common sense reaction is, "You've got to be kidding me."
>It all comes from the impossible constraints produced by trying to put
>an airplane wing through re-entry.
The constraints are not impossible.
I am not a winged RLV fan, but your claims are not a fair
or accurate insight into the state of the art.
Nobody can do reasonable trade studies without
having an accurate view of the competing concepts'
real feasibility, and your vehement rejection of
wings is not accurate or reasonable.
>The aerodynamics of a capsule are pretty simple, and you can allow the
>surface that takes the heat to ablate. It works, it is very
>forgiving, and the aerodynamic surface you are counting on to get you
>back to sea level at a reasonable speed is a parachute that was safely
>stowed during reentry.
Yes. Which is why I like capsules.
> By comparison, those damn foam tiles were
>falling off faster than they could glue them back on at one point.
>
>http://www.house.gov/science/hearings/space03/may08/myers.htm
>[...]
The adhesion problems were solved to an acceptable level
years and years and years ago.
Tile adhesion was not adequately analyzed and engineered
during shuttle development, true. Function of the schedule
and budget constraints and lack of any serious experimental
program to test the technologies.
>[snipped quote from Apollo hardware reuse for NASA's
> Orbital Space Plane (OSP) program]
>
>As it is, Mr. Myers (again, no relation) is being very kind about
>those fragile wings. You can put more engineering effort into the
>foam insulation on the external fuel tank, but shuttle tiles have
>fallen off in previous flights, and they will continue to fall off if
>the current shuttle thermal management system continues to be used.
And they will continue to be a minor issue if the rates of tile
loss are kept to historical levels. The design was done with
the assumption that they'd lose tiles; so far, the tile losses
haven't affected anything other than the turnaround time having
to put replacement tiles back on.
>>The comments regarding pilots and flying things are off
>>base as well. To reuse a large vehicle it has to be
>>flown to a relatively low velocity pinpoint landing.
>
>There is no reason that I know of that an Apollo-type system cannot be
>used to meet all future needs of the US space program. Vehicles like
>the shuttle never have and, in my opinion, never will get us beyond
>LEO. The real reason capsules won't be given serious consideration is
>the one I mentioned. The blue suits want an airplane. Have an
>apoplectic fit if you like, but that's how it works.
I'm having to stop banging my head on the wall to finish my
post here.
Robert:
1) I design manned capsules, and not winged RLVs.
2) I had some discussions about participating in OSP.
3) If it were not for a family emergency I would have presented a
paper on capsule OSPs at the Space Access conference last year.
4) Though I am not participating in OSP, I know people on the
engineering teams that are.
5) Capsules were under consideration from day one.
6) Capsules are *most* of the designs that have survived the
initial trade studies, and extremely positive comments about
them have come out of both NASA and all the vendors.
7) The former head of the astronaut office, now working for one
of the OSP vendors, is one of the loudest capsule proponents.
8) The odds of the OSP that flies being a capsule are very very high,
at this point.
9) All of that said, I do not judge that you are fairly or
reasonably assessing the advantages and disadvantages of
wings on re-entry vehicles. You are not showing any awareness
of the alternative thermal protection systems or the actual
detailed issues associated with the RCC/Tiles/blankets thermal
protective system on the Shuttle, and have gotten a bunch of
the details wrong on the tiles.
I do not like wings because they are expensive and heavy;
your arguments against them are kneejerking.
-george william herbert
gher...@retro.com
I am still not getting this.
The criticism was relating to the Blue Gene box,
which is not a general purpose system moving large
amounts of data globally on-chip. It is a system with
large quantities of very small CPUs with associated
FPUs and local memory on chip, which don't need much
global routing of data across the chip for well partitioned
problems (and, poorly partitioned problems are bad matches
for the sea-of-processors solution anyways).
>>For highly partitionable problems, the cost-effective
>>optimum partition size can be analyzed and modeled by
>>looking at the costs of transmitting partition cell
>>edge state to neighbors versus storing/calculating it
>>locally. For given problems and chip technologies there
>>are different optimizations.
>>
>>By putting a large number of tiny but moderately powerful
>>CPU/Custom FP units per ASIC chip, they are getting very
>>good average neighbor to neighbor bandwidth. Or at least
>>can do so and presumably did. Going off-chip to the
>>neighbors on a circuit board hurts, but again is subject
>>to cost / technology optimization, along with the
>>CPU capacity per unit and bandwidth internally...
>>
>You really do need to read the paper.
The paper isn't addressing this at all.
It's asserting in section 2 that global wires are
a bad thing (or, more precisely, a slow, high power
consumption, high area thing), which anyone who's
looked at chip design knows to be a true statement.
But global wires are a feature of monolithic
chips which are laid out in a geometrically
constrained manner. Chips with a sea of similar
subcomponents which mostly talk within their
functional blocks aren't using global wires.
When they do have to exchange data, if the data
set partitioning is done well to match the system
then the data exchange between functional blocks
is short distance (order of internal block wire
lengths) neighbor to neighbor. A 2-d partitioned
problem done that way is very friendly to that
system architecture. A 3-d partitioned problem
done that way, with each functional element computing
a 1-D slice of the problem, is also quite
friendly to that system architecture.
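The trade George is describing can be sketched numerically. The cost
weights below are invented; only the shape matters - interior work grows
as n cubed while the edge state exchanged with neighbours grows as n
squared, so communication per unit of work falls as the partition grows.

#include <stdio.h>

int main(void)
{
    double compute_cost  = 1.0;    /* per interior cell, arbitrary units   */
    double exchange_cost = 20.0;   /* per face cell shipped to a neighbour */

    for (int n = 4; n <= 64; n *= 2) {                /* cubic block, side n */
        double work = compute_cost  * (double)n * n * n;
        double comm = exchange_cost * 6.0 * (double)n * n;   /* 6 faces */
        printf("n = %2d   comm/work = %5.2f\n", n, comm / work);
    }
    return 0;
}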
This is not criticism of the paper's preferred
architecture; I am still cogitating on that.
My point is this: a Blue Gene -like system done
right is not using the long on chip wires that
the authors of the Stream paper correctly point
out is a high power and area and time impact on
monolithic chip design. Using their comments
on the negative impact of long wires to criticise
a computing system which doesn't use them much
is a non sequitur.
>>This is not hard. This is actually computer architecture
>>childs play to rough sketch out. What's hard is finding
>>and formulating problems that play well with such architectures.
>>I have been told that proteins and 3-D FEM are very amenable
>>to such work, by people doing protein work commercially
>>and based on hints coming out of the places doing nuclear
>>weapons simulation (or so I think... they like being semi-
>>opaque...).
>>
>The protein folding stuff may all just be misdirection.
Some people are paying Big Money to do protein folding work,
regardless of what the nuclear weapons labs are buying.
-george william herbert
gher...@retro.com
> Computers won't necessarily find a cure or a vaccine for AIDS, but
> computers could revolutionize molecular biology the way they
> revolutionized aerodynamics. We are just at the point of being able
> to contemplate the undertaking, but we need much more powerful
> computers. We need to start trying harder, and we need to stop being
> cheap about it.
I would have to disagree. My PhD was in protein structure prediction
and now I work in genomics and statistical genetics. Revolutionary
advances in these fields are going to come from new algorithms, not
from running existing algorithms on faster computers. Faster
computers sometimes do help, of course. But I think throwing money at
the hardware side of the problem now is a waste, since we don't really
know yet what the "right" algorithms will look like.
-- Dave
I/O is trivial; interrupts always were a bad way of doing it, and many
older systems did it other ways. Consider, for example, I/O requests
that don't block and I/O responses that are put into a queue for
handling when the driver next gets around to it. FIFOs are not exactly
unusual in hardware :-)
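A minimal sketch of what such a completion FIFO might look like from the
driver's side - the struct layout and names are invented, not any real
hardware interface:

#include <stddef.h>
#include <stdio.h>

#define QLEN 64

struct io_completion { int tag; int status; };

struct io_queue {
    struct io_completion ring[QLEN];
    volatile size_t head;   /* advanced by the device (or front-end CPU) */
    size_t tail;            /* advanced by the driver when it polls      */
};

/* Called from the driver's main loop instead of from an interrupt handler:
   returns 1 and fills *out if a completion is waiting, 0 otherwise. */
static int poll_completion(struct io_queue *q, struct io_completion *out)
{
    if (q->tail == q->head)
        return 0;
    *out = q->ring[q->tail % QLEN];
    q->tail++;
    return 1;
}

int main(void)
{
    static struct io_queue q;          /* pretend the device filled this in */
    q.ring[0].tag = 7;
    q.ring[0].status = 0;
    q.head = 1;

    struct io_completion c;
    while (poll_completion(&q, &c))    /* drain whenever we get around to it */
        printf("I/O %d finished, status %d\n", c.tag, c.status);
    return 0;
}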
Alternatively, you can go for a seriously multi-threaded approach and
have all I/O operations blocking, so you need to spin off a new thread
for each one. Not a major issue if you have enough threads available,
but it needs a lot of hardware threads - which is perfectly possible
on modern chips.
Multitasking was done without interrupts on quite a few of the earlier
machines - what you mean is preemptive multitasking (i.e. not the
coroutine approach). Each process yields control when it gets to
suitable points, which are required to be no further apart than a
certain time. Apple took that approach for a long time, and its
only problem is that a failing program can block the hardware thread.
My solution to that was an architectural requirement for a 'yield'
instruction to be called every (say) M instructions or N memory
references, whichever comes first, and to abort the process if it
failed to do so. Dead easy to implement.
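A sketch of that yield-budget rule as it might be modelled in software;
the budget value and the names are arbitrary, and real hardware would
apply this per retired instruction (and count memory references too):

#include <stdio.h>
#include <stdlib.h>

#define YIELD_BUDGET 10000   /* stand-in for the architectural 'M' */

struct hw_thread { long countdown; };

/* What the hardware would do as each instruction retires: YIELD refills
   the budget (and is where the scheduler may switch tasks); anything else
   decrements it, and running dry aborts the offending process. */
static void retire(struct hw_thread *t, int is_yield)
{
    if (is_yield) {
        t->countdown = YIELD_BUDGET;
        return;
    }
    if (--t->countdown <= 0) {
        fprintf(stderr, "yield budget exhausted: aborting process\n");
        exit(1);
    }
}

int main(void)
{
    struct hw_thread t = { YIELD_BUDGET };
    for (long i = 0; i < 50000; i++)
        retire(&t, (i % 5000) == 0);   /* a well-behaved program yields often */
    printf("program ran to completion\n");
    return 0;
}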
TLB misses are similarly easy to deal with, and machine checks are
separated into 'process abort' ones and ones fixed up in hardware
and queued to a separate thread for logging. Nobody sane should
handle timers/clocks by interrupt.
So where's the problem? :-)
Regards,
Nick Maclaren.
The Jenkins book and related Shuttle program technical
histories demonstrate that the scalebacks materially affected
the reliability, maintainability, reusability, and margins
of the resulting vehicle.
>> I am an aerospace engineer and have designed re-entry
>> vehicles (not that have flown, but done concept design
>> work) and I have *no* idea what you are talking about
>> regarding the lifting body shape not being 'safe'.
>> There are tradeoffs from ablative thermal protection
>> systems (heavier than tiles and metallic shingles and
>> the like, but can withstand higher peak loads) that
>> make them work better with capsules. But there is no
>> law of nature that reusable thermal protection is unsafe
>> or that lifting re-entry / lifting bodies are somehow
>> inherently unsafe.
>
>To say that there's no known such law of nature avoids addressing the
>reality that with today's technology (let alone the 30-year-old technology
>used in the Shuttle) we seem to know how to create considerably safer
>ablative shielding than reusable thermal protection. Until that situation
>changes, the conclusion stands.
You're saying this like there aren't any failure modes for
ablative heatshields, Bill, and that's not even vaguely true...
They're large, thin composite structures engineered to vaporize
in a controlled manner, not to be as strong as possible. There have
been cracks in them in processing and due to damage (Apollo 1), and
there were very real fears in Apollo 13 that the heat shield was
busted by the tank explosion. They are not easily testable;
you can test reusable TPS incrementally, but ablators are
one-use-only items.
Again, I say these things as an ablator fan and a capsule fan;
I like using them, but doing so responsibly requires knowing
the tradeoffs, and the limitations of the technology not just
its advantages. Application of hubris to aerospace technology
often leads to fatalities, even with generally well known
and well liked and robust technologies.
>> I say this as a capsule bigot and
>> someone who pushed very hard for OSP to consider capsules.
>> Capsules are cheaper.
>
>Exactly. So the primary argument for the shuttle - that it would
>dramatically *lower* the costs of placing and maintaining objects in orbit -
>evaporates. The Russian concept of 'big dumb boosters' may not have pushed
>the frontiers of several areas of technology that the Shuttle pioneered, but
>as a means to reach and work in space cost-effectively it seems to have won
>hands-down.
A reusable vehicle, to cost less, has to have lower per-flight costs,
AND has to fly often enough to amortize its greater development
costs over the flights it flies.
Shuttle was backed into the corner of the trade space where it
wasn't going to be reusable enough and robust enough to be
able to fly often enough to do that. By having its development
budget axed, and then not doing a proper top level systems
capabilities requirements reassessment to see if the amount of
money available really was enough to do RLV or not.
>> Safety can be done right or wrong
>> with either capsules or lifting bodies.
>
>While that is certainly true in a relative sense, when compared on an
>absolute basis capsules seem considerably safer given current technologies.
There have been four fatal flight accidents in manned spaceflight
history (ignoring for the moment ground training and such);
Soyuz 1 and 11, Challenger and Columbia.
One was launch phase, two were re-entry, one was a re-entry failure
due to a launch phase incident.
Soyuz 1 was a parachute failure. Those nice, reliable parachutes
that everyone keeps saying never fail? Oh, and we also lost one
of them on an Apollo, though the redundant units were enough to
prevent any serious mishap that time.
Soyuz 11 was a systems failure (stuck valve depressurized the
craft during re-entry) and was unrelated to the vehicle's
shape and re-entry mode.
Challenger was the ascent loss. Shuttle wasn't designed to
survive certain failure modes during ascent, due to budget
cutbacks forcing them to a design space where they couldn't
solve some of the technical problems reasonably. No reasonable
thrust termination and escape rocket systems were likely
possible once the large SRBs were baselined. Russia has
had two launch accidents with Soyuz which were not fatal.
A better done winged vehicle wouldn't necessarily be
dangerous like Shuttle is, though.
Columbia was... complicated. A lot of details specific
to the engineering of the Shuttle all collided in negative
ways at once.
>> The comments regarding pilots and flying things are off
>> base as well.
>
>Really? I could imagine that the rationale owed as much to the potential
>for developing related military technology (where the benefits of an ability
>to operate in both atmosphere and vacuum are obvious - the same is also true
>of single-stage-to-orbit vehicles) as to a simple 'fly-boy' mentality, but
>that's about it: there is no obvious rationale whatsoever for the
>atmospheric capabilities of the Shuttle (or SSTO) for NASA use, and several
>good cost and safety arguments against it.
The reason for the wings configuration is crossrange on re-entry.
High hypersonic lift to drag. NASA wanted lower hypersonic L/D
and lower crossrange; the DOD wanted higher crossrange so they could
loft certain payloads into polar orbit and return to land in one
orbit if they had to, and certain related mission issues.
Fundamentally, again: if you're going to reuse the vehicle
it has to land softly and to land softly it has to be
able to land on specific points and at specific attitudes.
Reusing a capsule which lands blunt end down on random
soil may not be possible, even if you want to.
Salt water is very hard on aerospace equipment.
To reuse you need to either make very rugged equipment,
or land it softly with wings or with rocket motors.
>> To reuse a large vehicle it has to be
>> flown to a relatively low velocity pinpoint landing.
>
>Reuse may be over-rated (at least it certainly never operated to reduce
>Shuttle costs below that of its non-reusable competition), or the definition
>biased. The only part of the system that usually needs to return to Earth
>is the crew compartment, and reuse of that (minus its ablative shield)
>should be relatively easy - especially if the landing uses water to cushion
>the impact on the structure.
Water immersion generally voids the warranty on most aerospace
alloys. Jets that skid off the runway into the ocean, even if they
just get lightly soaked, are junked. Same thing with capsules.
Stainless steel may not be so subject to it, and Titanium isn't so
bad, but most capsule structures are aluminum.
>Given the difficulty of putting mass into even low Earth orbit, putting any
>more up there than is necessary seems silly. The Shuttle structure not only
>contains a large cargo bay (which usually returns to Earth empty) but also
>its heavy and powerful main engines (and it must be strong enough to survive
>their thrust) - neither of which is of much use once LEO has been attained.
The engines still being along is a function of the rocket being a
stage-and-a-half vehicle, not it being reusable. Any single stage
or stage-and-a-half rocket has the same problem, going back to
the early Atlas rockets.
The cargo bay is, in fact, not a major weight component of
the overall shuttle structure.
The whole airframe, needed to return all the parts safely
to the ground, *is* a major weight component... an expendable
vehicle with the same basic technologies would save at least
fifty tons over the shuttle's weight.
>As long as propulsion systems and fuel dominate cargo capacity in launch
>vehicles (exactly the reverse of the situation with general-purpose
>aircraft), using purpose-built vehicles built up from standard (and, where
>feasible, reusable) components rather than attempting to create
>general-purpose, reusable vehicles at coarser grain seems to make much more
>sense.
Fuel is cheap. If you ever build a rocket where the fuel bill
is a noticeable fraction of the launch cost, you will receive
serious congratulations from the space launch community.
Using nitrogen tetroxide is cheating, however.
Propulsion systems are not cheap, however.
We could go on ad infinitum about big dumb boosters;
I happen to agree that they're a really good technology
option at this place and time, but if the volume of
launches increased by a factor of ten, as it might
if costs dropped down to low hundreds of dollars per
pound launched, then reusable vehicles will likely
be a significantly more cost effective option from
that point onwards.
I say that as someone with man-years of so far unreimbursed
engineering time into some BDB booster designs and tech
analysis, plus two so far failed rounds of commercializing
and capitalizing to get the full scale development program
going to build one. BDBs are great, until they succeed
wildly, and then RLVs will win in the long run.
I hope to help make that happen and make some money
in the decade or so that it takes for that to happen,
if I am lucky.
-george william herbert
gher...@retro.com
No, the point was that if the application has been, or could be, converted
to one supporting a streaming architecture, it would already be done for
existing (high-performance) hardware - precisely because such an approach
makes best use even of _existing_ hardware. Just what I'd been telling you
before, but at the moment, you're not quite willing to listen.
> We need to start learning, "It's okay to leave execution units idle,"
> in the sense that maximum exploitation of execution units is no longer
> the desiteratum of computing. Moving data around with the goal of
> maximizing use of execution units is the computational equivalent of
> selling stale french fries.
That has not been the desideratum (sic) for at least a decade. As I said,
look at the vector-like optimizations for the Athlon - they _purposely_
let the FUs run idle for two parts of the algorithm - out of three - in
order to keep computation and communication seperate. And the computation
part is already streaming as much as it can.
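For anyone who hasn't met it, the three-phase pattern being alluded to
looks roughly like the sketch below (a simplification: real code would use
block prefetch and streaming stores, and the block size is arbitrary):

#include <stddef.h>

#define BLOCK 1024

/* Blocked y = a*x: stream a block in, compute on it, stream it back out,
   so the load/store machinery and the FP units take turns rather than
   contending for the memory pipes. */
void scale_blocked(double *dst, const double *src, size_t n, double a)
{
    static double buf[BLOCK];
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t m = (n - i < BLOCK) ? n - i : BLOCK;
        for (size_t j = 0; j < m; j++)   /* phase 1: fetch (FP units idle) */
            buf[j] = src[i + j];
        for (size_t j = 0; j < m; j++)   /* phase 2: compute (memory idle) */
            buf[j] *= a;
        for (size_t j = 0; j < m; j++)   /* phase 3: write back            */
            dst[i + j] = buf[j];
    }
}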
Jan
Columbia's problem, to a large degree, was precisely that the tiles
were _not_ compromised. In fact, every post-accident test vindicated the
in-mission assessment that the analysis program had (massively) overestimated
the tile damage. _All_ burn-through problems - and there historically
were some before Columbia's last mission - were associated with the
wing leading edge, RCC and its support structure. The tile really _was_,
as Linda Ham said, "only a maintenance issue". But everybody had thought
about problems with the tiled areas, for various (good, historical)
reasons; very few people had considered possible problems with RCC, and
what little analysis had happened there was deeply flawed.
And, BTW, the wings were anything but fragile: the analysis of in-flight
data showed that Columbia's left wing kept flying until all that was left
was a sagging, hollow shell. You can find traces of the structural engineers'
surprise that it held on for so long even in the quite terse CAIB report.
Jan
<snip>
>
>The reason for the wings configuration is crossrange on re-entry.
>High hypersonic lift to drag. NASA wanted lower hypersonic L/D
>and lower crossrange; the DOD wanted higher crossrange so they could
>loft certain payloads into polar orbit and return to land in one
>orbit if they had to, and certain related mission issues.
>
>Fundamentally, again: if you're going to reuse the vehicle
>it has to land softly and to land softly it has to be
>able to land on specific points and at specific attitudes.
>Reusing a capsule which lands blunt end down on random
>soil may not be possible, even if you want to.
>Salt water is very hard on aerospace equipment.
>To reuse you need to either make very rugged equipment,
>or land it softly with wings or with rocket motors.
>
This isn't the place for a lengthy discussion, and I don't want to get
into an ugly flame war. The sensible thing to do would be to shut up
and to allow you just to keep repeating the elaborate web of nonsense
that has surrounded the shuttle program from the very beginning. I
just can't stand reading it and not responding because I've been
listening to the same arguments and thinking the same things for
decades:
1. The best way for the military to get its satellites into orbit is
with expendable launch vehicles. That's what they've wound up doing
for the most part, anyway.
2. The idea behind reuse was that it was somehow supposed to save
money. It hasn't, and in bucketsful. Your argument comes down to: if
we had spent even more money, we could have saved money safely. Reuse
is a MacGuffin (a plot device that holds the story together but that
isn't really central to the action, like the Maltese Falcon in the
Maltese Falcon).
3. There are three different missions: manned exploration of space,
putting satellites into orbit, and developing something like the
national aerospace plane. The three missions got jumbled up among
different agencies with different agenda, and I suspect that Bill is
correct on this point: the desire to develop cutting-edge technology
that would be useful in developing a military vehicle with
partial-orbit capabilities trumped everything else.
Since an aerospace plane with military capabilities was supposed to be
super-secret, the whole mess had to be sold some other way. The way
it was sold was cost savings, and the programs as presented didn't
even pass the laugh test: dozens of launches a year with vehicle
turnarounds on the order of weeks. If I'd had to present that kind of
BS with a straight face, I'd have had to go into the bathroom and puke
my guts out afterward.
The idea of manned exploration of space got lost in the shuffle. Or,
I should say, it got buried. Putting human beings on the moon or on
Mars captured the public imagination. Scientists argued (correctly)
that most of the scientific goals could be better satisfied with
unmanned deep space probes, and NASA has done a really good job of
pumping out colored photos as fast as it could.
What did capture and what still does capture the public imagination
was putting human beings out there, but neither NASA nor anyone else
had any serious intention of pursuing such an objective because it
didn't satisfy any urgent national priority and it cost too much, and
the public was never told that the mission that really excited it had
essentially been abandoned.
What we were left with was an ill-conceived, high-risk program that
did successfully advance one objective: the program developed a
considerable amount of cutting-edge technology that *would* be useful
for a partial-orbit vehicle with a military mission.
I've misrepresented myself in one respect: I said I *am* an aerospace
engineer. Not quite true, or at least misleading. I left the
business in disgust over a decade ago at seeing how things really
worked. Some people have the stomach for it, some don't. I didn't.
As to what we will get going forward, we shall see. If the congress
and NASA abandon winged vehicles and develop a capsule, feel free to
send me a gloating e-mail. Everything I can discern says it isn't
going to happen.
RM
>> >You *have* to program it away. If your program only knows how
>> >to do one thing to each piece of data, then it doesn't matter
>> >how "streaming" the architecture is.
>>
>> I have no idea what fraction of codes could be successfully rewritten
>> to exploit a streaming architecture, but if the architecture forces
>> you into the get it, do something, put it back mode, you can't program
>> that reality away. If you *do* have a streaming architecture
>> available, you still have to figure out how to take advantage of it.
>> If that's your point, of course I agree.
>
>No, the point was that if the application has been, or could be, converted
>to one supporting a streaming architecture, it would already be done for
>existing (high-performance) hardware - precisely because such an approach
>makes best use even of _existing_ hardware. Just what I'd been telling you
>before, but at the moment, you're not quite willing to listen.
>
I'm not quite that dense or that hardheaded. No one is going to do
the work required to reorganize code in a streaming fashion to gain
25% or 50% or whatever unless it *is* something like Linpack that gets
used so often that it's worth it. If you could pick up a factor even
of two (overall, not just in the parts of the code that can be
successfully streamed), the rules would change.
>> We need to start learning, "It's okay to leave execution units idle,"
>> in the sense that maximum exploitation of execution units is no longer
>> the desiteratum of computing. Moving data around with the goal of
>> maximizing use of execution units is the computational equivalent of
>> selling stale french fries.
>
>That has not been the desideratum (sic) for at least a decade. As I said,
>look at the vector-like optimizations for the Athlon - they _purposely_
>let the FUs run idle for two parts of the algorithm - out of three - in
>order to keep computation and communication seperate. And the computation
>part is already streaming as much as it can.
There has to be a certain amount of triage in everyone's day, and
Athlon hasn't made the cut. I have no illusion that I could
understand a microprocessor architecture well enough to get the point
you are trying to make in the time I am able to commit to it. Sorry.
It doesn't help that I've been presented with two apparently
contradictory arguments (Andrew Reilly):
AR>If HPC codes could be written to accommodate streaming
AR>architectures, then you'd already be seeing 100% FU utilisation
AR>on the existing commodity-processor-based MPP HPC systems. You
AR>can get (close to) that for some benchmarks (linpack), but that
AR>doesn't seem to be the general case. The problem is a software
AR>one (at least to a first order of approximation, and on todays
AR>hardware).
and you:
JV>look at the vector-like optimizations for the Athlon - they
JV>_purposely_ let the FUs run idle for two parts of the algorithm -
JV>out of three - in order to keep computation and communication
JV>separate [sic].
RM
You are entitled to your opinion, and I respectfully disagree with it.
If you haven't already done so, please read my post of 11/24 in
response to Patrick Schaaf, and the IBM document that is cited
therein:
www.research.ibm.com/journal/sj/402/allen.pdf
My conclusion:
RM>The problems they face are too numerous to discuss in a single web
RM>post, but the bottom line is that they will have to use up months,
RM>if not years, of computing time to get results using models that
RM>are at best educated guesses.
RM>
RM>If there is a parallel in the history of science, I am not aware of
RM>it. The US wants to build the world's biggest computer, IBM wants
RM>to build it for them, and both need a problem that justifies such
RM>an enormous expenditure of money and talent. The conclusion that
RM>they should reach, that the computational muscle available to them
RM>is not up to the task they have proposed, is one they are unwilling
RM>to reach.
If something is worth doing, it is worth doing right.
RM
>I/O is trivial; interrupts always were a bad way of doing it, and many
>older systems did it other ways. Consider, for example, I/O requests
>that don't block and I/O responses that are put into a queue for
>handling when the driver next gets around to it. FIFOs are not exactly
>unusual in hardware :-)
>
>Alternatively, you can go for a seriously multi-threaded approach and
>have all I/O operations blocking, so you need to spin off a new thread
>for each one. Not a major issue if you have enough threads available,
>but it needs a lot of hardware threads - which is perfectly possible
>on modern chips.
True, those would work.
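A minimal single-producer/single-consumer sketch of the non-blocking
request plus polled completion-FIFO style, in C; the names are invented
and the memory-ordering details a real SMP driver would need are waved
away:

#include <stdbool.h>
#include <stddef.h>

#define QLEN 64                          /* illustrative queue depth */

struct io_completion { int request_id; int status; };

/* The device (or a front-end processor) appends completions; the driver
 * drains them whenever it next gets around to it.  No interrupts.
 */
struct io_queue {
    struct io_completion slot[QLEN];
    volatile size_t head;                /* advanced by the producer */
    volatile size_t tail;                /* advanced by the consumer */
};

static bool io_queue_pop(struct io_queue *q, struct io_completion *out)
{
    if (q->tail == q->head)
        return false;                    /* nothing has completed yet */
    *out = q->slot[q->tail % QLEN];
    q->tail++;
    return true;
}

/* Called from the main loop between chunks of real work. */
static void driver_poll(struct io_queue *q)
{
    struct io_completion c;
    while (io_queue_pop(q, &c)) {
        /* handle c.request_id / c.status here */
    }
}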
>Multitasking was done without interrupts on quite a few of the earlier
>machines - what you mean is preemptive multitasking (i.e. not the
>coroutine approach). Each process yields control when it gets to
>suitable points, which are required to be no further apart than a
>certain time. Apple took that approach for a long time, and its
>only problem is that a failing program can block the hardware thread.
Yeah... that's a showstopper unfortunately, at least on a
general-purpose machine that's going to have to run a random
collection of programs, many of them badly written.
>My solution to that was an architectural requirement for a 'yield'
>instruction to be called every (say) M instructions or N memory
>references, whichever comes first, and to abort the process if it
>failed to do so. Dead easy to implement.
Hey, that's a good point. Just have a counter count down on each
instruction executed... that would work.
>So where's the problem? :-)
Okay, problem solved ^.^
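In software the countdown idea is trivial to mock up. Here is a toy C
sketch; a real machine would of course decrement a hidden counter in
instruction fetch, and all the names and numbers below are made up:

#include <stdio.h>
#include <stdlib.h>

#define M 1000                       /* illustrative yield interval */

static long countdown = M;

static void yield(void)              /* what the 'yield' instruction does */
{
    countdown = M;                   /* reset the budget */
    /* ...and give the scheduler a chance to run something else... */
}

static void step(void)               /* one simulated instruction */
{
    if (--countdown < 0) {
        fprintf(stderr, "failed to yield within %d steps: aborted\n", M);
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    for (long i = 0; i < 5000; i++) {
        step();                      /* "execute" an instruction */
        if (i % 500 == 0)
            yield();                 /* well-behaved code yields in time */
    }
    puts("ran to completion without tripping the watchdog");
    return 0;
}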
>Russell Wallace wrote:
>> How would you do I/O and pre-emptive multitasking without interrupts?
>
>A very simple front-end processor, like most mainframes used to employ?
>
>Most of communication would be polled.
Yeah, that works.
>I'd still like a hw method to do task switching though, and timer
>interrupts seem to work.
Nick came up with a solution for that.
>Likewise, it is sometimes nice to be able to say: "Hey, stop whatever
>you're doing, this stuff over here is _really_ critical."
Well, mostly "this stuff over here" is I/O in that case, so the I/O
system can handle it.
>If you offload all of this as well, your main cpu ends up as a fancy
>vector coprocessor, right?
That might be no bad thing; have the main CPU be designed for clean,
fast number crunching, and have something else do the messy stuff.
To get back towards the original topic, that seems to be the sort of
approach Sony's Cell architecture is taking, with an array of units
for number crunching and a separate unit for control and I/O.
Not contradictory at all - we both are saying that those (parts of)
algorithms that can be modified to suit a streaming architecture,
already run exceedingly well - basically at the full speed of the FUs
- on current hardware; so what is the point of developing streaming
hardware?
Jan
>Water immersion generally voids the warranty on most aerospace
>alloys. Jets that skid off the runway into the ocean, even if they
>just get lightly soaked, are junked. Same thing with capsules.
>Stainless steel may not be so subject to it, and Titanium isn't so
>bad, but most capsule structures are aluminum.
I thought aluminium was rustproof, and that this was the reason for
using it on some modern warships (despite the disadvantage that it
catches fire if it gets hot enough). Are you saying this is not the
case?
>I would have to disagree. My PhD was in protein structure prediction
>and now I work in genomics and statistical genetics. Revolutionary
>advances in these fields are going to come from new algorithms, not
>from running existing algorithms on faster computers. Faster
>computers sometimes do help, of course. But I think throwing money at
>the hardware side of the problem now is a waste, since we don't really
>know yet what the "right" algorithms will look like.
Nor have we any clue about whether we might find them next year, next
decade or next century. So by all means let's keep looking for better
algorithms, but in the meantime we have two options:
1) Let people keep dying by the millions from diseases like cancer,
AIDS and malaria that we might be able to find improved treatments for
if we know more about the molecular machinery involved.
2) Spend a few billion out of our civilization's vast economic surplus
cranking more speed out of the algorithms we do have.
Option 2 strikes me as vastly superior.
>Robert Myers <rmy...@rustuck.com> wrote:
>>gher...@gw.retro.com (George William Herbert) wrote:
>>>Bandwidth x distance is expensive. Lots of bandwidth over
>>>short distance is not very expensive. Lots of bandwidth across
>>>small distances on a modern chip is very very very cheap.
>>>
>>That's why streaming architectures are attractive, and the point that
>>everyone seems to be missing, or at least that no one bothers to
>>acknowledge, is that it is movement of data *on the chip* that is
>>expensive--in terms of power consumption. I summarized the argument
>>in another post here, and I'm not going to summarize it again. It's
>>in the sc2003 merrimac paper
>>
>>http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
>
>I am still not getting this.
>
>The criticism was relating to the Blue Gene box,
>which is not a general purpose system moving large
>amounts of data globally on-chip. It is a system with
>large quantities of very small CPUs with associated
>FPUs and local memory on chip, which don't need much
>global routing of data across the chip for well partitioned
>problems (and, poorly partitioned problems are bad matches
>for the sea-of-processors solution anyways).
>
First of all, I owe you an apology. You wanted to respond to two
separate points I had made in a single post. Since you had
significant things you wanted to say on both points and didn't want to
get the two arguments tangled up, you responded in two separate posts.
I found the tone of your first post so smug, presumptuous, and
patronizing, that I didn't give a fair reading to your second post.
Given that you were responding in your first post to a very
strongly-worded opinion with which you strongly disagreed, you had two
options: you could assume that I must have had some basis for forming
the very strong opinion that I stated and try to explore what those
reasons might be, or you could assume that I was an uninformed idiot
who needed to be lectured. You chose the latter, and I regard the
response that I managed to choke out as fairly measured. I regard
that as to my credit.
Unfortunately, I did not have the emotional wherewithal to calm down
and separate the two issues. You wrote to me as if I were a
lamebrain, and I childishly responded in kind to your second post.
That is not to my credit. Your second post made better points than I
gave you credit for.
>>>For highly partitionable problems, the cost-effective
>>>optimum partition size can be analyzed and modeled by
>>>looking at the costs of transmitting partition cell
>>>edge state to neighbors versus storing/calculating it
>>>locally. For given problems and chip technologies there
>>>are different optimizations.
>>>
>>>By putting a large number of tiny but moderately powerful
>>>CPU/Custom FP units per ASIC chip, they are getting very
>>>good average neighbor to neighbor bandwidth. Or at least
>>>can do so and presumably did. Going off-chip to the
>>>neighbors on a circuit board hurts, but again is subject
>>>to cost / technology optimization, along with the
>>>CPU capacity per unit and bandwidth internally...
>>>
>>You really do need to read the paper.
>
>The paper isn't addressing this at all.
>It's asserting in section 2 that global wires are
>a bad thing (or, more precisely; slow, high power
>consumption high area thing), which anyone who's
>looked at chip design knows to be a true statement.
>
I had to go back and read section 2 of the paper *again* to make sure
we were reading the same paper. If your intent is to give as little
credit to the insight of the authors and to minimize the importance of
what they are saying, you have done a very good job of it. If you
regard your "everyone knows" statement as the equivalent of section 2
of the paper, then perhaps I should just leave you in peace with that
belief.
Without knowing the detailed layout of the blue gene chips, it's hard
to make judgments. I jumped to the conclusion that an appropriate
model of the blue gene chip would be to think of it as an SMP system
on a chip, with global cache taking the role of main memory. If
that's the correct model, then you would be constantly paying the cost
of fetch-do_something-put_it_back over long distances, if, as I was
visualizing, you were doing streaming access to global cache--not a
bad strategy in terms of throughput, but a disaster from an energy
consumption point of view. As Bill Todd has pointed out, and as I
believe you are implying, more localized cache can change the numbers
dramatically. I acknowledged that in my response to him, but I also
pointed out that the argument eventually falls apart.
Your argument is a little more specific, and it is the kind of
argument that someone who has actually done these kinds of
calculations would make; viz, that the problems most often encountered
in supercomputing allow you to exploit data localization very
effectively to minimize global communication. The strategy would be
very much like cache blocking, except at a more local level of cache,
and we already know that cache-blocked algorithms can have very low
global bandwidth requirements. You make a very good point.
In fact, it is _such_ a good point that I need to spend some time
digging into the details of the blue gene architecture. It may well
be that there is not much to be gained from a streaming architecture,
which is what I believe you have been trying to say.
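To make the cache-blocking analogy concrete, here is the standard blocked
matrix multiply in C; the sizes are illustrative only. The point is that
each block of A and B is reused roughly NB times while it sits in nearby
storage, so traffic to distant memory per flop drops by about a factor
of NB compared with the naive triple loop:

#define N  512                  /* problem size (illustrative) */
#define NB 64                   /* block chosen to fit the local store/cache */

/* Blocked C += A*B. */
static void matmul_blocked(double C[N][N],
                           const double A[N][N], const double B[N][N])
{
    for (int ii = 0; ii < N; ii += NB)
        for (int jj = 0; jj < N; jj += NB)
            for (int kk = 0; kk < N; kk += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int j = jj; j < jj + NB; j++) {
                        double s = C[i][j];
                        for (int k = kk; k < kk + NB; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}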
>>>This is not hard. This is actually computer architecture
>>>childs play to rough sketch out. What's hard is finding
>>>and formulating problems that play well with such architectures.
>>>I have been told that proteins and 3-D FEM are very amenable
>>>to such work, by people doing protein work commercially
>>>and based on hints coming out of the places doing nuclear
>>>weapons simulation (or so I think... they like being semi-
>>>opaque...).
>>>
>>The protein folding stuff may all just be misdirection.
>
>Some people are paying Big Money to do protein folding work,
>regardless of what the nuclear weapons labs are buying.
>
Yes they are. And I don't want human proteins to be patented by those
Big Money people.
RM
Aluminium forms a thin skin of alumina (the oxide) when exposed
to oxygen, or oxygen and water, and thus is rustproof in fresh
water. However, it is VERY badly attacked by salt water, and
sailing boats etc. that use aluminium spars protect them by
anodisation (however that works). Even with that protection,
which may or may not be applicable to aircraft, one of its normal
failure modes is through corrosion.
So the rough answer is "not in salt water, it isn't".
Regards,
Nick Maclaren.
> You are entitled to your opinion, and I respectfully disagree with it.
> If you haven't already done so, please read my post of 11/24 in
> response to Patrick Schaaf, and the IBM document that is cited
> therein:
> www.research.ibm.com/journal/sj/402/allen.pdf
> My conclusion:
> RM>The problems they face are too numerous to discuss in a single web
> RM>post, but the bottom line is that they will have to use up months,
> RM>if not years, of computing time to get results using models that
> RM>are at best educated guesses.
Well yes I read this. The IBM paper is not a bad high level overview.
And I agree with this part of your analysis: after all, this is
exactly what researchers in the field do now. They develop models,
test them using months of computing time, then try to come up with
better models. I did some of that too.
> RM>If there is a parallel in the history of science, I am not aware of
> RM>it. The US wants to build the world's biggest computer, IBM wants
> RM>to build it for them, and both need a problem that justifies such
> RM>an enormous expenditure of money and talent. The conclusion that
> RM>they should reach, that the computational muscle available to them
> RM>is not up to the task they have proposed, is one they are unwilling
> RM>to reach.
It is this part where I think you're wrong. Blue Gene will be used to
do some good science, I'm sure, and it will be able to do some things
that are prohibitively time consuming on other platforms. I would
also expect that most of its cycles will be spent doing things that in
retrospect will be seen as a waste of time. We seem to agree it won't
be revolutionary. You think this argues for revolutionary approaches
on the hardware side. I think people will eventually become smarter
about formulating the problem. Progress on the hardware side is
already fast enough and I see no urgent need to throw more resources
at that side of the problem.
-- Dave
>> As it is, Mr. Myers (again, no relation) is being very kind about
>> those fragile wings. You can put more engineering effort into the
>> foam insulation on the external fuel tank, but shuttle tiles have
>> fallen off in previous flights, and they will continue to fall off if
>> the current shuttle thermal management system continues to be used.
>
>Columbia's problem, to a large degree, was precisely that the tiles
>were _not_ compromised. In fact, every test vindicated the assessment
>made during the mission that the analysis program (massively)
>overestimated tile damage. _All_ burn-through problems - and there historically
>were some before Columbia's last mission - were associated with the
>wing leading edge, RCC and its support structure. The tile really _was_,
>as Linda Ham said, "only a maintenance issue".
Somehow I don't think you saw the photo in the New York Times that
showed a large section of wing with missing tiles (a dozen, maybe?)
and exposed aluminum from a previous shuttle flight. No, it didn't
burn through, but I wouldn't call it "only a maintenance issue,"
either. Especially not if I were to be lofted into orbit and expected to
go through reentry in such a contraption. The fact that they probably
were not even a contributing factor in the most recent episode doesn't
prove anything.
RM
> Nor have we any clue about whether we might find them next year, next
> decade or next century. So by all means let's keep looking for better
> algorithms, but in the meantime we have two options:
> 1) Let people keep dying by the millions from diseases like cancer,
> AIDS and malaria that we might be able to find improved treatments for
> if we know more about the molecular machinery involved.
> 2) Spend a few billion out of our civilization's vast economic surplus
> cranking more speed out of the algorithms we do have.
> Option 2 strikes me as vastly superior.
I don't think there are any current issues in cancer, AIDS, or malaria
treatment that would be solvable by throwing more cycles at these
problems. These are not fundamentally computational problems, or, at
least, our understanding is not currently limited by our computational
resources. A few billion dollars well spent on AIDS and malaria
treatment or even on basic research might well save millions of lives.
Spending that money on a faster computer, today, is not going to save
anyone.
-- Dave
I don't think that makes sense; you can't crank out enough more speed
to do interesting things, without the algorithms being better, and it's
pretty difficult even to imagine the scale of the algorithm development
that can be done in ten thousand researcher-years.
But I'm not sure, if you're working at that scale, that you don't
start off by founding hundreds of specialist schools and working up
from there -- consider what the Apollo program did to US society in
general.
If you want to throw money at the problem *now*, you go to Dell, to
Cisco and to Bechtel, buying hundreds of thousands of Intel PCs,
thousands of huge switches and the large-scale building and power
station to house and to power them; you'd be seeing results inside six
months, but you're probably right that a 2^18-system Beowulf is not
going to be even close to 30 times faster than the 2^13-system
Beowulves we have nowadays.
If you want to spend enormous sums on research to build a better
system, knowing that you won't have the chips taped out before the end
of 2005 or fabbed before the end of 2006, go ahead -- but you're not
doing medicine, you're doing computer science, and it's not at all
clear to me that you'd not be better off endowing research groups in
molecular dynamics, and (if compute time is the issue) equipping them
with things like the Big Mac cluster.
Spending money like water on a single piece of equipment gets you some
reasonably-predictable way in a single well-understood direction;
spending money like water on researchers is going to take you long
distances in a whole weird distribution of directions, and, making
some probably-valid assumptions about the honesty and integrity of
researchers, is likely to get you at least a few interesting
breakthroughs.
Tom
Thanks for weighing in with an actual, informed opinion that substantiates
my own WAGs. The point that the gung-ho hardware advocates seem to miss
completely is that it's not nearly sufficient to point out that a
significant advance could be made: you have to justify that advance in
terms of its being the best use of limited resources rather than putting
said resources to some other use - and in that context it really doesn't
matter that the resources in question are vast, only that they are limited
with respect to the potential uses that they can be put to.
It is certainly possible that new hardware might open up unforeseen
opportunities as well as those currently on the table. But that too can be
true of other advances to which the resources might be applied. While
feeling that one's own small corner of the universe is of critical
importance is understandable in human terms, it doesn't make for good policy
decisions (witness the support by Democratic farm-state senators of the
recent abomination of an energy bill, for example - or don't, if that will
severely side-track what has so far been an interesting and at least
moderately focused discussion).
- bill
>If you want to spend enormous sums on research to build a better
>system, knowing that you won't have the chips taped out before the end
>of 2005 or fabbed before the end of 2006, go ahead -- but you're not
>doing medicine, you're doing computer science, and it's not at all
>clear to me that you'd not be better off endowing research groups in
>molecular dynamics, and (if compute time is the issue) equipping them
>with things like the Big Mac cluster.
I'm no expert in molecular biology, but I've seen people who are,
claim that adding a couple more orders of magnitude in computing power
will indeed be scientifically useful...
>Spending money like water on a single piece of equipment gets you some
>reasonably-predictable way in a single well-understood direction;
>spending money like water on researchers is going to take you long
>distances in a whole weird distribution of directions, and, making
>some probably-valid assumptions about the honesty and integrity of
>researchers, is likely to get you at least a few interesting
>breakthroughs.
But yes, I think the approach you suggest would _also_ be worth doing.
>Aluminium forms a thin skin of alumina (the oxide) when exposed
>to oxygen, or oxygen and water, and thus is rustproof in fresh
>water. However, it is VERY badly attacked by salt water, and
>sailing boats etc. that use aluminium spars protect them by
>anodisation (however that works). Even with that protection,
>which may or may not be applicable to aircraft, one of its normal
>failure modes is through corrosion.
Ah! Interesting.
Any idea why they used it for warships, then?
Also, if you know an aluminium object is going to be immersed in salt
water, why doesn't a layer of waterproof paint solve the problem?
While you are basically correct, it isn't quite as cut and dried as
that. The real use of lots of cycles is to model what is happening
in a more physically realistic way (such as including the presence
of salt ions in the water!) While this will remain unreliable, and
will not lead directly to any benefit, it may lead to understanding,
which may then lead to such benefit.
Regards,
Nick Maclaren.
It's 3 times lighter than steel, so you can put much more junk
on the ship and not have it turn turtle the first time there is
a bit of a blow.
>Also, if you know an aluminium object is going to be immersed in salt
>water, why doesn't a layer of waterproof paint solve the problem?
It chips and cracks in use, especially near moving parts.
Regards,
Nick Maclaren.
There is simply no way to foresee at what point the application of
immense computing power will revolutionize molecular biology. It
could come after the next order of magnitude increase in available
power (possible, but unlikely) or after the next six orders of
magnitude (also possible, but unlikely). It probably won't be a
discrete event, but it _will_ happen...
With one modest caveat. If the human race survives that long. Or,
let us suppose that the human race per se is that resilient, but
posit the caveat as: if civilization survives that long.
Let's see. How many things will I be accused of being? Hysterical.
Alarmist. Hyperbolic. What have I left out?
There are some things about infectious diseases we already know:
1. The infrastructure of vast portions of an entire continent,
Africa, is on the verge of collapse on a scale that will make any
other modern human catastrophe pale by comparison because of HIV. The
glowing economic future of China may not ever materialize because of
HIV.
2. Infectious diseases arise as if out of nowhere and because of
modern transportation girdle the globe in a matter of days. It is
astonishing that HIV hasn't transformed itself into something more
durable, virulent and deadly because of the frequency with which it
mutates and reproduces itself. It seems as if the SARS epidemic was
an exceedingly close call.
3. It is clear that infectious diseases can be used as weapons of mass
destruction and that people have made preparations for using them in
that way.
4. There are now bacterial infections that are incurable because there
are strains of bacteria that are resistant to all known drug
therapies. It is not impossible to imagine reaching the point that we
will simply not be able to operate hospitals because they are such
effective repositories for such bacterial strains.
While I don't want to turn this into a tussle with Mr. Hinds or anyone
else about what is and is not possible, the fact is that we know the
ab initio equations that apply, and we know how to solve them. The
kind of grimy guesswork that Mr. Hinds is familiar with is
necessitated not by any lack of fundamental understanding, but by a
lack of raw computing power.
Yes, you have to make choices. If you spend money on very fast
computers and on very basic research, that is money that will not be
available to be spent on front-line clinical research that probably
will save lives against known threats faster.
In fact, the most effective ways to slow the spread of the AIDS virus
have very little, if anything, to do with _any_ kind of cutting-edge
research or indeed even with ordinary clinical medicine.
The irony is that, as Mr. Herbert has pointed out in a different post,
there are people who understand the immense, almost unimaginable,
opportunities for profit that exist and who are willing to venture
large sums of money in the hope of capturing some of that potential
profit. Those with great faith in free markets will see it as a sign
that the problems I have mentioned should be left to the inventiveness
of free markets.
I have great faith in free markets, but there are some problems that
cannot be left to them, and public health is one of them.
It is possible to pour *so* much money into a given area of research
that it becomes like pushing on the end of a rope. Research into HIV
has sometimes seemed like that because research on that virus diverted
a finite pool of competent talent away from other important problems
like cancer. Because of the vast explosion in infrastructure that has
arisen, partly in response to the AIDS epidemic, I do not believe that
such a thing is currently happening.
The opposite circumstance is manifestly the case in semiconductor
physics, computer architecture, computer science, and related fields.
There are more competent bodies than there are jobs. Let's put that
vast pool of talent to work.
I will now go to a safe place and wait for the incoming artillery.
RM
To go that little bit further:
What is the point of developing streaming hardware that is *known* not
to be able to support any of the existing applications in full, because
they all require operations that don't look like streaming? As a
precursor to streaming hardware, you need streaming software. That
streaming software *could* be written now, if it were possible, and it
would run on the current crop of MPP hardware (significantly) faster than
whatever the software is doing at the moment, because it would put lower
pressure on the communications links.
Robert seems to be arguing: build the hardware anyway, and the software
will have to follow. Sorry, but that's how to build a white elephant.
Let's look from a different angle:
Posit a reconfigurable systolic array: a streaming architecture for want
of a better name. A sea of functional units with little or no storage
between them. To be sufficiently configurable to do a useful variety of
things, you want each of these FUs to be able to do a variety of different
things; otherwise you have to pay bulk wiring costs building connections
between function-specific FUs. In fact, you get benefit from allowing
each FU to do several things to each data item before it is passed on, as
you said. To allow an FU to do several things in sequence, you give it an
instruction set and a program. To let it combine several streams and
iterate over several operations, you give it some scratch-pad RAM. Sounds
like Blue Gene (or any of the several MPP-on-a-chip designs that have been
posited and built before). This is all just good engineering design, *if*
you have code that can support it. Look at the current crop of processor
chips that go into mobile phone base stations. Some of them look exactly
like that.
--
Andrew
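A toy software model of that shape, purely for illustration; nothing here
corresponds to Blue Gene or to any real base-station chip:

#include <stddef.h>

#define SCRATCH 16          /* illustrative per-stage scratch-pad size */

/* Chain-of-programmable-FUs model: each stage has a small private
 * scratch-pad and a "program", and data items stream through the chain
 * one after another.  What is being modelled is only the shape:
 * neighbour-to-neighbour hand-off, no shared global store.
 */
struct stage {
    double scratch[SCRATCH];
    double (*program)(struct stage *self, double in);
};

static double run_chain(struct stage *chain, size_t nstages, double item)
{
    for (size_t s = 0; s < nstages; s++)
        item = chain[s].program(&chain[s], item);   /* pass to neighbour */
    return item;
}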
One could point out that HIV can be controlled by behaviour
modifications or draconian societal measures more easily than by
scientific miracles.
>
> 2. Infectious diseases arise as if out of nowhere and because of
> modern transportation girdle the globe in a matter of days. It is
> astonishing that HIV hasn't transformed itself into something more
> durable, virulent and deadly because of the frequency with which it
> mutates and reproduces itself. It seems as if the SARS epidemic was
> an exceedingly close call.
Maybe we need more isolationism rather than computers built out of
unobtainium.
>
> 3. It is clear that infectious diseases can be used as weapons of mass
> destruction and that people have made preparations for using them in
> that way.
>
> 4. There are now bacterial infections that are incurable because there
> are strains of bacteria that are resistant to all known drug
> therapies. It is not impossible to imagine reaching the point that we
> will simply not be able to operate hospitals because they are such
> effective repositories for such bacterial strains.
Hospitals can be operated. Today's hospitals are really surprisingly
primitive, compared to what folks know how to do. Compare the
cleanliness of an operating room to the cleanliness of a semiconductor
plant. Do operating rooms these days have laminar airflow with
HEPA filters? If they don't they ought to.
>
> While I don't want to turn this into a tussle with Mr. Hinds or anyone
> else about what is and is not possible, the fact is that we know the
> ab initio equations that apply, and we know how to solve them. The
> kind of grimy guesswork that Mr. Hinds is familiar with is
> necessitated not by any lack of fundamental understanding, but by a
> lack of raw computing power.
We don't usually design circuits by solving Maxwell's equations.
>
> Yes, you have to make choices. If you spend money on very fast
> computers and on very basic research, that is money that will not be
> available to be spent on front-line clinical research that probably
> will save lives against known threats faster.
Interesting question, whether to spend the money on projects with
immediate payoff or go for the home run. Cancer research is analogous.
>
> In fact, the most effective ways to slow the spread of the AIDS virus
> have very little, if anything, to do with _any_ kind of cutting-edge
> research or indeed even with ordinary clinical medicine.
>
> The irony is that, as Mr. Herbert has pointed out in a different post,
> there are people who understand the immense, almost unimaginable,
> opportunities for profit that exist and who are willing to venture
> large sums of money in the hope of capturing some of that potential
> profit. Those with great faith in free markets will see it as a sign
> that the problems I have mentioned should be left to the inventiveness
> of free markets.
Until the people demand price controls. As it is, it seems as if the
Americans and to some extent the Europeans are funding these advances.
>
> I have great faith in free markets, but there are some problems that
> cannot be left to them, and public health is one of them.
>
> It is possible to pour *so* much money into a given area of research
> that it becomes like pushing on the end of a rope. Research into HIV
> has sometimes seemed like that because research on that virus diverted
> a finite pool of competent talent away from other important problems
> like cancer. Because of the vast explosion in infrastructure that has
> arisen, partly in response to the AIDS epidemic, I do not believe that
> such a thing is currently happening.
>
> The opposite circumstance is manifestly the case in semiconductor
> physics, computer architecture, computer science, and related fields.
> There are more competent bodies than there are jobs. Let's put that
> vast pool of talent to work.
OK, let's see some evidence that these radical architectural concepts
will actually work better on the problems people are interested in. I'm
an agnostic about whizbang new architectures.
>
> I will now go to a safe place and wait for the incoming artillery.
Not from me dood. If you can show it works better you ought to be able
to get it funded.
>
> RM
IBM used to have bunches of people doing stuff with no other purpose
than to "advance science" or get their names and papers published. The
rest of us carried them on our backs. Those ego trips are over. Now
the research division's work has to have Some Relation to the goals of
the corporation. No more studying stellar evolution or other
non-business-related topics.
Zurich Research is still there. So far as I know Almaden is still
there. The disk business not so good, so San Jose is now mostly
Hitachi.
boo hoo.
> ...?
>
> Jan
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:1cbfsvgoqfkb995c4...@4ax.com...
<snip>
>>
>> There are some things about infectious diseases we already know:
>>
>> 1. The infrastructure of vast portions of an entire continent,
>> Africa, is on the verge of collapse on a scale that will make any
>> other modern human catastrophe pale by comparison because of HIV. The
>> glowing economic future of China may not ever materialize because of
>> HIV.
>
>One could point out that HIV can be controlled by behaviour
>modifications or draconian societal measures more easily than by
>scientific miracles.
>
Would you want to be a part of such a society? Would you want to be
part of a society that tried to impose such measures on other
societies?
>>
>> 2. Infectious diseases arise as if out of nowhere and because of
>> modern transportation girdle the globe in a matter of days. It is
>> astonishing that HIV hasn't transformed itself into something more
>> durable, virulent and deadly because of the frequency with which it
>> mutates and reproduces itself. It seems as if the SARS epidemic was
>> an exceedingly close call.
>
>Maybe we need more isolationism rather than computers built out of
>unobtainium.
>
If such a thing comes to pass, it will be because we are already on
the brink of a "Blade Runner" future. Do you want to let things go
that far?
<snip>
>>
>> 4. There are now bacterial infections that are incurable because there
>> are strains of bacteria that are resistant to all known drug
>> therapies. It is not impossible to imagine reaching the point that we
>> will simply not be able to operate hospitals because they are such
>> effective repositories for such bacterial strains.
>
>Hospitals can be operated. Today's hospitals are really surprisingly
>primitive, compared to what folks know how to do.
Yes they are.
>Compare the
>cleanliness of an operating room to the cleanliness of a semiconductor
>plant. Do operating rooms these days have laminar airflow with
>HEPA filters? If they don't they ought to.
>
Fixing operating rooms alone won't do it. You've been in a hospital
recently--a good one. Unless it was unlike any hospital I've ever
seen, can you imagine turning the chaos that reigns in such a place
into something resembling a semiconductor plant?
Medicine has become dangerously reliant on antibiotics to fix its
mistakes, and science is inventing new tricks more slowly than
bacteria are learning how to defeat them. I agree that that's a good
argument for taking steps to become less reliant on antibiotics, but I
think some of the things you are proposing are almost as unrealistic
as building computers out of unobtainium.
>>
>> While I don't want to turn this into a tussle with Mr. Hinds or anyone
>> else about what is and is not possible, the fact is that we know the
>> ab initio equations that apply, and we know how to solve them. The
>> kind of grimy guesswork that Mr. Hinds is familiar with is
>> necessitated not by any lack of fundamental understanding, but by a
>> lack of raw computing power.
>
>We don't usually design circuits by solving Maxwell's equations.
>
The basic conservation equations are as solidly rooted in Maxwell's
equations as anything possibly could be.
I believe that first principles calculations of things like
resistivity and dielectric constants are possible. Once you adjoin
Ohm's law and allow for a measured dielectric constant, you've got
passive electronics well in hand.
Just from a count of the number of atoms involved, simulating a
single, realistic FET from first principles would be, just at the
moment, even more difficult than first principles simulations of
simple human proteins. For active devices, you have to resort to
modeling.
There is another important difference. While doing ab initio solid
state physics might take some of the trial and error out of device
design, the tiny little differences from one transistor to the next
don't matter. It's a good thing they don't matter because they're not
reproducible.
In the case of drugs and proteins, by contrast, details at the atomic
level do matter and are reproducible.
<snip>
>>
>> The opposite circumstance is manifestly the case in semiconductor
>> physics, computer architecture, computer science, and related fields.
>> There are more competent bodies than there are jobs. Let's put that
>> vast pool of talent to work.
>
>OK, let's see some evidence that these radical architectural concepts
>will actually work better on the problems people are interested in. I'm
>an agnostic about whizbang new architectures.
That's a fair challenge.
RM
So consider a machine which, every M cycles, inserts a "yield"
instruction into the instruction stream. This happens in instruction
fetch, and there is no PC associated with the yield.
Now you don't have to worry about proving that code paths have length
M or less and so forth. My guess is that compilers would end up forced
to stick yield instructions into nearly every basic block, which would
be a bummer.
Nick> TLB misses are similarly easy to deal with
I disagree.
I'd like to handle TLB misses by having an SMT-like core launch a little
thread which loads the necessary entry into the TLB. The idea here is
that the main thread can continue, and TLB misses can be parallelized to
some extent. If/when the TLB miss escalates into a page miss or
something else which requires rescheduling the main thread, the
threadlet serializes with any other threadlets, then shuts down the main
thread and jumps into the O/S to handle the resynchronization.
(That "serialize with other threadlets" gets sticky.)
I do like the idea of building FPUs that can handle denorms and Inf and
so forth in hardware with no scheduling change, just because these would
wipe out a major category of performance instability and also some
instruction scheduling headaches.
Nick> and machine checks are separated into 'process abort' ones and
Nick> ones fixed up in hardware and queued to a separate thread for
Nick> logging.
How big is the queue? What happens when it overflows? Interrupts are
pretty nice for all these things where hardware provides finite
resources and software virtualizes them to appear infinite.
> >>
> >> 2. Infectious diseases arise as if out of nowhere and because of
> >> modern transportation girdle the globe in a matter of days. It is
> >> astonishing that HIV hasn't transformed itself into something more
> >> durable, virulent and deadly because of the frequency with which it
> >> mutates and reproduces itself. It seems as if the SARS epidemic
was
> >> an exceedingly close call.
> >
> >Maybe we need more isolationism rather than computers built out of
> >unobtainium.
> >
>
> If such a thing comes to pass, it will be because we are already on
> the brink of a "Blade Runner" future. Do you want to let things go
> that far?
Blade Runner was not about society protecting itself from external
disease threats.
Totally different situation.
Nice try, however. How do You Think the western societies will react
after the first epidemic that kills a million or ten million people in a
short period of time?
This concludes my contributions to this particular subthread. It's
interesting but sort of off topic.
del
HIV, at least in its current form, is not a threat to society in
developed countries. Smallpox was a disease with obvious symptoms
that ran its course in a very short time. The only way to identify
otherwise apparently healthy carriers of HIV is through a blood test.
No country that I can think of, with the possible exception of China,
has the means or the will to carry through the kind of program of
identification and segregation of carriers that would even make a dent
in the epidemic.
As to my predicting the collapse of society, let's please be careful.
In certain parts of sub-Saharan Africa, it has already happened.
Forty million infected worldwide is a big number, but that's out of
six billion human beings, or 2/3 of one percent of the human race. It
is the maldistribution of those infection rates that raises the
possibility of societal calamity. It is easy to find adult
infection rates of 25% in some countries in Africa, and I have heard
infection rates claimed as high as 40% among adults aged 15-49.
>> >>
>> >> 2. Infectious diseases arise as if out of nowhere and because of
>> >> modern transportation girdle the globe in a matter of days. It is
>> >> astonishing that HIV hasn't transformed itself into something more
>> >> durable, virulent and deadly because of the frequency with which it
>> >> mutates and reproduces itself. It seems as if the SARS epidemic
>was
>> >> an exceedingly close call.
>> >
>> >Maybe we need more isolationism rather than computers built out of
>> >unobtainium.
>> >
>>
>> If such a thing comes to pass, it will be because we are already on
>> the brink of a "Blade Runner" future. Do you want to let things go
>> that far?
>Blade Runner was not about society protecting itself from external
>disease threats.
>Totally different situation.
>Nice try, however. How do You Think the western societies will react
>after the first epidemic that kills a million or ten million people in a
>short period of time?
>
It might make a good made-for-TV movie. In fact, I'm kind of
surprised that no one has. If it is not too late (as it might be if
smallpox *were* suddenly introduced into the population), I agree that
otherwise unthinkable restrictions on civil liberties would be the
likely result and that developed countries would be likely to survive
such an onslaught, although possibly with a massive loss of life.
>This concludes my contributions to this particular subthread. It's
>interesting but sort of off topic.
>
When the Cray-1 came out, people started talking crazy, and I happened
to be in the midst of two flavors of that craziness: one was about
what was possible in terms of CFD, the other was about what was
possible in terms of computer-generated imagery. I was, in fact,
involved in both, and there is a movie out there that has some of my
fluid mechanics in it. As overblown as the claims as to what was
possible seemed to be at the time (and they were wildly overblown with
a mere Cray-1 at anyone's disposal), all but the most extravagant
claims have largely been met. Things that I thought were just plain
nutty to predict have come to pass.
I'm trying to get a different kind of crazy talk started, in the hope
that it, too, might come to pass.
RM
Ahhh... Sutherland's "Wheel of Reincarnation" strikes again... ;-} ;-}
In 1972, at Digital Communications Associates, I designed the kernel
for a realtime embedded operating system for a networking node which was
based on *precisely* that notion: The system ran with interrupts *OFF*,
and the programmers were required to insert "@YIELD" macros[1] every
so often[2] in the code. The "@YIELD" macro was logically a no-op[3]
unless some I/O event had occurred that needed to run a task of higher
priority than the current task. Worked like a charm!!
Despite the fact that the @YIELDs were inserted manually, since the rules
were very simple[2] and compliance was easy to verify by inspection, we
had almost no bugs due to misplaced @YIELDs. [Lots of *other* bugs, but
not those.]
Actually, inserting the @YIELDs manually provided a significant coding
advantage: one was guaranteed that an interrupt *wouldn't* occur anywhere
else, and thus one didn't have to worry about critical sections when
manipulating shared data or using extended CPU register state (e.g.,
the "MQ" register, etc.) -- everything between sucessive calls to @YIELD
was a critical section. Conversely, @YIELD didn't have to save/restore
anything but the PC and the VM page table base register[5], so it ran
faster, too.
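A rough modern-C analogue of the same discipline, purely as a sketch; all
the names below are invented here, and the actual task-switch machinery
is waved away:

#include <stdbool.h>

/* The spirit of @YIELD: a near-free check in the common case, a task
 * switch only when the polled I/O side has flagged work.  On a single
 * CPU, everything between two yield() calls is an implicit critical
 * section, so shared data needs no other locking.
 */
static volatile bool event_pending = false;  /* set while polling devices */

static void run_highest_priority_task(void)
{
    /* save minimal state, switch tasks, return when this task resumes */
}

static inline void yield(void)
{
    if (event_pending) {              /* rare case: something needs service */
        event_pending = false;
        run_highest_priority_task();
    }                                 /* common case: fall straight through */
}

static void copy_string(char *dst, const char *src)
{
    do {
        yield();                      /* at least one yield per loop trip */
    } while ((*dst++ = *src++) != '\0');
}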
The resulting product had *great* performance (for the time), actually
quite a bit faster than DEC's own PDP-11-based network processor... ;-}
-Rob
[1] The platform was a DEC PDP-8. The code was written in assembler,
with a pre-pass of the code through the "8BAL" macro processor.
[8BAL macros started with "@", hence the "@YIELD" above.]
[2] The system was handling a large number of asynchronous and synchronous
serial ports without DMA, which required character-at-a-time service
within a small latency to avoid overflowing (or for output, underrunning)
some fairly small FIFOs, so we chose 200 cycles as the absolute maximum
between @YIELDs. Also, to avoid accidents, we also required that any
loop should contain at least one @YIELD in its body (even short ones).
[On a PDP-8, instructions (other than I/O) would take 1-5 cycles,
depending on mem-ref or not, indirection or not, and RMW or not
(all known at the time of coding). The coding style was such that
subroutines were usually less than ~50 instructions anyway. So the
rule of "at least one @YIELD per 200 cycles or loop, whichever comes
first" was easy to follow.]
[3] Well, actually, since on the PDP-8 enabling/disabling interrupts was
a *lot* cheaper than even calling a simple subroutine -- not to mention
polling several I/O devices, the @YIELD macro actually expanded into
the following three instructions:
        ION     ; turn interrupts on
        CLA     ; Clear the AC (delay slot for ION)
        IOF     ; turn interrupts off again
which had the side-effect of clearing the accumulator. But that
was o.k., since one needed quite a few CLAs anyway in PDP-8 code,
so @YIELD was documented to do that. If no new interrupt request
had been asserted since the last @YIELD, the above took only 3 cycles,
which was the same as a single "ISZ <mem_loc>".[4]
[4] "Increment memory location and Skip next instruction if result is Zero".
[5] Oh, yeah, we also built an add-on board for the PDP-8 that added a
*small* amount of virtual memory to the system (8 pages of 8 words
each), which was used to alleviate the problem of having 15 bits of
address space on a machine with only 12-bit pointers (and no index
registers)! But that's a story for another time...
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
RW> ... Also, to avoid accidents, we also required that any
RW> loop should contain at least one @YIELD in its body (even
RW> short ones). ...
So, say, a linear search would go something like test,yield,nextptr,
deref,test,... ? How expensive was the NOP case?
BM
See my parallel reply to Nick, where I point out that there are *advantages*
to manual (or, for compiled code, at least explicit) placement of the
yield ops, namely: (1) Code between yields forms an implicit critical
section, thus requiring no other synchronization primitives for inter-task
communication (on a single CPU, at least); (2) the protocol between the
user code and the "scheduler" (what runs when a "yield" needs to actually
do something) can be designed to minimize the state which must be saved
and restored across a yield, speeding up both the user code and the yield
handler.
Having the yield be done at random places in the instruction stream
(random from the point of view of the user or the compiler) destroys
both of these advantages.
<aside>
I can't count the number of times I've wanted the converse of a "yield"
for Unix user processes, that is, a *cheap* way [cheaper than a system
call] to tell the kernel "don't interrupt or reschedule me for the next
few XXX microseconds". [To avoid bugs or DOS attacks, of course, that
operation would have to explicitly *allow* an interrupt/reschedule at
the moment it was executed, of course.]
</aside>
+---------------
| Now you don't have to worry about proving that code paths have length
| M or less and so forth.
+---------------
See my parallel reply to Nick. It's not so bad, for many applications.
In fact, there's quite a bit of existing published literature about
deferring stack overflow checks and deferring garbage collection [and
maybe also automatic profiling or "metering"] that would probably be
applicable to the "yield" problem as well.
+---------------
| My guess is that compilers would end up forced to stick yield
| instructions into nearly every basic block...
+---------------
Not necessarily -- only those blocks which end with a *backwards*
branch. For branches that implement an N-way fork/merge flow path,
the compiler can just keep track of the maximum duration of any
fork, for example. [But also see the above-mentioned papers on
deferring other kinds of tests, which cover the case when one fork
is much more expensive than the others.]
But even if you did have to...
+---------------
| which would be a bummer.
+---------------
Not necessarily, depending on the cost of the "yield" op. In the case
I cited in my parallel reply, the cost was the same as an ordinary integer
operation.
-Rob
Would you settle for a TV series? The Brit "The Survivors", which
regularly runs (in syndication?) on PBS. 99+% fatal worldwide
epidemic. Dunno when the series was originally filmed.
Yes, basically. Say you're trying to find an integer value in a
zero-terminated vector (C's "strchr()" with 12-bit chars, approx.) --
the inner loop might look like this:
; "nval" = negative (2's compl.) of value being searched for,
; "ptr" = address of vector minus 1.
loop:   @yield          ; [assert: AC == 0 afterwards.]
        isz ptr         ; step to next [assert: ptr != 0 ]
        tad i ptr       ; pick up item
        sna             ; skip non-zero
        jmp not_found   ; zero is end-of-list
        tad nval        ; add minus value-to-test
        sza             ; skip if zero (equality)
        jmp loop        ; nope, try next item
        ...             ; Found! "ptr" contains location.
+---------------
| How expensive was the NOP case?
+---------------
Cheap. Only 3 cycles, same as a single ISZ or a TAD_I or a JMS_I
(except that a TAD indirect of an auto-index location would be 4,
and on some versions of the PDP-8 a JMS_I is 4, too). The above code
takes 14 cycles per failing trip through the loop, so the @YIELD is
just over 20% in this case, but that's artificially-simple code --
unwinding the loop just once would drop the @YIELD overhead in half
(as well as get rid of one jump).
In real production code, the size of the @YIELD (3 words) was more
of an issue than the speed. [PDP-8 programmers tended to be maniacs
about code size...]
-Rob
It apparently depends greatly on the aluminum alloy that is used. Many
aluminum boats are used in salt water. They are not anodized. Neither
are the aluminum masts. The problem with airplanes is that they were
not designed for nor tested to withstand salt water immersion. And it
is probably really hard to wash the salt off to make sure nothing bad
happens 10 years down the road.
The lovely folks on rec.boats will be happy to enlighten you all.
del cecchi
At least in the US Navy, the use of aluminum as a major structural
element on warships was abandoned years ago, largely as a result of the
British experience in the Falklands, IIRC. (when a warship is hit, it's
almost inevitably going to have major fires, and when much of the
superstructure is aluminum, it starts to melt and then burn. It makes
damage control problematic, to say the least)
>>>Also, if you know an aluminium object is going to be immersed in salt
>>>water, why doesn't a layer of waterproof paint solve the problem?
>>
>>It chips and cracks in use, especially near moving parts.
>
> It apparently depends greatly on the aluminum alloy that is used. Many
> aluminum boats are used in salt water. They are not anodized. Neither
> are the aluminum masts. The problem with airplanes is that they were
> not designed for nor tested to withstand salt water immersion. And it
> is probably really hard to wash the salt off to make sure nothing bad
> happens 10 years down the road.
If you think about all the things _in_ an aircraft that would need to be
torn out and replaced, plus needing the entire structure opened up and
thoroughly washed out with fresh water, it almost certainly would be
much cheaper to simply buy a new aircraft. There are seaplanes built to
operate on salt water, but even they would be scrapped if they sank in
salt water, I think (unless they're irreplaceable for some reason). For
naval aircraft, especially carrier-based, corrosion control due to
salt-spray is truly a major concern.
--Larry
Or unroll the loop body (or its equivalent) by K, and place an @YIELD
at the exit to catch premature terminations (e.g., when the search is
done). You have then reduced the dynamic number of yields to roughly
1/K of the original, at a code size cost.
One could also break a loop into two, with the inner loop having known
bounds. Since the bounds are known, there is a known maximum distance
between yield checks.
/* loop does K*n iterations */
for ii = 1 to n
    for i = 1 to K      /* K constant */
        <body>
    end
    @yield
end
@yield
For large enough K, this could save code space.
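In C the same strip-mining transformation might look roughly like this;
body(), yield() and K are placeholders, not anything from real code:

#define K 10                         /* fixed strip length */

static void body(long i)  { (void)i; /* stands in for <body> above */ }
static void yield(void)   { /* cooperative yield point, as discussed */ }

/* Replaces "for (t = 0; t < total; t++) body(t);" with at most K body()
 * calls between consecutive yield checks.
 */
static void strip_mined(long total)
{
    long t = 0;
    while (t < total) {
        long stop = (total - t < K) ? total : t + K;
        for (; t < stop; t++)
            body(t);
        yield();                     /* bounded distance since last yield */
    }
}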
Best,
Thomas
--
Thomas Lindgren
"It's becoming popular? It must be in decline." -- Isaiah Berlin
It could be done, but doesn't provide any of the advantages of a
visible yield, such as the ability to have a lot of registers without
a very large context. Or instruction sequences that can't be split,
so that they can pass hidden data (think prefetch etc.). And so on.
>Now you don't have to worry about proving that code paths have length
>M or less and so forth. My guess is that compilers would end up forced
>to stick yield instructions into nearly every basic block, which would
>be a bummer.
No, they wouldn't. There are a lot more small basic blocks caused by
conditionals (including conditional functions) than you think.
>Nick> TLB misses are similarly easy to deal with
>
>I disagree.
Well, it has been done very successfully, very often. It was a solved
problem before 1970.
>Nick> and machine checks are separated into 'process abort' ones and
>Nick> ones fixed up in hardware and queued to a separate thread for
>Nick> logging.
>
>How big is the queue? What happens when it overflows? Interrupts are
>pretty nice for all these things where hardware provides finite
>resources and software virtualizes them to appear infinite.
Exactly the same as when the interrupt rate exceeds what current
systems can cope with, except that it is rather easier to
detect and handle. In particular, aborting the process that queues
such a machine check when the queue becomes full is likely to kill
the right process; current mechanisms more often kill the system.
Regards,
Nick Maclaren.
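As a sketch of that "queue it to a logging thread, abort the offender on
overflow" policy; the data layout and names are invented for illustration,
and the memory-ordering details a real SMP queue would need are ignored:

#include <stdbool.h>
#include <stddef.h>

#define LOGQ 256                         /* finite, like any real queue */

struct mcheck { int source; unsigned long detail; };

static struct mcheck logq[LOGQ];
static size_t head, tail;                /* single producer, single consumer */

/* For the "fixed up in hardware, just log it" class of machine checks.
 * Returns false when the queue is full; the caller then aborts the
 * process that generated the flood rather than taking the system down.
 */
static bool mcheck_enqueue(struct mcheck m)
{
    if (head - tail == LOGQ)
        return false;                    /* overflow: abort the offender */
    logq[head % LOGQ] = m;
    head++;
    return true;
}

/* The logging thread drains the queue at its leisure. */
static bool mcheck_dequeue(struct mcheck *out)
{
    if (tail == head)
        return false;
    *out = logq[tail % LOGQ];
    tail++;
    return true;
}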
And again, and again. Yes, it is a very old, and well tried, idea.
>In 1972, at Digital Communications Associates, I designed the kernel
>for a realtime embedded operating system for a networking node which was
>based on *precisely* that notion: The system ran with interrupts *OFF*,
>and the programmers were required to insert "@YIELD" macros[1] every
>so often[2] in the code. The "@YIELD" macro was logically a no-op[3]
>unless some I/O event had occurred that needed to run a task of higher
>priority than the current task. Worked like a charm!!
Interesting. It was quite popular with the networking people at about
that time, probably on the grounds that similar problems lead to
similar solutions.
>Despite the fact that the @YIELDs were inserted manually, since the rules
>were very simple[2] and compliance was easy to verify by inspection, we
>had almost no bugs due to misplaced @YIELDs. [Lots of *other* bugs, but
>not those.]
I would make a small bet that it was precisely because the rules were
so simple ....
Regards,
Nick Maclaren.
Maybe I am a bit out of date then, but you are certainly correct about
the issues. However, I am not totally convinced about the resistance
of any aluminium alloy to 25 years of exposure to sea water, especially
when in contact with other metals.
Regards,
Nick Maclaren.
> In article <C3WdnaCQlP1...@speakeasy.net>,
> Rob Warnock <rp...@rpw3.org> wrote:
>>In 1972, at Digital Communications Associates, I designed the kernel
>>for a realtime embedded operating system for a networking node which was
>>based on *precisely* that notion: The system ran with interrupts *OFF*,
>>and the programmers were required to insert "@YIELD" macros[1] every
>>so often[2] in the code. The "@YIELD" macro was logically a no-op[3]
>>unless some I/O event had occurred that needed to run a task of higher
>>priority than the current task. Worked like a charm!!
>
>
> Interesting. It was quite popular with the networking people at about
> that time, probably on the grounds that similar problems lead to
> similar solutions.
I used a very similar idea to run the original IBM PC serial ports at
115 Kbit/s: Since these ports were totally unbuffered, this really
couldn't be done except by polling with all interrupts off.
The way it worked was that the two endpoints started by agreeing on a
basic block size N, calculated to be as large as possible without either
end ending up with serious problems caused by lost interrupts.
During the actual transfers each packet was sent by splitting it up into
fragments of size N, with explicit synchronization between each fragment.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
There are probably some aluminium dinghies still around that are getting
on for that old. One of my dad's wore through on the bottom by being run
up on the beach, before it showed any signs of corrosion. The trick
really is not to mix metals. Long-lived aluminium dinghies are pretty
close to 100% the same stuff everywhere. Masts survive steel fittings
only by dint of good anodizing to start with, perhaps painting and
definitely sealing, fastidious cleaning, and (generally) not getting too
salt-wet in the first place. The spars on my Laser don't get any of that,
unfortunately, and many of the rivet holes are badly corroded already,
after only a few years. I will have to replace them soon.
As far as I know, anodizing is just a really thick layer of aluminium
oxide. I'm afraid that I don't know how that's applied, though.
--
Andrew
> 2) Spend a few billion out of our civilization's vast economic surplus
> cranking more speed out of the algorithms we do have.
If you want to use "our civilization's vast economic surplus" to stop
people from dying from things like cancer, AIDS, malaria, I think
there's no need for faster computers or better molecular biology.
At least for AIDS and malaria, it has to do with hygiene and education,
which we know very well how to provide. I.e. the problem is political.
Same thing with non-diseases like malnutrition, of course.
Technological progress only allows us to cure, not to prevent, but
prevention is generally the cheapest and most effective solution to
problems like diseases.
Stefan
> While you are basically correct, it isn't quite as cut and dried as
> that. The real use of lots of cycles is to model what is happening
> in a more physically realistic way (such as including the presence
> of salt ions in the water!) While this will remain unreliable, and
> will not lead directly to any benefit, it may lead to understanding,
> which may then lead to such benefit.
I agree entirely. My point was more that you can't justify this sort
of research by claiming it is urgently needed to save lives, and solve
current medical problems. Indeed it may, eventually. But the road
from the lab to the clinic is a long and circuitous one.
-- Dave
> I'm no expert in molecular biology, but I've seen people who are,
> claim that adding a couple more orders of magnitude in computing power
> will indeed be scientifically useful...
I have no doubt that it will be useful. For me, the question is,
whether there is sufficient reason now to justify a "mega project" to
provide that power, rather than waiting a few years for it to arrive
in due course as general purpose computers are improved.
-- Dave
>Russell Wallace <wallacet...@eircom.net> wrote:
>
>> I'm no expert in molecular biology, but I've seen people who are,
>> claim that adding a couple more orders of magnitude in computing power
>> will indeed be scientifically useful...
>
>I have no doubt that it will be useful. For me, the question is,
>whether there is sufficient reason now to justify a "mega project" to
>provide that power, rather than waiting a few years for it to arrive
>in due course as general purpose computers are improved.
>
No one really knows the answer to that. What bothers me is that
various branches of the federal bureaucracy have made bold statements
about the importance of large-scale computation to practically every
area of economic growth where technology is involved. I'm not going
to go bother to find the quotes, because they're more or less what
you'd expect.
By comparison with the rhetoric, the projects they have funded are
lame and unimaginative. The DOE asked IBM to build them a computer
that would leapfrog the Earth Simulator on Linpack, and that's exactly
what IBM is doing. Now the politicians and bureaucrats can go back to
fundraising and drawing their paychecks, respectively.
If Blue Gene is the best the US can muster to lead us into a new era,
we're in deep trouble. Blue Gene just doesn't have what it takes to
make a meaningful impact on molecular biology. What it will produce
is more of what other programs like it have produced: lots of color
plots.
RM
[SNIP]
> If Blue Gene is the best the US can muster to lead us into a new era,
> we're in deep trouble. Blue Gene just doesn't have what it takes to
> make a meaningful impact on molecular biology. What it will produce
> is more of what other programs like it have produced: lots of color
> plots.
Pfft, you lack imagination. I've worked on a (very few)
apps that I'd kill to run on hardware like that. With
any luck lots of other people will think like you and
I'll be able to pick up a box or two from the skip in a
couple of years time. Hell, if it runs as cool as some
of the claims I've seen it'll be a perfect box to run
next to my PC. :)
If these boxes are good for signal-processing, I'm sure
that there could be a few big contracts for mobile/semi
-portable applications.
Cheers,
Rupert
My parents own a 25-ish year old aluminum outboard boat,
which other than having been designed for calm lakes and used
in rough salt water most of its life, has held up ok.
And as others have pointed out, it is used in ships
and in yacht hulls to some extent.
Marine-friendly aluminum alloys form a hard aluminum oxide
layer which resists further pitting or corrosion, on contact
with salt water or salty air.
But they don't do terribly well with salt or salt water in
narrow crevices, in a lot of cases. And as has been pointed
out in other posts, using different materials with differing
cathodic/anodic galvanic potentials will rapidly corrode out
the anodes.
The key problems with aircraft which get immersed are;
1) Most aircraft structures are thin sheets and stiffeners
which are riveted together, and the whole inside of the
contact areas between sheets and stiffeners and such will
absorb some water and may corrode out from the inside of
those joints.
2) Some aircraft structural alloys do very poorly in contact
with salt water, as opposed to the marine grade and more
general purpose alloys.
One might posit that you could go in and drill out every
single rivet in the structure, completely disassembling it,
and then clean and reassemble the structure and reuse it.
But that will cost as much as a new airframe.
And all the systems: all the wiring, the engines, the
interiors, etc, those all are going to need to be junked
anyways, because some of them react even worse to the
saltwater than aircraft structures do.
Easier to just write it off and junk it. The salvage value
of most of the components is zero, and the rehabilitation
efforts are extreme...
-george william herbert
gher...@retro.com
I think the problem I see with your attitude on this issue is
that the truly fundamentally hard problems seem to some large
degree to be algorithms and software rather than hardware.
If there were a software pull, someone would push hardware
in that direction. There is good reason to think that from
a technical perspective, developing the sorts of software to
do those problems better would be good, but it's proven rather
hard in practice, including in PhD theses and random researchers
reaching out in freethinking directions. Even if it had to be
hand-done to some degree, if it was doable for some of the problems
there would be money right there for doing it.
Throwing money at hardware problems to run the known algorithms
in larger parallelism works, to a predictable degree. To make the
great leaps, we need some great leaps in concept and algorithm
which have as of yet at least largely eluded humankind as a whole.
That is not the sort of problem that you can solve with money.
It's waiting the right bright person and the right supporting
developments for them to make the right insights.
The question is: given the value of some of those problems,
is it worthwhile for society to spend money on scaling up the
ugly hack way we do it now, or should we just not bother
and wait for the great leap that may come?
The government wants to do some of those things now, and commercial
companies are making money off doing some of those things now,
so I am guessing that the value of those problems is worth the
incremental improvements we can make now.
Banging your head on the wall and demanding that the next
Einstein, Feynman, Hawking pop up and solve 'the problem' is
not a reasonable strategic plan 8-)
-george william herbert
gher...@retro.com
>If these boxes are good for signal-processing, I'm sure
>that there could be a few big contracts for mobile/semi
>-portable applications.
Given that the market for embedded many-cpu message-passing boxes
already buys a lot of PowerPC-based boxes...
-- greg
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:ovvksvc8sultq9ald...@4ax.com...
>
>[SNIP]
>
>> If Blue Gene is the best the US can muster to lead us into a new era,
>> we're in deep trouble. Blue Gene just doesn't have what it takes to
>> make a meaningful impact on molecular biology. What it will produce
>> is more of what other programs like it have produced: lots of color
>> plots.
>
>Pfft, you lack imagination.
Possibly so, but I haven't engaged in the kind of overblown rhetoric
the DoE has to justify this machine, whose only real purpose is to
knock the Earth Simulator off the top spot
http://www.ultrasim.info/esrr_meeting/bland.ppt
They're not even clever enough to cover their own tracks. The meeting
was called
"Earth Simulator Rapid Response Meeting"
Was the US under attack from Japan in April 2002 and were they using
the Earth Simulator to do it?
At that meeting, IBM was the only vendor deemed capable of delivering
50T (wonder where *that* number came from?) with "stable software" and
a "viable interconnect".
So the US spent however many million it was...not for protein science,
not even for the most over-analyzed weapon in all of human history,
but so it could one-up Japan.
Pfft, yourself. ;-).
>I've worked on a (very few)
>apps that I'd kill to run on hardware like that. With
>any luck lots of other people will think like you and
>I'll be able to pick up a box or two from the skip in a
>couple of years time. Hell, if it runs as cool as some
>of the claims I've seen it'll be a perfect box to run
>next to my PC. :)
>
What you will discover is that you will be unable to pay the
electricity bill, and, even if you were, by the time one becomes
available to you at a reasonable price, you'd be better off buying new
and even more energy-efficient hardware.
>If these boxes are good for signal-processing, I'm sure
>that there could be a few big contracts for mobile/semi
>-portable applications.
>
Never said it wasn't a good box. I believe I said I thought IBM was
proud of the box, and they should be. Grabbing a bunch of them and
lashing them together at LLNL, however, served no purpose whatsoever,
though, other than to one-up the ES.
Now that you've provoked me, that money would have been MUCH, MUCH,
MUCH better spent on basic research to get a handle on how to program
these damn things and on something of the kind that DARPA would fund.
DARPA would be embarrassed by Blue Gene, too.
RM
<snip>
>
>I think the problem I see with your attitude on this issue is
>that the truly fundamentally hard problems seem to some large
>degree to be algorithms and software rather than hardware.
>
I've been around the mulberry bush at least once on this subject on
comp.arch. It is almost a religious belief with me that thinking
about hardware and software separately is the road to ruination. Even
if it's only conceptual hardware, as in Turing's amazing little
machine, you'd better have something concrete in mind.
>If there were a software pull, someone would push hardware
>in that direction.
The only example that comes to mind is LISP machines, which (and
someone will correct me if I'm wrong) I believe came to nothing.
Other than that, hardware has pulled software, at least in my little
universe of computational physics:
Vector algorithms.
Stream processing.
Cache blocking.
Instruction chaining.
Other than that, it's been a matter of trying to figure out how to get
hardware to do things that it doesn't do very well naturally, like
pointer chasing.
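Of the items in that list, cache blocking is the easiest to show in a
few lines. A generic sketch (the textbook tiled matrix multiply,
illustrative only, not code from any of the programs under discussion):

/* Operate on BxB tiles so that each tile of b and c stays resident in
 * cache while it is reused.  The caller zero-initializes c. */
#define N 512
#define B 64            /* tile size, tuned to whatever cache you care about */

void matmul_blocked(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i][k];        /* reused across the j loop */
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}

The algorithm is unchanged; only the order of the memory traffic is,
which is exactly the sense in which the hardware pulled the software.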
In the case of the current tired horse I've been flogging, it's how
many problems can you cast into a streaming format and what kind of
payoff is there for doing it.
>There is good reason to think that from
>a technical perspective, developing the sorts of software to
>do those problems better would be good, but it's proven rather
>hard in practice, including in PhD theses and random researchers
>reaching out in freethinking directions. Even if it had to be
>hand-done to some degree, if it was doable for some of the problems
>there would be money right there for doing it.
>
I do not think that either Church or Turing would be pleased to see
the mess that has been made of their elegant formulations. I don't
think it would hurt at all to invest in some mathematical
street-cleaning. If enough of that kind of work were funded, another
Church or Turing would stumble along soon enough.
>Throwing money at hardware problems to run the known algorithms
>in larger parallelism works, to a predictable degree.
and unless you can get a qualitatively better result, you are just
throwing money away.
>To make the
>great leaps, we need some great leaps in concept and algorithm
>which have as of yet at least largely eluded humankind as a whole.
>That is not the sort of problem that you can solve with money.
No, it's better to let starving mathematicians do it on their own so
that lesser figures with swollen egos can come along later and get on
the cover of Time magazine.
>It's waiting the right bright person and the right supporting
>developments for them to make the right insights.
>
And it is somehow better to throw huge sums at arbitrarily large
machines than it is to fund smaller scale but more risky research?
>The question is: given the value of some of those problems,
>is it worthwhile for society to spend money on scaling up the
>ugly hack way we do it now, or should we just not bother
>and wait for the great leap that may come?
>
Blue Gene was a knee jerk, and pretty much wasted, response to an
external stimulus. Somebody needs to be embarrassed.
If you can't actually *do* the real problem (e.g. protein folding),
spend your time on paper studies and wait for the hardware to catch
up, which it will.
>The government wants to do some of those things now, and commercial
>companies are making money off doing some of those things now,
>so I am guessing that the value of those problems is worth the
>incremental improvements we can make now.
>
The US is saving itself the embarrassment of being No. 2 in an
important technology as a direct result of letting the PC market
fund processor research for over a decade.
>Banging your head on the wall and demanding that the next
>Einstein, Feynman, Hawking pop up and solve 'the problem' is
>not a reasonable strategic plan 8-)
>
I don't bang my head against the wall, and I'm not a particular
admirer of any of the three figures you named.
Sooner or later, Eugene Miya will reappear and remind me that, after
all, there are some things that just can't be done, and I will again
respond that, if you are persistent, you may not get an exact answer,
but you can come as close as your patience and persistence permit.
RM
>If you want to use "our civilization's vast economic surplus" to stop
>people from dying from things like cancer, AIDS, malaria, I think
>there's no need for faster computers or better molecular biology.
>At least for AIDS and malaria, it has to do with hygiene and education,
>which we know very well how to provide. I.e. the problem is political.
There are plenty of diseases for which political solutions won't get
very far; but let's take the two you mentioned, which indeed could for
the most part be dealt with politically. Political problems tend to be
more intractable than technical ones. Would you really want to live in
a world that would be willing and able to enforce a ban on promiscuous
sex, even in countries that don't want such a ban? Can you think of a
way to get the environmentalists to not fight mass spraying of
insecticide in malaria-infested areas?
I'm inclined to think figuring out the molecular basis of these
diseases so we can develop better treatments is actually an easier
problem to solve.
--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace
>They're not even clever enough to cover their own tracks. The meeting
>was called
>
>"Earth Simulator Rapid Response Meeting"
>
>Was the US under attack from Japan in April 2002 and were they using
>the Earth Simulator to do it?
Would you rather they were?
Sixty years ago, the Americans and Japanese were competing by dropping
bombs on each other. Now they're competing by building faster
computers to help solve humanity's problems. This strikes me as a very
good thing.
>Now that you've provoked me, that money would have been MUCH, MUCH,
>MUCH better spent on basic research to get a handle on how to program
>these damn things and on something of the kind that DARPA would fund.
I think we should do both. One thing history tells us is that you
really can't predict which approach will give the best results. Often
it happens that progress mostly comes from a lot of incremental
advances of the kind Blue Gene represents - so we should be happy
about it, not complain it's not as revolutionary as we'd like.
Sometimes someone comes up with a brilliant idea out of the blue that
actually works - so we should keep looking for that too, but we
shouldn't put everything else on hold in the hope of finding it.
In particular, someone else remarked that the sort of software
optimizations we'd need to program a dataflow machine effectively at
reasonable cost in programmer time, are related to the ones we need to
improve utilization of existing hardware. That strikes me as worth
following up on; what exactly are the obstacles to making existing
number-crunching code more stream-like?
> +---------------
> | My guess is that compilers would end up forced to stick yield
> | instructions into nearly every basic block...
> +---------------
>
> Not necessarily -- only those blocks which end with a *backwards*
> branch. For branches that implement an N-way fork/merge flow path,
> the compiler can just keep track of the maximum duration of any
> fork, for example. [But also see the above-mentioned papers on
> deferring other kinds of tests, which cover the case when one fork
> is much more expensive than the others.]
>
> But even if you did have to...
In similar vein the Transputer treated unconditional backward branches
as places where the CPU could be rescheduled to another thread.
(Which is why the registers (A,B,C) were documented as not being preserved
over a backward branch).
IIRC these were the points at which time-slicing occurred.
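What that placement rule looks like at the source level, sketched in C
rather than occam or Transputer code; yield_check() here is a
hypothetical stand-in for the descheduling point the hardware takes at
a backward jump:

/* "Yield only at backward branches": the check sits on the loop
 * back-edge, so straight-line forward code never pays for it. */
static void yield_check(void)
{
    /* a real system would ask the scheduler here whether another ready
       process should run; registers need not survive this point */
}

long sum(const long *v, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++) {
        s += v[i];
        yield_check();   /* just before the backward branch that closes the loop */
    }
    return s;
}

Since any loop must eventually take a backward branch, bounding the
time between yields reduces to bounding the straight-line path through
the loop body.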
--
-- Jim
James Cownie <jco...@etnus.com>
Etnus, LLC. +44 117 9071438
http://www.etnus.com
It was proposed, quite earnestly, early in the public history of AIDS in
Sweden. While there was the obvious outcry arguing based on human rights,
I'm convinced the proposal died for quite another reason: There simply is
no way to achieve statistically valid results at current non-African
levels of prevalence - the usual hypothesis-testing dilemma of balancing
false-positives vs false-negatives. Currently, an "AIDS test" actually is
two tests in sequence, with carefully balanced error rates. Of course, you
can tolerate substantial false-positive rates for things such as blood
and blood products, but they are another issue.
Jan
Where that is necessary, yes they do (e.g., when operating on bones). But
remember that a lot of your body is anything but sterile - it makes no sense
to have clean room conditions when opening up intestine, for instance.
Jan
The UK has the means, but it requires a declaration of a State of
Emergency. If the current anti-Terrorism law goes through, that
obstacle will be removed.
There are several ways of tackling the problem you mention, but I
agree that all are problematic.
Regards,
Nick Maclaren.
Yes they have tried. Whether it would "work" in any useful way is highly
doubtful. The basic principles of infection have a strong tendency to get
in your way. Unless you manage to build an organism with both high lethality
_and_ high latency-to-disease, you won't succeed - and that's where those
basic principles get in your way.
Even Yersinia pestis, at its height, only managed to kill about a quarter
of the population in urban European areas.
> 4. There are now bacterial infections that are incurable because there
> are strains of bacteria that are resistant to all known drug
> therapies. It is not impossible to imagine reaching the point that we
> will simply not be able to operate hospitals because they are such
> effective repositories for such bacterial strains.
Almost all of that problem is of medicine's own making - undisciplined
use of its tools. The US of course is a leader here as well.
> While I don't want to turn this into a tussle with Mr. Hinds or anyone
> else about what is and is not possible, the fact is that we know the
> ab initio equations that apply, and we know how to solve them. The
> kind of grimy guesswork that Mr. Hinds is familiar with is
> necessitated not by any lack of fundamental understanding, but by a
> lack of raw computing power.
That's where I disagree most. There's a lack of interest and will in
attacking these kinds of problems. We do _not_ know enough; we have very
little fundamental understanding of what is going on in human bodies.
Hell, IMNSHO it's more complicated by orders of magnitude than gaining
an understanding of society and in particular economics - and look where
the state of the art in that subject is! Adiabatic deviations from
thermodynamic equilibrium under the assumption of global, almost perfect
knowledge will get you a Nobel prize...ugh.
Jan
Oh, I think most here will agree with that.
But the development needs to be driven by the application needs, top-
down, not from the hardware side, bottom-up.
The real lack, both in people, interest and policy, is in interdisciplinary
research. Look at who gets tenured professorships and grants, and who
doesn't. Science is turning ever more fractionated and isolationistic.
(Yes, there's lot of talk, but the actions show clearly that the talk is
just facade.)
Jan
<snip>
> 2. Infectious diseases arise as if out of nowhere and because of
> modern transportation girdle the globe in a matter of days. It is
> astonishing that HIV hasn't transformed itself into something more
> durable, virulent and deadly because of the frequency with which it
> mutates and reproduces itself. It seems as if the SARS epidemic was
> an exceedingly close call.
There's a relationship between how fast a disease develops and how far
it propagates: The slower the development the further it spreads before
it is detected.
SARS is a good example of this. Symptoms develop quickly and people are
put in quarantine. Heck, even the Chinese managed to get SARS under
control. So the fear of a flu-like HIV killer is, IMHO, an overreaction
(at least from a global health point of view). Another example is Ebola:
it spreads pretty much on touch (body fluids, though) and kills quickly ->
carriers get wiped out quickly.
HIV is pretty much as bad as it gets. The combination of a long (years)
incubation period, combined with the fact that it is infectious very
soon after infection allows it to spread wide. The only thing that
compares is the plague which incidentally had a similar pattern
(although the timescales are much smaller).
Martin
Being able to patent a substance as such, instead of the procedure of making
and/or using it, is a disease of the US patent system, unfortunately slowly
spreading elsewhere. I do hope a cure will be found in time.
Jan
CDC 6600 and the PPUs, right?
Jan
Another example: the transputer. There, the "yield" instruction is
a jump or any comms instruction. Although it was theoretically possible to
lock out other processes at the same priority by running "non-yielding"
code, I've never heard of a case, and the compilers certainly didn't care
- "normal" code contains enough of the above set of instructions...
And, of course, all the other benefits mentioned elsethread, in particular
fast context switch times, were realised on the transputer: After such
instructions, the evaluation stack was documented as being unpredictable,
so that the (hardware) scheduler only had to save one word (the PC in a
slot of the workspace provided for that purpose).
Jan
<snip>
>
>> While I don't want to turn this into a tussle with Mr. Hinds or anyone
>> else about what is and is not possible, the fact is that we know the
>> ab initio equations that apply, and we know how to solve them. The
>> kind of grimy guesswork that Mr. Hinds is familiar with is
>> necessitated not by any lack of fundamental understanding, but by a
>> lack of raw computing power.
>
>That's where I disagree most. There's a lack of interest and will in
>attacking these kinds of problems. We do _not_ know enough; we have very
>little fundamental understanding of what is going on in human bodies.
>Hell, IMNSHO it's more complicated by orders of magnitude than gaining
>an understanding of society and in particular economics - and look where
>the state of the art in that subject is! Adiabatic deviations from
>thermodynamic equilibrium under the assumption of global, almost perfect
>knowledge will get you a Nobel prize...ugh.
>
You are overstating what I wish to claim is possible. Understanding
the folding of an individual protein is the first step. Understanding
how the protein functions within a cell...never mind how it interacts
with the entire human body...that is beyond even contemplation at the
moment, although looking at isolated interactions of proteins does
seem like a plausible and doable next step.
The kinds of protein models that are most complete at the moment use
classical mechanical models of atomic interactions, heuristic rules
about how bonds behave when they are subjected to bending moments or
torsion, and potentials for long-range interactions.
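For readers who have not seen one, the energy terms in such
semi-classical models look roughly like the following. The functional
forms are the generic textbook ones (harmonic bonds and angles, a
periodic torsion term, Lennard-Jones plus Coulomb for the long-range
part); no real force field's parameters appear here:

#include <math.h>

/* Harmonic bond stretching about an equilibrium length r0. */
double bond_energy(double r, double r0, double kb)
{ return 0.5 * kb * (r - r0) * (r - r0); }

/* Harmonic angle bending about an equilibrium angle th0. */
double angle_energy(double th, double th0, double ka)
{ return 0.5 * ka * (th - th0) * (th - th0); }

/* Periodic torsion (dihedral) term of multiplicity n and phase gamma. */
double torsion_energy(double phi, double vn, int n, double gamma)
{ return 0.5 * vn * (1.0 + cos(n * phi - gamma)); }

/* Non-bonded pair: Lennard-Jones 12-6 plus a Coulomb term
 * (unit-conversion constants folded into the charges). */
double nonbonded_energy(double r, double eps, double sigma,
                        double qi, double qj)
{
    double sr6 = pow(sigma / r, 6.0);
    return 4.0 * eps * (sr6 * sr6 - sr6) + qi * qj / r;
}

Everything interesting is hidden in the parameters, which is precisely
where the guesswork complained about below comes in.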
The statement is made that quantum mechanics is important only when
covalent bonds form. The latter is a statement of necessity and not
of fact. No one knows when qm is important and when it is not because
very few of the calculations have been done--none, in fact, that I
know of for a complete protein.
The semi-classical approach gets you into an inexact, time-consuming,
and contentious program of research: guess, compute (possibly for
years), compare with experiment, have a meeting and argue about it,
guess again, ad nauseam, with no guarantee that the process will ever
converge or if it does (most likely because people are simply worn
out) that it has converged to the right answer. Much better, and I
believe possible, just to solve the correct ab initio equations,
probably making some acceptable approximations about the behavior of
inner shell electrons that clearly do not participate in bonding.
As to Nobel prizes...save them for someone younger who can enjoy a
life of pointless adulation.
RM
I suspect that what's *really* happened here is that there is
a super-unsexy or super-top-secret application that has been
languishing without funding. So, acting on behalf of themselves
or another organisation the DoE has put their case in terms
that aggressive xenophobic politicians can understand.
ie : "Their National Penis is Bigger than our National Penis".
The politicians get to parade their new bigger penis in public,
the DoE gets a pile of machinery to accomplish a task (whether
that's the publically stated one or otherwise). Everyone is
happy except RM. :)
> http://www.ultrasim.info/esrr_meeting/bland.ppt
>
> They're not even clever enough to cover their own tracks. The meeting
> was called
>
> "Earth Simulator Rapid Response Meeting"
>
> Was the US under attack from Japan in April 2002 and were they using
> the Earth Simulator to do it?
I figure it's just a usefully exploitable side-effect of the
biological reproductive urge to be honest. Scientists and
Engineers can be amazingly devious when they want to be. :)
[SNIP]
> What you will discover is that you will be unable to pay the
> electricity bill, and, even if you were, by the time one becomes
> available to you at a reasonable price, you'd be better off buying new
> and even more energy-efficient hardware.
Actually this is worth attacking.
Firstly from what I know about BlueGene/L it is modular.
I don't need to salvage an entire 64K node machine from
the dump. I could salvage a single rack which burns 20kW,
and if it gets too hot with the windows open I could power
down some shelves. Fingers crossed: if the nodes aren't
doing anything they may well draw bugger all current, so
I might not even have to bother powering stuff down. In
essence I have much finer grained control over the power
consumption than I would with large CPUs and faster wider
memory subsystems.
As for energy efficient... That's a substantial part of
the thinking behind BlueGene/L, they make some very big
claims on this score. *IF* their claims of energy efficiency
are to be believed then I suspect BlueGene/L machines
could well have a longer useful service life than a pile
of big cores.
The failure rate they are aiming at for a 64K node machine
is 1 every 10 days. One node going down in this kind of set
up isn't necessarily a show-stopper. The approach they are
taking appears to make failure containment a lot easier
than in a SMP cluster type scenario. Looking at it in terms
of impact on production :
One node dies every 10 days (2 cores per node), so you lose
1/32768 of your peak throughput.
If you look at a 256 way big-core machine with 4 node SMP
boards you could well be looking at losing 1/64 of the
throughput every 10 days. That could hurt.
Of course that hit might well be offset by the difficulties
of harnessing 65536 mice to your plough. :)
> >If these boxes are good for signal-processing, I'm sure
> >that there could be a few big contracts for mobile/semi
> >-portable applications.
> >
> Never said it wasn't a good box. I believe I said I thought IBM was
> proud of the box, and they should be. Grabbing a bunch of them and
> lashing them together at LLNL, however, served no purpose whatsoever,
> though, other than to one-up the ES.
I imagine it opened some wallets and gave a number of
politicians' penises some fresh air. Willy Waving makes
the world go around unfortunately. :(
> Now that you've provoked me, that money would have been MUCH, MUCH,
> MUCH better spent on basic research to get a handle on how to program
> these damn things and on something of the kind that DARPA would fund.
> DARPA would be embarrassed by Blue Gene, too.
Personally I figure that these things are useful enough
already without finding new applications for them. It
would be nice to find new apps, but they don't *need*
them to justify their existence, IMO.
There have been some exceptions to that, such as the
Connection Machines... Hardware looking for an app if
ever I saw it. :)
Cheers,
Rupert
Robert, you need to tone down the messianic pronunciations a little. You
are not the only smart person looking at these problems. Just because you
don't agree doesn't mean the decision was wrong. And in quite a few cases
it really isn't a zero sum game.
In my own field I have on several (numerous?) occasions realized, in
retrospect, that things that were done or not done turned out to be
reasonable decisions even though at the time I thought they were dumb. And
sometimes things I thought were great ideas turned out to be dumb.
So have a little humility, try to be objective.
With all due respect,
del
The argument that computation power is limited by flops seems bogus to
me. It would lead to the conclusion that all the work in, say, sorting is
in the comparisons, and that the permutation is inherently free. Or, to
take another example, compilers spend most of their time creating data
and moving it around. Does that mean that compilation has potentially
zero energy cost?
I should probably read one of those papers on the inherent energy cost
of computations...
--
David Gay
dg...@acm.org
>
>Robert Myers <rmy...@rustuck.com> writes:
>> If you believe that joules per flop is the ultimate limiting factor,
>> then, yes, movement of data on the chip is the performance limiter.
>> Movement from a functional unit to a register accomplishes nothing,
>> and, as feature sizes shrink, the energy cost of moving the data will
>> exceed the cost of performing the arithmetic.
>
>The argument that computation power is limited by flops seems bogus to
>me. It would lead to the conclusion that all the work in, say, sorting is
>in the comparisons, and that the permutation is inherently free. Or, to
>take another example, compilers spend most of their time creating data
>and moving it around. Does that mean that compilation has potentially
>zero energy cost?
>
Oddly enough, even though flops are the wrong unit to be using when
talking about calculations that are not floating-point intensive, the
argument goes through the same way. Just change the sentence to:
If you believe that joules per useful unit of work done is the
limiting factor, then movement of data on the chip is the performance
limiter.
And, no, the conclusion is not that compilation (or other similar
tree-traversing activities, like AI) will be free. Quite to the
contrary, it says that those kinds of applications will be more
resistant to low-energy strategies than traditionally power-hungry
floating-point intensive calculations, where, more often than not, you
know how to exploit code and data locality much more effectively.
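Spelled out, the accounting being argued over is roughly the following;
the symbols are generic, not taken from any particular paper:

    E_{\mathrm{total}} \;=\; N_{\mathrm{op}} E_{\mathrm{op}} \;+\; \sum_{\ell} B_{\ell} E_{\ell}

where N_op is the number of useful operations (flops, comparisons,
whatever the workload's natural unit is), E_op the energy per
operation, B_l the bytes moved at each level l of the hierarchy
(registers, L1, L2, off-chip), and E_l the per-byte cost at that level.
The claim is that E_op shrinks faster than the E_l as wires come to
dominate, so workloads with poor locality see the second term dominate
first.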
RM
>Possibly so, but I haven't engaged in the kind of overblown rhetoric
>the DoE has to justify this machine, whose only real purpose is to
>knock the Earth Simulator off the top spot
>
>http://www.ultrasim.info/esrr_meeting/bland.ppt
I think my major concern around this whole situation is that the
purpose of the Earth Simulator doesn't seem to have been "to knock
ASCI white off the top spot". I think the really interesting product
of the Earth Simulator so far isn't the machine, or the
fourteen-figure Linpack flops figure, but the annual report
http://www.es.jamstec.go.jp/esc/images/annualreport2002/index.htm
of the kinds of things people are doing with it. I haven't seen
anything remotely comparable for the ASCI machines; of course the
major application for the ASCI machines does not produce publishable
papers, but I haven't seen even references to a compilation of
"interesting unclassified things done on the ASCI systems".
I may just be ignorant, it may be that the way funding bodies in the
US work mean that interesting calculations done on ASCI would be
dispersed across high-impact-factor journals rather than compiled as a
single annual report; if the "interesting work" compilation exists,
I'd very much appreciate a reference to it.
Tom
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:2m5lsv436n5nsmr6b...@4ax.com...
<snip>
>> >
>> Never said it wasn't a good box. I believe I said I thought IBM was
>> proud of the box, and they should be. Grabbing a bunch of them and
>> lashing them together at LLNL, however, served no purpose whatsoever,
>> though, other than to one-up the ES.
>>
>> Now that you've provoked me, that money would have been MUCH, MUCH,
>> MUCH better spent on basic research to get a handle on how to program
>> these damn things and on something of the kind that DARPA would fund.
>> DARPA would be embarrassed by Blue Gene, too.
>>
>You need to get your timeline straight. If Blue Gene/L is running today,
>the design was started long before 4/2002. One doesn't need a crystal ball
>to figure out that all this bio stuff would be a huge market, nor to look at
>the calculations to be made and define systems that folks knew how to build
>and knew how to program that could begin to do those calculations.
>
>Robert, you need to tone down the messianic pronunciations a little. You
>are not the only smart person looking at these problems. Just because you
>don't agree doesn't mean the decision was wrong. And in quite a few cases
>it really isn't a zero sum game.
>
>In my own field I have on several (numerous?) occasions realized, in
>retrospect, that things that were done or not done turned out to be
>reasonable decisions even though at the time I thought they were dumb. And
>sometimes things I thought were great ideas turned out to be dumb.
>
My only concern in writing the post to which you are responding was,
"Del is going to take this the wrong way."
If there is a huge commercial market for computers to do biotech
calculations and IBM knows how to make and sell them, then having our
national laboratories buying the biggest and the bestest of what IBM
is selling anyway just doesn't serve any purpose other than making
sure that our national labs have the biggest and the bestest.
Of course I know the project was started before 2002 because I cited a
1999 IBM whitepaper laying out a course of research for such a box.
The meeting in April 2002 was icing: they had to make sure that they
lashed together enough processors to be able to one-up Japan. Since
the solution is scalable, they could have skipped the travel expense
and meeting time and just gotten out a calculator to figure out how
many of IBM's boxes they had to buy and wire up.
I don't have a beef with IBM. I have a beef with the DoE. It's true
that this is comp.arch, not comp.national.policy, but this happens to
be a national policy question of great interest to many of those who
follow comp.arch.
As to it being a zero sum game: it's true, if they hadn't spent the
money on some big high-profile program that would top the Top 500 and
produce lots of color plots, they probably wouldn't have spent the
money at all.
Messianic or not, humble or not, I'm standing by my position: spending
money on monsters like Blue Gene is a waste. It won't make any real
progress on the problem it purports to be aiming at (I'm repeating
myself), and it would be *much* better to have tossed a few of those
boxes out to universities and let them figure out how to cope with a
few thousand of what Rupert refers to as mice, rather than putting
65000+ of them together to see what comes of it, other than a
meaningless Linpack score.
>So have a little humility, try to be objective.
>
I have the humility to know that when I say something blunt publicly,
I am taking the risk of being proven wrong just as bluntly and just as
publicly.
RM
Grrk. We all know that the design was started well before then - it
has been said in public often enough. But that is not the only point.
One DID and DOES need a crystal ball to know how to define appropriate
systems, because it was and is clear that most such calculations are
being done using ghastly methods. That means that there are likely
to be massive gains (perhaps tenfold, perhaps a thousandfold, perhaps
much more) in sorting out the algorithms.
That doesn't mean that Blue Gene/L isn't a reasonable project, but it
does mean that its design was much more based on guesswork than careful
analysis of the requirements. Which military strategist said something
like "Any action, even if it turns out to be incorrect, is better than
no action at all"?
Regards,
Nick Maclaren.
I don't get this.
You're pissed off because the notional declassified cover use
is inefficient, not unproductive but inefficient?
One, there are clear national security reasons to be building
boxes like this, as has been repeatedly pointed out.
Sufficiently accurate simulations of nuclear weapons to
avoid the need for testing, simulations that can include
defects and aging effects and the like, are a very important
national priority.
Two, boxes like this can do other types of work than protein
folding, in terms of FEM and fluid flows and weather analysis
and all sorts of other useful and neat things. Bigger boxes
can do it more accurately, or faster, or both.
Protein stuff was not the only reason to build them.
Protein stuff is funding, both commercially and in research/academic
environments, some large computing facilities. The people doing
that work are finding computational proteomics to be a field that
is extremely valuable to advancing theoretical research and commerce.
They are finding it to be an enabling technology even though it's
stuck with using obviously inefficient algorithms at this time.
Systems like Blue Gene will work just as well at brute-forcing
the protein folding and related problems as smaller systems are
at this time. You seem to be offended that we're attacking the
problem by brute force rather than finesse. That's fine,
as far as it goes. Finesse is always useful. But finesse requires
someone having finessed the problem so we can do it the better
way already.
But your value judgement of the low value of brute-forcing the problem
is not matched by the demonstrated academic, research, and commercial
user base's experience. I repeat: Experience. They are not buying
systems to have labs full of cool blinkie lights. They're doing their
work with it, producing papers and products.
I was involved in building and supporting one commercial
computational biology project, and did proposals for two
others in the hundreds of nodes range, not world-class by
current standards but large. A noticeable fraction of
the workdays of the researchers are being spent waiting
for computational results to come back. While spending
tens or hundreds of millions of dollars didn't make sense
for any of these customers, there are projects going on in
the field that will justify the use of systems of the
scale of Blue Gene.
That we theoretically should do much better eventually
once we figure out how to finesse the algorithms is of
no import to the calculation of current value of doing
current computational projects in those fields. If it is
worth doing now, with today's technology and algorithms,
then it will be done. And it's being done.
If there is money to be made now, and science and academic
careers to be advanced now, using today's technology and
algorithms, why on earth should people delay working
on the problems with today's solutions just because we
know that eventually, hopefully, we'll be able to do it
so much better and faster?
-george william herbert
gher...@retro.com
There's some stuff at:
http://www.sandia.gov/ASCI/
...but you seem to be right, there isn't a lot of focus on
the unclassified stuff being reported in summary to show what
sorts of science it can do.
-george william herbert
gher...@retro.com
>In article <2m5lsv436n5nsmr6b...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> wrote:
>
>>Possibly so, but I haven't engaged in the kind of overblown rhetoric
>>the DoE has to justify this machine, whose only real purpose is to
>>knock the Earth Simulator off the top spot
<snip>
>
>I may just be ignorant, it may be that the way funding bodies in the
>US work mean that interesting calculations done on ASCI would be
>dispersed across high-impact-factor journals rather than compiled as a
>single annual report; if the "interesting work" compilation exists,
>I'd very much appreciate a reference to it.
>
The portal to its unclassified computing the DoE wants you to use is
which will lead you to the Office of Science
There's lots of stuff they're proud to tell you about, but none of it
involves what's happening on the machines at classified sites.
http://www.sc.doe.gov/Sub/speeches/Congressional_Testim/7_16_03_testimony.htm
Testimony of Dr. Raymond L. Orbach
Director, Office of Science, U.S. Department of Energy
before the U.S. House of Representatives Committee on Science
July 16, 2003
is particularly revealing. He talks about scientific applications
needing 50-100 teraflops, then announces that *10%* of NERSC's *10*
Teraflop machine will be available on a peer-reviewed basis for
unclassified scientific research not directly related to DOE programs.
There's that 50 Teraflop number again! And, of course, he mentions
the Earth Simulator. But notice, garden variety scientists will never
see 50 Teraflops or anything like it. Blue Gene/L's Linpack score is
a scam. It allows the US to claim that it has reclaimed the world
leadership position in supercomputing (pfft!), while providing 1
Teraflop of average capacity for the world scientific community to
access.
Pardon me while I use the restroom.
RM
<snip>
>
>If there is money to be made now, and science and academic
>careers to be advanced now, using today's technology and
>algorithms, why on earth should people delay working
>on the problems with today's solutions just because we
>know that eventually, hopefully, we'll be able to do it
>so much better and faster?
>
Sometime within the last year, Andy Glew raised the question on
comp.arch of what was holding us back in the area of parallel
computation.
*No one* mentioned a lack of processing power. The consensus was: a
lack of solid, workable *software* tools. That on a forum dedicated
to hardware.
In a different forum, someone wrote in to ask about how he could get
his biotech matlab calculations to run faster because each was taking
a week.
I suggested he consider a cluster. He posted back that parallel
Matlab was a subject of research, not a tool for working scientists,
that the program he was using represented man-years of development
effort, and that he had neither the skill nor the time to retool it
for use on even a two-processor machine. He used his two processor
box to run two instances of the program, so he could get two separate
simulations done each week.
I did a little googling and found that someone had actually
implemented the program on a beowulf cluster and gave him the link.
He was amazed that someone had done such a thing and that it could be
found so quickly, but, as it turns out, the beowulf implementation I
had found was only a partially functional version of the program and
did not have the functionality he needed.
A petaflop machine would not help that poor fellow get his Ph.D.
thesis written one day sooner.
If the DOE, or anyone else, were providing a realistic level of
funding for basic research in parallel computation, I would not be so
offended at their throwing however many million at just another big
machine.
As it is, basic research on parallel computation in the US is not
being funded at anywhere near the level it should be, and I _am_
offended by projects like Blue Gene/L.
If you've got a neato idea, and it will make zillions of bucks in
biotech, go get some venture capital and do it. If your idea is for
real, you'll find the money.
If you want to *understand* parallel computation and develop tools so
that others can do their work more easily, be prepared to make a
selfless sacrifice to humankind's fund of knowledge.
In the process of understanding how to program parallel machines, you
will, nearly for free, get a lot of insight into how to build better
parallel machines.
For all the bilge and bother of this thread, I'm left with a very
basic question: can you get the energy performance out of a classical
architecture that you can get out of a streaming architecture. Two
posters have stated without proof that they *know* that a streaming
architecture won't beat a classical architecture on realistic code.
I'm glad that they are possessed of such a profound and instantaneous
grasp of all that is possible in computation. I'm a little slower,
and I think most of the rest of the world is, too, and I'd like to see
a little more money go into questions like that and a lot less into
high Linpack scores.
RM
There is the "wrong turn" of spending too much time on your tools, and
not getting your real work done. There is the equally wrong turn of not
spending enough time on your tools, and this guy belongs in this category.
As you say, he couldn't even be bothered to search for possible parallel
implementations of what he needed. If he was working on a PhD, he should
be changing advisor, as his is not doing the job properly.
> If the DOE, or anyone else, were providing a realistic level of
> funding for basic research in parallel computation, I would not be so
> offended at their throwing however many million at just another big
> machine.
I believe they did, but not very much came out of it. It takes very
strong leadership and high expertise to get the correct balance in such
a program of research - be experimental, follow new ideas, but don't
fall for fads. See David de Nucci's reported experience in finding
funding. And such a program needs to be backed up with sufficient hardware
support.
> If you've got a neato idea, and it will make zillions of bucks in
> biotech, go get some venture capital and do it. If your idea is for
> real, you'll find the money.
In the current economic climate? No way no how.
> For all the bilge and bother of this thread, I'm left with a very
> basic question: can you get the energy performance out of a classical
> architecture that you can get out of a streaming architecture.
Define "classical".
> Two posters have stated without proof that they *know* that a streaming
> architecture won't beat a classical architecture on realistic code.
From your description, "streaming" basically means re-using operands as
much as possible, by passing the result(s) of one FU to the next FU with
as little intermediate storage as possible. What the "two posters" told
you was that current microprocessors already do that if the programmer
can make it happen, and the energy cost of reading and writing registers
and L1 is negligible compared to other costs.
So the difference should be made at intermediate and long distances of
communication - L2/L3, off-chip/memory, etc. But then, we argue, BlueGene
is already a strong departure from the model of the "classical"
microprocessor - whatever that may mean in detail. As was the transputer,
for instance, or the ICL DAP. Or, indeed, the MasPar, the CNAPS, and
the venerable CM-1.
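A trivial source-level analogue of "pass the result of one FU to the
next with as little intermediate storage as possible" (illustrative C
only, nothing to do with BlueGene itself):

#include <stddef.h>

/* Two passes with an intermediate array: every partial result takes a
 * round trip through memory before it is consumed. */
void two_pass(const double *x, double *tmp, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) tmp[i] = x[i] * x[i];
    for (size_t i = 0; i < n; i++) y[i]   = tmp[i] + 1.0;
}

/* Fused: each result stays in a register and is forwarded straight to
 * the next operation, which is what the streaming argument asks for
 * and what an out-of-order core's bypass network already does within
 * its window. */
void fused(const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = x[i] * x[i];
        y[i] = t + 1.0;
    }
}

Whether the interesting codes can be rewritten in the fused style at
realistic scale is, of course, the whole question.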
And we agree that the real problem is on the software side. But I, at
least, disagree that it's a problem of theoretical research: you need
to get people involved with the new ideas and the skills in models of
parallel computation, _and_ the people developing the applications- but
see above on the readiness of the application side to get involved.
All too often in computer "science", things have been tried on a toy
scale and found to be "good" (at least good enough for yet another M.S.
or PhD), but to collapse on first contact with the "real world". With
a background in image processing, I can tell you that this is a rampant
disease. Also see my previous remark about the perceived value of
interdisciplinary research. You need to change the cultural values of these
areas (!) of science before you'll achieve anything.
Jan
You're running right past the point.
Large scale generalized parallel computation is a known hard
problem which is not solved. Known, acknowledged, and it is in
fact receiving a fair amount of R&D effort in the 'computer science'
corner of things. Plenty of people thinking about it.
Hard problem. Not moving very fast.
Shrug. People working on it. Bright people. Not my thing;
definitely valuable but it either happens or it doesn't.
Probably will. Hopefully sooner rather than later.
In the meantime: existing MPP and cluster systems exist,
and existing MPP and cluster software exists, and there are
applications out there using those systems and that software.
Applications which are doing important things for science,
making some people money, helping us avoid nuclear testing.
In many cases these are applications such as atmospheric
modeling or finite element analysis where the problem is
inherently very attackable by a MPP system, without
much software/algorithms pain and suffering.
Many of those applications run much better on bigger MPP boxes.
These systems are not being built and then sitting idle.
They're being used, in production, by real researchers,
real scientists, real engineers. Making real money and
producing real papers. And they'll make more real money
and more papers with more CPU power / more nodes / bigger
MPP boxes in a lot of cases.
Some of those applications that scale up nicely include
brute-forcing some of the things that we would really
prefer to find better parallel algorithms for.
But some of them do scale up as you increase the force
applied in a brute force approach.
>[...]
>For all the bilge and bother of this thread, I'm left with a very
>basic question: can you get the energy performance out of a classical
>architecture that you can get out of a streaming architecture. Two
>posters have stated without proof that they *know* that a streaming
>architecture won't beat a classical architecture on realistic code.
If you build a streaming architecture and can't program it,
the energy performance is meaningless.
If we could program it, we could program a lot of other things
more efficiently, the MPP systems would jump in performance,
and money would come available rapidly to build streaming
architectures if they were demonstrated to then be superior
using those better algorithms. But if their entire viability
is predicated on software and algorithm and methods developments
which have not happened yet, there is ZERO reason to start
working on hardware now.
>I'm glad that they are possessed of such a profound and instantaneous
>grasp of all that is possible in computation. I'm a little slower,
>and I think most of the rest of the world is, too, and I'd like to see
>a little more money go into questions like that and a lot less into
>high Linpack scores.
Those Linpack scores are not going into systems going
onto people's shelves to make nice blinkylight boxes.
For the most part, people are using them for real
hard problems, and they're solving things with them.
This is where you fail.
Those systems and problems do not encompass the whole
range we'd like to be working on. True.
But they're working on real, important problems,
and making real, important progress. And even though
it's inefficient and unoptimized, it works, and is the
best path forwards that we have available for the near
term for those problems.
You keep denying the existence of that subset of the
total problemspace which they're being used for.
You need to look around some more. They. Are. In. Use.
Building machines for users who use them is a
valuable activity. And it is not an activity that
should be put on hold pending speculative developments
on the theoretical side of computer science in making
parallelism work much better.
-george william herbert
gher...@retro.com
DARPA pumped money into this for years. The result was, essentially,
that everyone working on anything but tightly-coupled multiprocessors
built to run multiuser Unix workloads faster or special-purpose SIMD
machines threw up their hands, felt embarassed about how much money
they'd wasted, and went home. Seen a CM2 -- or anything like it --
lately? What's the last program *you* ran into that was written in
||c? Do you remember what DADO and NONVON were? The people working
in the area were no dummies (look what David Shaw's done since) and
they certainly didn't lack funding, but ultimately they beat
themselves up against some very hard problems for long enough, and
that was that.
The ubiquity of clusters, which amusingly share some of the same
constraints as the early massively-parallel machines funded by DARPA
in the 1980s -- a small fraction of the total processing power and
memory at each node, a relatively slow/high latency interconnect --
has led to a resurgence of interest in efficient parallel algorithms,
efficient techniques for programming parallel machines, etc. etc.,
but the problems are still just as hard, and the last time a huge
amount of money was pumped into this, the results were not all that
impressive. Having been around one of these projects the last time
through, I would not be so sure as you seem to be that the real
problem in parallel computing is that nobody is throwing money at
the software problems.
--
Thor Lancelot Simon t...@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
Norway also has the legal means, in the form of the 'Smittebærer' (lit:
Contagion Carrier) act instituted when tuberculosis was a big problem
(or should I say, the last time TB couldn't be treated reliably with
antibiotics?).
However, when the first whisper about using it against HIV/AIDS was
heard, there was such an outcry that it was politically DOA.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
[SNIP]
> see 50 Teraflops or anything like it. Blue Gene/L's Linpack score is
> a scam. It allows the US to claim that it has reclaimed the world
> leadership position in supercomputing (pfft!), while providing 1
> Teraflop of average capacity for the world scientific community to
> access.
I'm sure IBM will sell you a 64K node machine if you ask
them. You might need to mug Bill Gates or something though.
Hell it might even be possible to persuade BillG to back
some research.
I suspect what would happen is that a bunch of guys with a
big pile of mag-tapes would turn up in black helicopters
and commandeer your machine despite your protests that an
American's Garage is his Castle. :(
Cheers,
Rupert
Even stronger: Since the P6, current cpus have worked as well as they do
by mostly forwarding results directly from one operation to the next,
without even making the detour via the register bank, much less L1 or
any other level of cache.
I.e. for the last 8 years, our cpus have more or less effectively turned
regular programs into dataflow graphs.
The increasing imbalance between computation and transportation costs
(not just memory access, AKA "The Memory Wall"), in time & energy, will
force this trend to continue.
It is _not_ as if nobody cares about RM's main beef.
[SNIP]
> they'd wasted, and went home. Seen a CM2 -- or anything like it --
> lately? What's the last program *you* ran into that was written in
CM2s looked like freak machines from the point of view
of this particular Transputer fan-boy. Interesting, but I
didn't feel they really made a breakthrough in scalability.
The history of Thinking Machines and the motivations behind
those machines make for interesting reading.
Unsurprisingly BlueGene/L gives this particular Transputer
fan-boy the warm fuzzies. :)
Cheers,
Rupert
>Sometime within the last year, Andy Glew raised the question on
>comp.arch of what was holding us back in the area of parallel
>computation.
>
>*No one* mentioned a lack of processing power. The consensus was: a
>lack of solid, workable *software* tools. That on a forum dedicated
>to hardware.
_I_ mentioned lack of processing power. I specifically said that as a
matter of empirical fact, for all the talk about the difficulty of
programming them, whenever people can get their hands on parallel
machines, they find good uses for them. I know I could happily make
use of a trillion-node machine if I could afford one.
>For all the bilge and bother of this thread, I'm left with a very
>basic question: can you get the energy performance out of a classical
>architecture that you can get out of a streaming architecture. Two
>posters have stated without proof that they *know* that a streaming
>architecture won't beat a classical architecture on realistic code.
I took them to be stating that if you have an algorithm that can take
advantage of a streaming architecture, it will also get a performance
boost on a classical architecture, so let's work on the algorithms
first, to make sure the streaming chips will be put to good use when
we build them. Which makes sense to me, and I'm still interested in
looking at what the difficulties are with doing this.
>I'm glad that they are possessed of such a profound and instantaneous
>grasp of all that is possible in computation. I'm a little slower,
>and I think most of the rest of the world is, too, and I'd like to see
>a little more money go into questions like that
By all means.
>and a lot less into
>high Linpack scores.
This is where we part company. If we never spent resources on actually
_doing_ things with the technology of the day (however crude and
inefficient it may be compared to what's ultimately possible), and
always waited for an efficient solution, we'd still be banging the
rocks together while we waited for someone to come up with a clean,
efficient way of making bronze.
--
"Sore wa himitsu desu."
To reply by email, remove
the small snack from address.
http://www.esatclear.ie/~rwallace
I've got some pills that can help you with that! National Max-VX!
Guaranteed to increase both the girth and length of your national
penis!
- Registered with the FDA and the EPA!
- As seen on Oprah and Jerry Springer!
- Minimum GDP (Gross Domestic Penis) size increase of 30%!
-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
I was struck by the following paragraph.
--------------------------------------------------
More generally, thinking on algorithm and system design is still dominated
by the invalid perception that compute units are the most expensive resource
in a computing system. Applications are designed to minimize the number of
operations required to solve a problem. Systems are designed with enough
memory and I/O to achieve a high usage of the compute units, even if this
leads to low memory or disk utilization. In fact, the cost of systems is
dominated by storage cost, and by the cost of the logic required to move
data around (caches and buses). If one thinks anew from basic principles,
the conclusion will often be that algorithms that use more computing but
less memory, or use more computing but require less communication, coupled
with systems designed to ensure more effective use of memory and
communication, even at the expense of lower utilization of the processing
units, are more cost-effective than conventional algorithms and systems.
-------------------------------------------------------
This is from 1999. Is this the paper referred to earlier?
del cecchi
>In just rereading the IBM J. of Research and Development article
I referred to the document and provided a link to allen.pdf in a Nov
24 post responding to Patrick Schaaf. In my post I tried to give both
IBM and the creators of the document their due for being both candid
and thorough. I was critical of the document for not reaching the
conclusion that I think a presenter not eager to make a sale would
reach; viz, that while we know a lot about the problem and we (IBM) are
very competent at designing big machines, the machine we (IBM) are
proposing is not equal to the problem we are proposing.
You should understand that the perspective I have been expressing is
not uniquely directed at Blue Gene/L. I have already made some very
strongly critical comments about the Space Shuttle. I was not all
that unhappy when the Superconducting Super Collider was cancelled
because, as I understood it at the time, there was a significant
chance that it would return a null result on the questions that
justified such a large expenditure on basic science: the top quark and
the Higgs boson.
I preferred then and I prefer now to rely more heavily on the natural
particle accelerator that the universe has provided to us and to get
as much instrumentation as possible above the atmosphere so we can
catch the universe in the act as it reveals its secrets to us without
our having to perform unnatural acts on terra firma.
Instead of people asking the question: how much time and money would
we need to do this right, they ask an oddly inverted question: given a
schedule and a budget imposed by considerations that largely have
nothing to do with science, what can we build and what can we
plausibly claim it will do?
The only magic that will suddenly appear at 50Teraflops is that Japan
will become no. 2 and we will become no. 1 again. That's not a good
way of managing the finite resources available to science.
To get back to the IBM document, I would have been very impressed if
an integrator of big systems had followed the paragraph you quoted
with another paragraph that said: "Since we know almost nothing about
how to construct algorithms that achieve what we have identified as
desirable goals, we therefore know almost nothing about how to
accommodate such algorithms with hardware. Before investing a large
sum in a small number of large systems, we should spend x years and y
dollars working with a larger number of prototype systems so we can
get a better handle on what kind of large system we really should be
building."
RM
Several times I have heard people say "of course, the limit,
the ultimate, is processor in memory - if the computation
were actually free, it would be done next to the memory
containing the data to be operated on".
I think this is NOT true.
Most computation involves more than one data item.
I think this implies that the optimum place to perform
a computation will be at something resembling the
centroid of the data items involved - weighted by
access frequency. With some allowance for treating
instructions as data. During the life of a computation
this "computational centroid" would shift over time.
Parallel computations would be performed at something
resembling a Viterbi constellation diagram.
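To make the centroid idea concrete, here is a toy sketch (the 2-D tile
coordinates, access counts and weighting below are all made up by me for
illustration; nothing here comes from a real design):

/* Toy illustration of the "computational centroid": given the data items
   a computation touches, each with an (x, y) location on some 2-D tile
   grid and an access frequency, the cheapest place to compute is roughly
   the frequency-weighted centroid of those locations. */
#include <stdio.h>

struct operand {
    double x, y;      /* location of the data item on the chip/tile grid */
    double accesses;  /* how often the computation touches it            */
};

static void centroid(const struct operand *ops, int n,
                     double *cx, double *cy)
{
    double sx = 0.0, sy = 0.0, w = 0.0;
    for (int i = 0; i < n; i++) {
        sx += ops[i].x * ops[i].accesses;
        sy += ops[i].y * ops[i].accesses;
        w  += ops[i].accesses;
    }
    *cx = sx / w;
    *cy = sy / w;
}

int main(void)
{
    /* two hot operands near tile (0,0), one cold one far away */
    struct operand ops[] = { {0, 0, 100}, {1, 0, 80}, {7, 7, 5} };
    double cx, cy;
    centroid(ops, 3, &cx, &cy);
    printf("computational centroid ~ (%.2f, %.2f)\n", cx, cy);
    return 0;
}

Re-running this as the access frequencies change over the life of the
computation is what makes the centroid drift.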
---
Re streams: would you build a stream instruction set, or a
vector instruction set?
Obviously, vectors can be built on top of streams.
Vice versa is a bit harder.
Unfortunately, this is not true for streaming architectures,
and it is not true for reconfigurable logic.
Rather, the output of one functional unit feeds into a switching network
that eventually feeds into the input of the other. Where said switching
network, if you are going to have more than 4 integer ALUs on a chip[*]
probably involves at least 1, more likely 2, right hand turns - i.e. where
the interconnect length is probably 8-16 times longer than that involved
in a single integer ALU. Which, if you are being aggressive in clocking,
and which if you are limited by wire speed, could very easily translate
to 1 clock in the ALU, 4 cycles to any other ALU that is not directly
stacked on top of the first.
If talking about FMACs rather than integer ALUs, the ratio of compute
to communicate even in such a "FU to FU" microarchitecture is closer to
1 to 1. Although my understanding is that some of the bio-informatics
and genomics codes are integer; although protein folding should be FP,
if based on real physical modelling.
[*] Actually, this > 4:1 ratio of communicate to compute only applies if
you have to communicate sideways. If you can arrange your ALUs
one on top of the other (I call this vertically stacked, because I draw
my datapaths top to bottom, although I am slowly learning to draw
them horizontally), then communication between neighbouring ALUs
can be fast. It usually isn't very feasible to bit interleave more than
4 wide in such a stack. With clever layout, you could probably make the stacks
arbitrarily high; but you would need to get operands in at the side,
which implies either a right hand turn, or a register file or cache block
in the stack; in either case, something that must be routed over.
Every way I look at this, I end up with the goal being a squarish block
of computational logic - say something like a 4 high interleaved stack
of integer ALUs, 1 or 2 FMACs - separated from other computational
squares by storage and routing. Fast, 1 cycle communication between ALUs
in such a square; 4 cycle communication at best to other squares.
The exact number and shape of computational functional units that can be
squeezed into such a square varies according to your technology, as does the
ratio of compute to communication speed. 1:4 is middling; it's probably
closer to 1:1 or 1:2 today for everyone except me (I'm willing to throw in
tricks like redundant arithmetic to make the compute faster). It'll probably
be more than 1:4, approaching 1:16, in the future.
If you can keep all of your FUs busy all of the time, such ratios are
tolerable.
If not, then you can gain something by routing computation back to
the functional unit that produced the result (hmm, I suppose that's like
a recurrence). Enough slack, and it's not worth streaming from
one spatially distributed functional unit to another at all.
I keep trying to figure out a formula for when it makes sense to
go to a spatially distributed computation architecture.
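For what it's worth, the crudest model I can write down looks like this
(every parameter, name and number below is an assumption of mine, chosen
only to illustrate the shape of the trade-off):

/* Very rough toy model (all numbers invented) of when it pays to spread a
   K-stage dependent computation across K spatially separated FUs rather
   than recirculating through one FU.  M = number of independent stream
   elements, t_fu = cycles per op, t_hop = extra cycles per FU-to-FU hop. */
#include <stdio.h>

static double recirculate(double M, double K, double t_fu)
{
    /* one FU does all K stages for every element, one op at a time */
    return M * K * t_fu;
}

static double distributed(double M, double K, double t_fu, double t_hop)
{
    /* fill the K-stage spatial pipeline once, then one element per t_fu */
    return K * (t_fu + t_hop) + (M - 1.0) * t_fu;
}

int main(void)
{
    double K = 8, t_fu = 1, t_hop = 4;   /* assumed 1:4 compute:hop ratio */
    for (double M = 1; M <= 64; M *= 2)
        printf("M=%4.0f  recirc=%6.0f  spatial=%6.0f\n",
               M, recirculate(M, K, t_fu),
               distributed(M, K, t_fu, t_hop));
    return 0;
}

In that toy model the spatially distributed pipeline wins as soon as there
are more than a handful of independent elements in flight to cover the hop
latency; the hard part, as above, is what happens when the FUs are not kept
busy.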
---
The above resembles a line of thought that I was preparing for the
Microprocessor Forum panel session that pitted me and a guy from Sun
(the CPU fogies) against advocates of spatially distributed computation
- Bill Dally of Imagine/Streaming Supercomputer, Doug Burger of TRIPS,
Robert Mykland of Ascenium. Unfortunately, the panel session was
just a soundbite fest; there was no time for technical discussion.
===
Don't get me wrong: I *like* streams. Have liked them since I encountered
them on William Wulf's WM-1 and WM-2. Where streams apply, I think
they make a lot more sense than thrashing an oblivious cache.
I also like the idea of "the output of one functional unit feeds right into
the
input of another". I'm just not quite sure how to arrange it, and make it
general purpose or reconfigurable, without losing a lot of efficiency.
For big enough problems the loss of efficiency is worthwhile, but if you
have
a narrow straw to main memory, I don't think it makes sense. Yet.
(If you have a big straw, or if you can stream out of cache - half of the
goodness of Imagine, IMHO - then maybe it makes sense right now.)
Not true. If the data is used soon,
it gets caught on the bypass path.
Typically more than half of all operands.
With a bit of work, using what I call
a bypass cache, more still.
Registers are not a performance limiter, because
they are bypassed. They may be wasting power.
Dead value elimination misses part of this.
Amen Brother Del! Or Brother Allen per Brother Del.
The best way to design computers nowadays is to start
off from the memory subsystem, figure out the best cache
(or streaming, Brother Robert) hierarchy you can build,
match the register file, bypass network, and scheduler,
and only lastly to figure out how many ALUs and FMACs
are needed to consume everything that you can feed them.
The post by me that Robert refers to must have been one of those
times I dropped onto comp.arch for a bit, but never read the replies
to the post. (Like today: I'm posting because I am in a hotel room
with nothing else to do (and no daughter to insist that I play with her))
Anyway, first let me suggest, just because I like being the devil's
advocate, that it *is* lack of processing power that is holding us
back in the area of parallel computation. Specifically, the processing
power needed to analyze, on the fly and dynamically, non-parallel
user programs, and run them in the most parallel manner possible.
I'm deadly serious about this; but now let me talk about explicit parallel
software.
Jack Dennis (one of the fathers of dataflow) gave a workshop talk
at PACT in Paris circa 1997 where he said that the problem with
explicit parallel programming was that it exposes global resource
management issues to the programmer of LIBRARY modules.
I.e. it violates modularity.
E.g. you are on a 32 processor system. Your application won't
run well if it is parallelized to use more than 32 threads.
Say you have 4 threads to start with. Each calls a library routine...
but this library routine is easy to parallelize. If each of the instances
of the library routine on each thread invokes 7 more threads,
giving a total of 32 threads, great; but if each invokes 16 threads,
giving a total of 64 threads on a 32 processor system, you lose.
I think Dennis is 100% correct. Ease of software development
nearly always wins out over performance. Parallel systems need
to support software modularity.
There have been vague attempts to do this by making global
resource management information available to the library module
- how many threads should I fork? But I think the right way to
do this, the only way that stands a chance long term, is to virtualize
the parallelism - to create virtual threads that can be created and
destroyed cheaply and at will, which do not thrash each other
when run on a system with fewer physical CPUs.
Some threading software systems do this - create "task descriptors"
that are multiplexed onto the actual CPUs. Make the task descriptors
lighter even than a pthread like user level thread.
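A minimal pthreads sketch of what I mean by multiplexing task descriptors
onto the physical CPUs (all names, sizes and the queue itself are invented
for illustration; a real system would make the descriptors far cheaper and
the queue smarter):

/* cc -std=c99 sketch.c -lpthread
   Sketch of "virtual threads": cheap task descriptors multiplexed onto a
   fixed pool of worker threads, one per physical CPU, so that library code
   can spawn work freely without oversubscribing the machine. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS   4                 /* pretend the machine has 4 processors */
#define MAXTASK 1024

struct task { void (*fn)(void *); void *arg; };

static struct task queue[MAXTASK];
static int head, tail, done;
static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void spawn_task(void (*fn)(void *), void *arg)  /* cheap: no thread created */
{
    pthread_mutex_lock(&mx);
    queue[tail++ % MAXTASK] = (struct task){ fn, arg };
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mx);
}

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&mx);
        while (head == tail && !done)
            pthread_cond_wait(&cv, &mx);
        if (head == tail && done) { pthread_mutex_unlock(&mx); return NULL; }
        struct task t = queue[head++ % MAXTASK];
        pthread_mutex_unlock(&mx);
        t.fn(t.arg);            /* run the "virtual thread" to completion */
    }
}

static void work(void *arg) { printf("task %d\n", (int)(intptr_t)arg); }

int main(void)
{
    pthread_t w[NCPUS];
    for (int i = 0; i < NCPUS; i++)
        pthread_create(&w[i], NULL, worker, NULL);
    for (intptr_t i = 0; i < 64; i++)       /* 64 "threads" on 4 CPUs: fine */
        spawn_task(work, (void *)i);
    pthread_mutex_lock(&mx);
    done = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mx);
    for (int i = 0; i < NCPUS; i++)
        pthread_join(w[i], NULL);
    return 0;
}

The point is that library code calls spawn_task() as freely as it likes;
the number of real threads stays pinned to the number of processors.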
I think that there may be a role for hardware to play here.
Most of the other problems with parallel programming are of the same
ilk: explicit parallelism requires global management of resources,
and requires performance programming from the very start.
Some great minds in POK :-) are way ahead of you (maybe) with the LPAR
on zSeries. At least if I understand correctly, one can have thousands
of "virtual machines" or Logical Partitions on a processor with many
fewer physical processors.
I don't know how dynamic this is or the overhead.
del
>
I don't know about the government stealing your machine. They can buy
their own.
Come on. don't be so cheap. And if you don't have room in your garage,
I think we have just the spot or spots for it in Rochester.
<snip>
>
>Most computation involves more than one data item.
>I think this implies that the optimum place to perform
>a computation will be at something resembling the
>centroid of the data items involved - weighted by
>access frequency. With some allowance for treating
>instructions as data. During the life of a computation
>this "computational centroid" would shift over time.
>
>Parallel computations would be performed at something
>resembling a Viterbi constellation diagram.
>
The ADAM architecture described in Andrew "Bunnie" Huang's MIT Ph.D.
thesis attempts to do something like that. I've heard him give a talk
on it and read it more than once, but I have a ways to go before I
understand it well enough to give a summary better than what I think
is in the title; viz, you try to keep data and the threads using them
close to each other by migrating both.
>---
>
>Re streams: would you build a stream instruction set, or a
>vector instruction set?
>
>Obviously, vectors can be built on top of streams.
>Vice versa is a bit harder.
>
I'm not entirely certain I know the difference, in part because I'm no
longer certain I know what a vector processor is. On the old Crays,
a "vector processor" was really just a pipeline processor. On
Itanium, at least, you can explicitly define a software pipeline, and
I'm not certain what to call the resulting process; is it a vector
process, a streaming process, or neither?
SIMD is more like what a lot of people who didn't really understand the
old Crays thought they were doing. You can stream SIMD instructions,
too. Is that a streaming process built on top of a vector
architecture?
From my point of view, the real issue got exposed in my exchange with
Bill Todd: single CPU's already understand the wisdom of the Merrimac
paper and can already do tricks that look an awful lot like streaming
or vector processing or both. It's when you put multiple processors
on a single die that the difference becomes apparent. The key step
seems to be to have the kind of bypass path connecting CPUs that now
apparently connects functional units on a single CPU. On a single
CPU, the bypass path allows you to bypass the register file. With
multiple CPUs, it would allow you to bypass shared cache. How to
make such a path available and how to tell the processors to use it is
well beyond my capacity even to hand-wave.
And, of course, if you have a large die, possibly so large that you no
longer expect to be able to reach all of it in a single clock, you
have to start worrying about the global movement of data among
processors. If you can arrange the calculation as a streaming
process, then organizing the movement of data is a problem that solves
itself: data might eventually migrate across the entire die in many
clock ticks, after having gone through multiple neighboring CPUs to do
so, possibly without ever having languished in cache, and certainly
without ever having required access to global bandwidth or global
cache.
RM
"Brother" Fran Allen?
-- Dave
>> The Merrimac authors were not referring to bandwidth to memory or even
>> to cache. They were talking about bandwidth within the CPU itself.
>> In a streaming architecture there is no getting and putting; the
>> output of one functional unit feeds right into the input of another.
>
>Unfortunately, this is not true for streaming architectures,
>and it is not true for reconfigurable logic.
>
>Rather, the output of one functional unit feeds into a switching network
>that eventually feeds into the input of the other. Where said switching
>network, if you are going to have more than 4 integer ALUs on a chip[*]
>probably involves at least 1, more likely 2, right hand turns - i.e. where
>the interconnect length is probably 8-16 times longer than that involved
>in a single integer ALU. Which, if you are being aggressive in clocking,
>and which if you are limited by wire speed, could very easily translate
>to 1 clock in the ALU, 4 cycles to any other ALU that is not directly
>stacked on top of the first.
>
>If talking about FMACs rather than integer ALUs, the ratio of compute
>to communicate even in such a "FU to FU" microarchitecture is closer to
>1 to 1. Although my understanding is that some of the bio-informatics
>and genomics codes are integer; although protein folding should be FP,
>if based on real physical modelling.
>
I'm going to play along gamely and expose my ignorance in the hope of
jollying my own understanding along, and perhaps that of a few others.
It seems to me that you are making a distinction without a difference.
If you are streaming, you stream to whatever you can reach in the next
clock. If it's another functional unit, great. If it's an element of
a switching network, so what? From my (naive) POV, the switching
network is just as transparent as the repeaters that are necessary
actually to get data to move any real distance.
The point is, you never put anything anywhere--cache or register--just
to sit and wait. It's always on its way somewhere, and you can
pipeline movement through the switching network just like you can
pipeline movement through functional units. What you are talking
about affects latency: how long it takes to get the first result to
pop out of the end of the stream, but it does not necessarily affect
throughput. Once you get the first result out, there is no reason
whatsoever that a properly designed streaming architecture cannot get
out a new result with every clock, no matter how many functional units
or network switches or repeaters you've had to go through to get
there.
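To put a number on that (this is just the standard pipeline arithmetic,
not anything specific to Merrimac or Imagine): if a stream of n results has
to traverse D one-clock stages in total, with functional units, switch hops
and repeaters all counted alike, then

\[
T_{\text{total}} = (D + n - 1)\,t_{\text{clk}}, \qquad
\text{latency} = D\,t_{\text{clk}}, \qquad
\text{throughput} = \frac{n}{T_{\text{total}}} \;\to\; \frac{1}{t_{\text{clk}}}
\ \text{as } n \to \infty,
\]

so every extra switch or repeater adds to D and hence to latency, but once
the pipe is full you still retire one result per clock.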
As to communication cost, you can't do any better than the cost of
moving through the hardware you *have* to move through to get from one
end of the pipe to the other, and you don't *have* to sit in register
files or cache.
RM
My understanding is that creating the virtual machines is quite expensive.
And my impression was that no one publicly characterized the workload
that such thousands of "virtual machines" could easily handle. Did I overlook
some publication there? Meanwhile, I imagine thousands of webserver instances,
most of them getting about one request every minute, or something like that.
i.e. something akin to "I can run 100000 threads in parallel, as long as they
all usually wait for some dummy to hit a key".
Sorry if this sounds negative; I'd love to read more detailed reports
than "you can have thousands of virtual machines".
best regards
Patrick
The allowance for treating instructions as data may
be unnecessary; assuming that programs aren't self-modifying,
the typical mode of large parallel processing seems to be
huge datasets being run with code that is sized on the
order of magnitude of the L2/L3 E$ available now.
Distributing local copies of the instructions to the
sea of processors is probably cheap and not too difficult.
Even for something today like Blue Gene, 4mm^2 will get you
about a megabyte of EDRAM on CU-08 from IBM's foundries.
Building using 150 mm^2 chips you should be able to fit
16 sets of (PPC440+big FPUs+1meg EDRAM) per chip.
Assuming that a goodly chunk of current working data
plus the code core fits within a megabyte, of course.
The problem sets would need finer grained analysis.
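A quick sanity check on that arithmetic (my own rounding, nothing more):

\[
16 \times 4\,\text{mm}^2 = 64\,\text{mm}^2 \text{ of EDRAM}, \qquad
150 - 64 = 86\,\text{mm}^2 \;\Rightarrow\; \approx 5.4\,\text{mm}^2
\text{ per slice},
\]

i.e. a budget on the order of 5 mm^2 per (PPC440 + FPUs) slice before any
network or pad overhead, which is the kind of number the finer-grained
analysis would have to confirm.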
The point that you make about data and computation
locality within the dataset is a good one though.
And feeds in to a variant of the issue you bring
up in another posting, of parallelism and hiding
or exposing system resources. The layout of the
data sets within the system's physical memory
will hugely affect the behaviour of such computational
models... the decision of how to clump data,
and where to put specific parts of it, will be
a major optimization depending completely on
what the actual system topology and latencies
are for large parallel systems.
There are already people edging towards acknowledging
that from the OS and hardware side. I don't know what
other manufacturers are up to, but Sun's been talking
about enhancements in future OSes to enable the system
to migrate RAM contents closer to where it's being used.
-george william herbert
gher...@retro.com
Well, yes, but ....
Back in the mid-1970s, that was heresy - though it was actually first
predicted in the 1960s! But everyone with Half a Clue has known it
since the 1980s - though that doesn't dispute what the abstract says,
given the proportion of the Clueless :-(
However, the abstract contains one error of fact - storage cost is
NOT always the dominating factor, though I can see why an IBM article
would say that it is. It isn't rare for communication latency to be
the dominating factor in cost, and I am not just talking about HPC.
Also, I think that you aren't being radical enough. I would start
off with the memory access MODEL, because it is clear that the current
one badly needs rethinking. And this could easily require application
redesign, as that abstract says.
Regards,
Nick Maclaren.
Just so.
> If you can arrange the calculation as a streaming
> process, then organizing the movement of data is a problem that solves
> itself: data might eventually migrate across the entire die in many
> clock ticks, after having gone through multiple neighboring CPUs to do
> so, possibly without ever having languished in cache, and certainly
> without ever having required access to global bandwidth or global cache.
I expect you know the philosophy underlying multigrid methods...methinks
there are some similarities here.
Jan
<snip>
>
>I'm sure IBM will sell you a 64K node machine if you ask
>them. You might need to mug Bill Gates or something though.
>Hell it might even be possible to persuede BillG to back
>some research.
>
I'm sure your comment was made completely in jest, but, as far as I
can tell, Microsoft has lost the supercomputing market about as
completely as it is possible for a software vendor to be shut out of a
market.
If I wanted to sell my soul, Microsoft might *be* interested in
backing research on a supercomputer application that would wind up on
a computer using a Microsoft OS.
In order to do that in a way that wouldn't cut you off from almost all
of the world research community, you'd probably wind up doing
everything under Cygwin (with a corresponding loss of efficiency), and
I have a feeling they would catch on sooner or later.
Who knows. They might not care, just as long as the demo was flying
the windows banner. Have to give that some thought. ;-).
>I suspect what would happen is that a bunch of guys with a
>big pile of mag-tapes would turn up in black helicopters
>and commandeer your machine despite your protests that an
>American's Garage is his Castle. :(
>
I've often wondered if weirdos like me, with a serious background in
stuff they don't want people to know about, but no longer under their
direct watchful gaze, ever show up on their radar screen. Maybe I
should drum up a contract so the DIS can have a legitimate reason to
rummage around and confirm that all those kilowatts flowing into my
house as electricity and going out through the roof as heat are not
going into the growing of illicit substances or the result of other
subversive activity. As one correspondent recently remarked, it's all
enough to make you nostalgic for the days when September was just a
month, and not a state of mind.
RM
>Cheers,
>Rupert
>
http://www.microsoft.com/windows2000/hpc/default.asp
the Cornell Theory Center http://www.ctc-hpc.com/
http://www.tc.cornell.edu/
seems to do a lot of Windows based high performance
cluster computing.
Paid for by grants:
http://www.tc.cornell.edu/news/releases/2002/hps.asp
Not clear if you would call it supercomputing.
I find it interesting how much emphasis there is on
non-science and engineering computation,
such as financial risk analysis.
I ported PLAPACK without the use of cygwin, and many other tools were
native years ago. I imagine things have gotten better since.
http://www.microsoft.com/windows2000/hpc/toolkit.asp
--
George Coulouris
not speaking for ncbi
remove 's' from my address to reply
Doesn't have to be. The way VM/370 through z/VM are normally used, it is
relatively costly (create virtual config from directory entries, allocate
all relevant control blocks, IPL a guest OS etc.) -- but then again it is
the creation of a whole virtual computer, and isn't done very often. (I
should note that we're still only talking about a few seconds, including
guest IPL in many cases -- I wish other systems would come close.)
In the late 70s we had a modified VM/370 at Yorktown that supported spawning
of secondary virtual machines on the fly, using the parent's configuration
as a template -- sort of like Clone or Fork ("SPY" machines). This was quite
cheap, comparable to most other system calls (CP "diagnose"). Not as cheap
as user-level threads, but just what the doctor ordered for running subtasks
in their own, isolated, virtual machines.
As for VM overhead: problem-state (user-level) instrs execute at native speed,
and since the mid-80s (370/XA) so do the common cases of most privileged
instrs, thanks to SIE ("Interpretive Execution" mode, where hardware and
microcode have access to the guest state descriptor). The remaining overhead
depends on the guest OS, and (for my OS, which uses address translation and
multiple address spaces, but no CP services) seems to be a few %. I think
the overhead for CMS (no address translation, but significant use of CP
services) is similar, perhaps a few % higher. I don't know the numbers
for Linux or MVS-like guests. This small overhead is when resources are
not overcommitted, i.e. negligible host paging besides the quite efficient
block paging that may happen when an interactive VM is redispatched after
a long period of user "think" time.
Btw, the "thousands of guests" applies to VM/ESA or z/VM, not LPAR -- the
latter uses relatively static partitioning for maybe a dozen partitions,
possibly more on the newer machines, for essentially negligible overhead
for production-level guests.
Michel.
somebody said (sorry lost the attribution)
> >
> >OK, let's see some evidence that these radical architectural concepts
> >will actually work better on the problems people are interested in. I'm
> >an agnostic about wizbang new architectures.
>
> That's a fair challenge.
>
> RM
http://www.research.ibm.com/bluegene/BG_External_Presentation_January_2002.pdf
is a pitch that I just noticed, although it is not new. In there, on page
6, they show that there are two Blue Gene machines. One, Blue Gene/L, is
the one that is now partially running; that is, part of it is now running.
The other, Blue Gene/P, for Petaflop, is in the next-generation CMOS
technology and must be somewhat different to get to a petaflop from a mere
200 Teraflops.
Nothing earth shaking in the pitch, but sort of interesting.
del
In the days when most IBM employees still used VM as their primary computing
platform (early 90s), a large mainframe of the time could easily support
over 5000 simultaneously logged-on users, out of which several hundred would be
actively working at their terminals at any one moment, with sub-second
response times. There was also a "stunt" demo of 41,000 Linux instances on
one machine a couple of years ago, but that was not intended to be a realistic
use demo. But the model you mention -- thousands (not tens of thousands) of
independent Linux-hosted low-utilisation Web servers -- is a realistic use
in some environments, and I would think economically advantageous.
Careful management of sleeping resources is not a trivial task when wakeup
latency is an issue. (My dummy expects sub-second response time! With VM
I get it every day, even though the physical machine is 50 miles away. Oh,
that z/VM instance is in fact just sharing a physical machine, in an LPAR.)
Michel.
I have no idea why on earth anyone would want to do this (though I know
that Robert Myers has been beating this drum, too). Even if the smarts
existed to analyze non-parallel user programs and run them as parallel
as possible, the only reason I can think of to even try to do it "on the
fly" would be because it looks like hardware, and as Robert has also
observed, hardware is where the funding is going.
> I'm deadly serious about this; but now let me talk about explicit parallel
> software.
>
> Jack Dennis (one of the fathers of dataflow) gave a workshop talk
> at PACT in Paris circa 1997 where he said that the problem with
> explicit parallel programming was that it exposes global resource
> management issues to the programmer of LIBRARY modules.
> I.e. it violates modularity.
>
> E.g. you are on a 32 processor system. Your application won't
> run well if it is parallelized to use more than 32 threads.
> Say you have 4 threads to start with. Each calls a library routine...
> but this library routine is easy to parallelize. If each of the instances
> of the library routine on each thread invokes 7 more threads,
> giving a total of 32 threads, great; but if each invokes 16 threads,
> giving a total of 64 threads on a 32 processor system, you lose.
>
> I think Dennis is 100% correct. Ease of software development
> nearly always wins out over performance. Parallel systems need
> to support software modularity.
>
> There have been vague attempts to do this by making global
> resource management information available to the library module
> - how many threads should I fork? But I think the right way to
> do this, the only way that stands a chance long term, is to virtualize
> the parallelism - to create virtual threads that can be created and
> destroyed cheaply and at will, which do not thrash each other
> when run on a system with fewer physical CPUs.
a.k.a. variable granularity. The way to virtualize threads is to cut
them up into short-lived threadlets, and then make the overhead of
starting and stopping a threadlet, and moving data between threadlets,
VERY cheap--e.g. by removing any necessity to preserve state within the
threadlet itself, and the necessity to copy data to get it from one
threadlet to the next--yet still portable enough to run across
shared-nothing processors. With these sorts of traits, if multiple
threadlets end up on one processor, they effectively merge into a single
one--i.e. the granularity effectively increases. This is what I've been
saying for a decade, and the reasoning behind F-Nets and Software
Cabling (which are, in a way, a distant descendant of some of Jack
Dennis's work).
> Some threading software systems do this - create "task descriptors"
> that are multiplexed onto the actual CPUs. Make the task descriptors
> lighter even than a pthread like user level thread.
>
> I think that there may be a role for hardware to play here.
I do too, and I've already outlined it here a few months ago (e.g.
relating to cache scheduling and avoiding cache coherence and locks).
Some architects seem to think that the architecture should be trying to
figure this all out in hardware. There's no need and no use--just make
some processors available for software to do it.
> Most of the other problems with parallel programming are of the same
> ilk: explicit parallelism requires global management of resources,
> and requires performance programming from the very start.
Right, the programmer is asked not only to determine the operations to
perform and their order relative to one another, but how to partition
them among processors and then the order to run them on each processor.
Then, you put some compilers and OoO hardware in to remake many of those
decisions anyway, and now you seem to be suggesting that it try to make
even more of them. Better to just create better tools, languages, and
paradigms, rather than have the programmer fighting with the hardware
and compilers. An added plus: The programmer debugs the program s/he
wrote, rather than the one the compiler and hardware rewrote.
-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing
And the memory system is dominated by packaging.
High-pin-density DIMMs are not terribly expensive, and my guess
is that the cost comes from using larger-than-usual DRAMs in
order to get the part count down. High-pin-density DIMMs with
a tall form factor allowing more DRAM chips would be cheap
enough that lots of businesses would be willing to spring for
them in a server.
So here's the question: can you mount SODIMM sockets onto the
CPU package? Not the chip itself, but the OFCPGA thingy. My
guess is that you could mount four SODIMM sockets without
making that OFCPGA any bigger. This does make the heat sink
a bit more challenging.
Next question: could you get four *144-bit* SODIMM-like sockets
onto the OFCPGA? Now you've got as many pins as a high-end
graphics card.
Next question: could you get 8 of those sockets onto the OFCPGA?
If the sockets have a .300" pitch, the OFCPGA is now 3.5" by 2.5".
And you get 1000 pins to your memory! Tons of bandwidth!
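Back-of-envelope on that last step, assuming DDR333-class SODIMMs (the
speed grade is my assumption; the pin count is just 8 x 144):

\[
8 \times 144 = 1152 \ \text{data signals}, \qquad
1152\ \text{bits} \times 333\ \text{MT/s} \,/\, 8 \approx 48\ \text{GB/s}.
\]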
No, that addresses only the bandwidth issue. It certainly has
been done, and it works - no problem - but it addresses neither
the latency issue nor any of the obscurer ones introduced by the
memory access model.
Regards,
Nick Maclaren.
You aren't going to get some new paradigm of parallel computation software
by telling a bunch of PhD's to go into a room and figure it out. You are
going to get there by giving them large problems and large hardware and
allowing them to get it to work.
> As it is, basic research on parallel computation in the US is not
> being funded at anywhere near the level it should be, and I _am_
> offended by projects like Blue Gene/L.
>
Having sat in on meetings with people that make the purchasing decisions for
these machines, they know their workloads, they know what they need, and
they buy the most appropriate hardware for the job. There will be software
tools and knowledge that will come out of building applications for BG/L
that will impact both parallel hardware design as well as parallel software
design.
> In the process of understanding how to program parallel machines, you
> will, nearly for free, get alot of insight into how to build better
> parallel machines.
>
And Blue Gene/L is no different in this regard.
> For all the bilge and bother of this thread, I'm left with a very
> basic question: can you get the energy performance out of a classical
> architecture that you can get out of a streaming architecture. Two
> posters have stated without proof that they *know* that a streaming
> architecture won't beat a classical architecture on realistic code.
>
And I would agree. Having seen realistic code which is written in C and
Fortran, I don't give it much of a chance. If we can design a language that
will work well on streaming architectures, it will work better than C and
Fortran on current MPP machines. When we get that code base, we WILL design
and produce the hardware to exploit it.
As another tack...
If there is the money to fund a full infrastructure and production worthy
stream processors, then the cost of BG/L is a drop in the bucket. If not,
diverting the cost of BG/L to the work wouldn't make any real difference and
would penalize the science that will get done on BG/L.
Aaron Spink
speaking for myself inc
All together now:
There's a hole in my bucket, dear Liza, dear Liza, a hole ....
[ But, as you know, I agree that programming paradigms are the
place we should break out of the deadlock. ]
Regards,
Nick Maclaren.
I suggest Henry should put the stone into his bucket (with hole), and
carefully sink it down the well to wet it. Then he can sharpen his axe (I
would rather use a sickle to cut straw), cut the straw and fix the bucket.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:1aeosvcls60a5j7aa...@4ax.com...
>> If the DOE, or anyone else, were providing a realistic level of
>> funding for basic research in parallel computation, I would not be so
>> offended at their throwing however many million at just another big
>> machine.
>>
>They are providing that funding. You may not have the visibility into it,
>but a lot of people that have been associated or involved with ASCI have
>started to see the fruits of this labor. They have developed a variety of
>tools that are being used around the world ( just one example is the Beowulf
>clustering system ). Read some of the papers that came out of the ASCI team
>from SC2003. The work they did in modelling what the performance 'should'
>be versus what it was allowed them to figure out where the bottlenecks were
>and fix them.
>
>You aren't going to get some new paradigm of parallel computation software
>by telling a bunch of PhD's to go into a room and figure it out. You are
>going to get there by giving them large problems and large hardware and
>allowing them to get it to work.
>
And you're going to get better science by hiding basic research behind
the iron curtains of our bomb labs? Give me a break. Seen too much
of that stuff. I'm not going to try any harder to make friends than I
already have (that is not merely a joke, that is a sardonic joke), but
somebody has to inject a little truth into these discussions. See my
remarks elsewhere about the 10 teraflop machine that isn't behind the
iron curtain and the 1 teraflop that's being made available to science
in general.
The mission of places like LLNL is to preserve its budget and its
institutional integrity (as in intactness, not honesty). If you don't
have the perspective to see that, it's because you're feeding at the
trough. LLNL is as non-proprietary and as competent as Microsoft and
produces the same quality of work with everything hidden behind a dark
curtain where only friendly eyes can peek.
Bitter because I haven't been invited to the party? Sure, if I had a
contract with one of the DoE labs right now, I'd have the basic sense
of self-preservation to keep my mouth shut. I don't, and I don't know
that I want to, and that gives me a certain freedom to speak with
candor that I haven't had at times when I did have a clearer line of
sight.
Now, maybe the national security interests of the United States are so
different from those of Japan that they can afford to do science in
the open and we have to keep redoing the Manhattan Project and don't
have money left over for anything else, but color me skeptical.
>> As it is, basic research on parallel computation in the US is not
>> being funded at anywhere near the level it should be, and I _am_
>> offended by projects like Blue Gene/L.
>>
>Having sat in on meetings with people that make the purchasing decisions for
>these machines, they know their workloads, they know what they need, and
>they buy the most appropriate hardware for the job. There will be software
>tools and knowledge that will come out of building applications for BG/L
>that will impact both parallel hardware design as well as parallel software
>design.
>
I have not the slightest doubt that they know what they need. Read in
whatever level of cynicism you care to.
>> In the process of understanding how to program parallel machines, you
>> will, nearly for free, get alot of insight into how to build better
>> parallel machines.
>>
>And Blue Gene/L is no different in this regard.
>
B******t. Is there something qualitatively different that happens
when you put hundreds of thousands of processors together than when
you put just thousands of processors together? If there is, please
tell me what it is, and I'll have learned something about parallel
computing that I can't otherwise imagine.
If you put all the money, all the resources, and all the
decisionmaking in one place, you get an institutional steamroller and
a dimwitted monoculture. Put a few more independent eyes on the
problem and spread the decisionmaking and the thinking out and you'll
get more results faster. Read that: you'll get results.
>> For all the bilge and bother of this thread, I'm left with a very
>> basic question: can you get the energy performance out of a classical
>> architecture that you can get out of a streaming architecture. Two
>> posters have stated without proof that they *know* that a streaming
>> architecture won't beat a classical architecture on realistic code.
>>
>And I would agree. Having seen realistic code which is written in C and
>Fortran, I don't give it much of a chance. If we can design a language that
>will work well on streaming architectures, it will work better than C and
>Fortran on current MPP machines. When we get that code base, we WILL design
>and produce the hardware to exploit it.
>
I think Nick already addressed the circularity of your advice nicely.
>As another tack...
>If there is the money to fund a full infrastructure and production worthy
>stream processors, then the cost of BG/L is a drop in the bucket. If not,
>diverting the cost of BG/L to the work wouldn't make any real difference and
>would penalize the science that will get done on BG/L.
>
B******t again. Spoken like a big iron guy. Dribble a few of these
boxes and a little money outside the iron curtain, and then I'll start
to sound reasonable.
RM
Sure there is. Running a single system becomes even more
tricky for starters. Then there are the partitioning (data
and code) issues... And then there is the fact that the
wrong $5 part flaking out could trash a significant amount
of performance - and that'll happen *frequently* too... :)
There are probably a hell of a lot more that'll bite your
ass that I haven't even dreamt of in my worst nightmares.
I'd love to play with a machine like that to pick up some
new nightmares though. :)
Cheers,
Rupert
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:0ei1tv8b5fl8uaohr...@4ax.com...
>> On Fri, 05 Dec 2003 09:30:41 GMT, "Aaron Spink"
>> <aaron...@earthlink.net> wrote:
<snip>
>>
>> B******t. Is there something qualitatively different that happens
>> when you put hundreds of thousands of processors together than when
>> you put just thousands of processors together? If there is, please
>> tell me what it is, and I'll have learned something about parallel
>> computing that I can't otherwise imagine.
>
>Sure there is. Running a single system becomes even more
>tricky for starters. Then there are the partitioning (data
>and code) issues... And then there is the fact that the
>wrong $5 part flaking out could trash a significant amount
>of performance - and that'll happen *frequently* too... :)
>
Nice try, but no cigar. With (say) 100 times more chunks of whatever,
you're likely to find a failure in 1/100th of the time. That's
quantitatively, but not qualitatively different.
We know so much about the architecture of the hardware and software of
massively parallel machines that we can afford to build the world's
biggest computer for reliability testing?
Sure, there are scale-up unknowns in any engineering system, and you
never really know until you build the actual system. But in real
life, not the fantasyland of the DoE, you master the technology before
you build full scale to find out how many loose wires you get in the
real thing.
Now, mind you, DoE thinks it has mastered the technology. After all,
more money has been poured into that black hole to wire up thousands
of anything you could dream of than the GDPs of most of the world's
countries. And look at the sublime mastery of MPP that has resulted!
Partitioning with problems of size x is somehow different than
partitioning with problems of size 100x? IF you really need to test
the interbox networking, cable up three boxes. I'm sure IBM has
already done that. Give a university three boxes to play with.
Look. These guys are itching to set something off to see if things
still work. Let them do it. Until they do, they'll bleed the US dry
of research money working out fantasy scenarios of completely unknown
correspondence to reality on computers they don't understand. I
wouldn't be surprised if they're really spending all this money to
build that case.
>There are probably a hell of a lot more that'll bite your
>ass that I haven't even dreamt of in my worst nightmares.
>I'd love to play with a machine like that to pick up some
>new nightmares though. :)
>
That's just the point. Neither you nor anyone else with the capacity
to form an independent opinion will be given that chance.
RM
I have no idea if such a thing is possible or feasible. Just wondering....
del cecchi
You know, we could apply that brush just about anywhere. Doesn't make it
any more true.
> B******t. Is there something qualitatively different that happens
> when you put hundreds of thousands of processors together than when
> you put just thousands of processors together? If there is, please
> tell me what it is, and I'll have learned something about parallel
> computing that I can't otherwise imagine.
>
Yes! you run into more complex and difficult scaling issues that require
new ideas and work to get around. Solutions to scaling issues at a small
level tend to break down at a large level and solutions at a large level
tend to be very sub-optimal at a small level. This is pretty basic. Oh,
and all those thousands of FPUs in your streaming processor? Same issue. We
just get to do the work of figuring out the solution with roughly 1/4 of the
cost to do it on new and unproven hardware.
> If you put all the money, all the resources, and all the
> decisionmaking in one place, you get an institutional steamroller and
> a dimwitted monoculture. Put a few more independent eyes on the
> problem and spread the decisionmaking and the thinking out and you'll
> get more results faster. Read that: you'll get results.
>
Hey, and all those people at the national labs with very different ideas of
what they need and want are a monoculture. Give me a break.
> >And I would agree. Having seen realistic code which is written in C and
> >Fortran, I don't give it much of a chance. If we can design a language
that
> >will work well on streaming architectures, it will work better than C and
> >Fortran on current MPP machines. When we get that code base, we WILL
design
> >and produce the hardware to exploit it.
> >
>
> I think Nick already adressed the circularity of your advice nicely.
>
There is nothing circular there. We already have hardware that is shipping
and available that will make significant use of whatever this great
programming break through is. When the programmers can actually harness it,
we'll come out with something newer and bigger and wait another 30 years for
the software to catch up.
I could produce a stream processor in 3-4 years.
It would ship.
A couple of people would buy them.
They would get crappy performance.
That would be the last stream processor you would ever see.
This has been repeated often enough that it should be obvious. Until there
are the software tools available that will make it possible to extract the
performance of some new fangled architecture, any production quality
hardware will flop and kill the branch. The tools can be developed without
the hardware. The hardware cannot function without the tools. There is no
circle. From algorithms and software, hardware is designed.
Funny: you ask to be educated about parallel computing, and then when
someone attempts to, you slough them off.
Two orders of magnitude more frequent failures *qualitatively* changes the
approach you must take to tolerating them - at least for any given
computation granularity. Hell, people just went through a similar
discussion about the non-ECC Apple supercomputer in Virginia: its simple
size makes what might be tolerable in a smaller system something that must
be dealt with specially.
Similar issues occur in large storage systems, where you have to introduce
entire new intermediate levels of organization (involving limiting
distribution of replicas to local subsets of the overall array) to avoid
N**2 decreases in system MTTF (and you still have to deal with linear
decreases unless you also increase the level of replication - which is also
a step-function change involving qualitative trade-offs rather than a smooth
continuum).
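To put rough numbers on the qualitative point (independent failures and
round figures assumed, purely for illustration): with N parts each of MTTF
m, the system sees a failure about every m/N, so

\[
m = 5\ \text{years} \approx 44{,}000\ \text{h}: \qquad
N = 10^3 \Rightarrow \frac{m}{N} \approx 44\ \text{h}, \qquad
N = 10^5 \Rightarrow \frac{m}{N} \approx 26\ \text{min}.
\]

Going from "a failure every couple of days" to "a failure every half hour"
is exactly the sort of step that forces checkpointing and fault handling to
be designed in rather than bolted on.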
...
> Partitioning with problems of size x is somehow different than
> partitioning with problems of size 100x?
Well, yuh. Once again, any given problem has one or more 'natural'
partitioning granularities: try to distribute it at two orders of magnitude
finer granularity than the smallest of these, and not only does efficiency
drop markedly but just writing the software is a lot harder.
In this case, you may be talking about embarrassingly parallel problems
which can easily be run on systems a couple of orders of magnitude larger
than today's, so for that *particular* class this objection may not apply.
OTOH, that particular class also often lends itself to cluster-oriented
handling, so in at least some cases there's no need for a new hardware
platform to accommodate it.
All your ranting and raving about this doesn't seem to have convinced any of
the intelligent and well-informed people you've been talking with here -
many of whom have advanced objections that you haven't even come close to
addressing. If that doesn't tell you something, I don't know what will.
- bill
[SNIP]
> What would you guys do if you did have access to a partial blue gene/l ?
> I'm just hallucinating here on a friday afternoon wondering what would
> happen if I could get hold of one of those washing machine things that was
> used for bring up and get somebody (like the guy it belongs to actually)
to
> make it accessible.
Hmmm... OK, if I wasn't required to actually do useful work
with it and just 'play' with it ? First off I'd bring up an
OCCAM compiler & loader for the hell of it. Second off I'd
keep my toes warm running an incredibly lightweight parallel
mandelbrot. Sure, it's trivial, but getting data in and out
of the box fast enough to keep it sweating can present a
challenge and it'll teach me a fair bit about the machine
(OK, it did the last time I tried this *many* moons ago).
I figure it might make a really handy little render-farm for
folks who don't want to install more air-con to upgrade their
render farm, and that would fulfill my "must render pretty
pictures" requirement. Plus it should keep my toes warm. I'm
obsessed with it at the moment, because my flat is bloody
freezing. Sorry. :/
For my own personal-semi-serious-playtime I'd like to try
out some OS ideas. IBM ships the box with a single-tasking
midget kernel on the nodes. One idea I'd really like to
play with on a box like that is the one explored in this
group a while back by Glew - transactional memory (probably
not the term he used, not one I'd use either :/)... A box
like that would provide plenty of rope to hang yourself
with, which I think is important for playtime ideas.
> I have no idea if such a thing is possible or feasible. Just
wondering....
As long as my toes are warm and RM is drooling with rage &
envy I'll be happy. I honestly think he'd spontaneously
combust if he burst in on my zooming around on a 3D fly-by
mandelbrot. If that didn't work I'd show him the OCCAM 2
source and I think he'd just die on the spot with rage at
the sight of that horrible neanderthal CSP code. :)
I'd invite O'Connor round too, so he could kick the tires
on it (well, OK, kick the shit out of it) so I could test
what happens when stuff goes wrong. If Dennis didn't feel
like doing that I suppose I could borrow a cat which is
*bound* to find the weaknesses in no time flat.
I'm sure there are plenty of people here who would use a
box like that far more constructively. I'm wondering if
the weather folks would get useful work out of such a
box. I hope Toone is reading.
Cheers,
Rupert
Bill, I've more or less shut my mouth when you've told me I'm off my
turf. You think I might possibly have something going for me here
aside from just having a bad hair day? You're right, I don't know a
damn thing about computers for banks and corporations. Thanks to you,
I've learned a lot.
>Two orders of magnitude more frequent failures *qualitatively* changes the
>approach you must take to tolerating them - at least for any given
>computation granularity. Hell, people just went through a similar
>discussion about the non-ECC Apple supercomputer in Virginia: its simple
>size makes what might be tolerable in a smaller system something that must
>be dealt with specially.
>
That's right. With a smaller-scale system you take the approaches to
quality that you would need for a scalable system. I had this
argument in another forum with someone who is probably silently
following this exchange. You do that so that the small-scale system
is scalable. He didn't seem to think that was important. I did.
With the small-scale system, especially with many small scale systems
in many hands, you can get a very good handle on failure rates,
especially if you recognize it as a task requiring attention.
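To put rough numbers on the two-orders-of-magnitude point (the figures
below are assumptions for illustration, not measurements of anything):

/* Back-of-envelope sketch, my numbers, nobody's measurements: if node
 * failures are independent, the machine's MTTF is roughly the node
 * MTTF divided by the node count, which is where "two orders of
 * magnitude more frequent failures" comes from. */
#include <stdio.h>

int main(void)
{
    double node_mttf_hours = 5.0 * 365.0 * 24.0;  /* assume ~5 years/node */
    int sizes[] = { 1000, 10000, 100000 };

    for (int i = 0; i < 3; i++) {
        double system_mttf = node_mttf_hours / sizes[i];
        printf("%6d nodes: system MTTF ~ %.1f hours\n", sizes[i], system_mttf);
    }
    /* 1,000 nodes -> ~44 hours between failures; 100,000 -> ~26 minutes.
     * Measuring the per-node rate on many small systems is feasible;
     * living with the 26-minute number is the qualitative change. */
    return 0;
}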
>Similar issues occur in large storage systems, where you have to introduce
>entire new intermediate levels of organization (involving limiting
>distribution of replicas to local subsets of the overall array) to avoid
>N**2 decreases in system MTTF (and you still have to deal with linear
>decreases unless you also increase the level of replication - which is also
>a step-function change involving qualitative trade-offs rather than a smooth
>continuum).
>
In the circumstances you are talking about and which you understand in
detail, banks and corporations are not slapping together networks of
thousands of machines with a new architecture and using software
techniques that are poorly understood. In fact, you've put your
finger on a succinct way of saying it: no bank, no enterprise would go
about it this way. Only our national laboratories, with their unique
charter to warm the earth by burning money could get away with the
record of nearly reckless experimentation that has gone on with very
little payoff.
I've listened to you rant and rave about Intel and its misadventures
with Itanium. By comparison with the results that have been delivered
by MPP in the hands of the bomb labs, Itanium has been a technological
miracle.
Problems arise for the telephone system that don't arise in Wichita,
Kansas, but I'll bet that, especially in the old days of AT&T, if
you'd gone from one place to the other, you'd have been able to map
the hardware easily from one place to another.
>...
>
>> Partitioning with problems of size x is somehow different than
>> partitioning with problems of size 100x?
>
>Well, yuh. Once again, any given problem has one or more 'natural'
>partitioning granularities: try to distribute it at two orders of magnitude
>finer granularity than the smallest of these, and not only does efficiency
>drop markedly but just writing the software is a lot harder.
>
>In this case, you may be talking about embarrassingly parallel problems
>which can easily be run on systems a couple of orders of magnitude larger
>than today's, so for that *particular* class this objection may not apply.
>OTOH, that particular class also often lends itself to cluster-oriented
>handling, so in at least some cases there's no need for a new hardware
>platform to accommodate it.
>
In getting it down to that one sentence, I thought, do I really need
to wrap this in protective language about the obvious? If you can
identify scaling issues having to do with partitioning, you need to
find a way to test them. If you can't figure out how to do that short
of building a machine with hundreds of thousands of processors, it is
either for want of imagination or because you just want to build a
machine with hundreds of thousands of processors.
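And some of that testing is cheap. Even a toy model shows the shape of
the granularity problem before any hardware exists; the constants below
are invented and only the trend matters:

/* Toy efficiency model, purely illustrative: each piece of work costs
 * (compute per piece) + (fixed per-piece overhead for communication
 * and scheduling).  Slice the same total work 100x finer and the
 * overhead term takes over. */
#include <stdio.h>

int main(void)
{
    double total_work = 1.0e10;    /* arbitrary work units          */
    double overhead   = 1.0e4;     /* assumed fixed cost per piece  */
    long pieces[] = { 1000, 100000, 10000000 };

    for (int i = 0; i < 3; i++) {
        double per_piece = total_work / pieces[i];
        double eff = per_piece / (per_piece + overhead);
        printf("%9ld pieces: efficiency ~ %.3f\n", pieces[i], eff);
    }
    /* ~0.999 at 1e3 pieces, ~0.909 at 1e5, ~0.091 at 1e7: two orders
     * of magnitude finer than the natural granularity and the machine
     * is mostly shuffling messages. */
    return 0;
}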
>All your ranting and raving about this doesn't seem to have convinced any of
>the intelligent and well-informed people you've been talking with here -
>many of whom have advanced objections that you haven't even come close to
>addressing. If that doesn't tell you something, I don't know what will.
>
Nice of you to speak for the group. If I thought I was alone on this,
I'd shut up. There really aren't many people who know anything about
this who are also going to be candid about the way the people who control
the money go about their business.
RM
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:0ei1tv8b5fl8uaohr...@4ax.com...
>> The mission of places like LLNL is to preserve its budget and its
>> institutional integrity (as in intactness, not honesty). If you don't
>> have the perspective to see that, it's because you're feeding at the
>> trough. LLNL is as non-proprietary and as competent as Microsoft and
>> produces the same quality of work with everything hidden behind a dark
>> curtain where only friendly eyes can peek.
>>
>The mission of every university and PhD is to preserve its budget and its
>institutional integrity ( as in intactness, not honesty). If you don't have
>the perspective to see that, it's because you're feeding at the trough or
>just plain ignorant. Universities and PhD's are as non-proprietary and as
>competent as Microsoft and produce the same quality of work with everything
>hidden behind a dark curtain where only friendly eyes can peek.
>
>You know, we could paint that brush just about anywhere. Don't make it
>anymore true.
>
You couldn't just paint that brush anywhere.
The difference, and it is a **big** one, is that there is no
uberreichsfuhrer of universities and of academic research. There is a
peer review system that is open. Universities and individual
researchers compete with each other openly and aggressively for funds
and attention. They can be just as blunt as I have been in
criticising each other's work, and it gets heard.
The peer review system of science can be almost as tyrannical as the
DoE, but at least it is open and is constantly subject to
critical review.
I brought up the issue of Microsoft, which houses Leslie Lamport and
Tony Hoare but which *still* manages to crank out buggy software by
the megaline. I'm hard over on open source for research: I want
thousands of eyes to see everything. The bomb lab system is as closed
as you can get.
>> B******t. Is there something qualitatively different that happens
>> when you put hundreds of thousands of processors together than when
>> you put just thousands of processors together? If there is, please
>> tell me what it is, and I'll have learned something about parallel
>> computing that I can't otherwise imagine.
>>
>Yes! you run into more complex and difficult scaling issues that require
>new ideas and work to get around. Solutions to scaling issues at a small
>level tend to break down at a large level and solutions at a large level
>tend to be very sub-optimal at a small level.
1. You put up with sub-optimality at small scale in anticipation of
moving to larger scale. Some people don't think that's a good idea.
I do, and I think that's pretty basic.
2. I will ultimately defer to you on this because
a. You're a real hardware guy. I'm not.
b. You're closer to the problem than I am.
>This is pretty basic. Oh,
>and all that thousands of FPUs in your streaming processor? same issue. We
>just get to do the work of figuring out the solution with roughly 1/4 of the
>cost to do it on new and unproven hardware.
>
Couple of points here.
1. I've listened to what people have had to say about streaming
processors. Many good points have been raised, and I'm not so (what's
the word?) callow (a brief visit to the web to check the dictionary to
make sure I had the definition right, and to the thesaurus to see if I
couldn't find a more precise word. I couldn't.) as to think, or so mad
as to propose, that we should divert even a significant fraction of the
nation's research money for computer architectures and/or
supercomputers into exploring them. I certainly don't think we should
run off and build a supercomputer out of them.
2. I do think that streaming architectures deserve more attention and
funding than they're getting.
3. My gut instinct tells me that we will go through a revolution just
about that dramatic before we run out of Moore's law room for
improvement and that we should be looking for that revolution
actively, rather than just hoping that it will happen.
>
>> If you put all the money, all the resources, and all the
>> decisionmaking in one place, you get an institutional steamroller and
>> a dimwitted monoculture. Put a few more independent eyes on the
>> problem and spread the decisionmaking and the thinking out and you'll
>> get more results faster. Read that: you'll get results.
>>
>
>Hey, and all those people at the national labs with very different ideas of
>what they need and want are a monoculture. Give me a break.
>
They all work for the same boss, and the boss decides where the big
bucks go. If you want to take this offline, I'll be glad to. I have
a lot of energy for this obviously sore topic.
I have friends, enemies, former colleagues, former office mates
working in the national lab system. They have a variety of
personalities and capabilities. I don't think the same thing of all
of them and they don't all think the same thing of me. Yup. There's a lot
of diversity in the people, but that does not translate into anything
close to the peer-reviewed system of open scientific research.
People have taken the secrecy that was justified during World War II
and turned it to other purposes. Even during WWII and afterward, the
secrecy didn't work to protect secrets and it still doesn't. Stalin
got plans for the A-bomb, then for the H-bomb, and China has plans for
our latest warheads. You can bet somebody inside Al Qaeda does, too.
Hasn't worked at all for protecting national security secrets, but
it's golden for protecting careers, budgets, and cozy relations.
I could be *much* more blunt, but I don't want the FBI at my door.
>
>> >And I would agree. Having seen realistic code which is written in C and
>> >Fortran, I don't give it much of a chance. If we can design a language
>that
>> >will work well on streaming architectures, it will work better than C and
>> >Fortran on current MPP machines. When we get that code base, we WILL
>design
>> >and produce the hardware to exploit it.
>> >
>>
>> I think Nick already adressed the circularity of your advice nicely.
>>
>There is nothing circular there. We already have hardware that is shipping
>and available that will make significant use of whatever this great
>programming break through is. When the programmers can actually harness it,
>we'll come out with something newer and bigger and wait another 30 years for
>the software to catch up.
>
Um, the software is supposed to catch up in the nearly complete absence
of money to do anything other than build huge supercomputers and fund
the national laboratories that are aggrandizing themselves with them.
>I could produce a stream processor in 3-4 years.
>It would ship.
>A couple of people would buy them.
>They would get crappy performance.
>That would be the last stream processor you would ever see.
>
>This has been repeated often enough that it should be obvious. Until there
>are the software tools available that will make it possible to extract the
>performance of some new fangled architecture, any production quality
>hardware will flop and kill the branch. The tools can be developed without
>the hardware. The hardware cannot function without the tools. There is no
>circle. From algorithms and software, hardware is designed.
>
Historically, it just hasn't happened that way.
RM
<snip>
>What would you guys do if you did have access to a partial blue gene/l ?
>I'm just hallucinating here on a friday afternoon wondering what would
>happen if I could get hold of one of those washing machine things that was
>used for bring up and get somebody (like the guy it belongs to actually) to
>make it accessible.
>
>I have no idea if such a thing is possible or feasible. Just wondering....
>
Same thing I do now with my pile o' pc's. Load stuff up, compile it,
run it, profile it. Compare it on a cost and ease of use basis with
my pile o' pc's and other computers I have known.
Say even more nice things about IBM. ;-).
Codes appropriate for MPP? I got codes up the wazoo. I'd freak out
just thinking what I'd want to do with whatever limited access I got.
And you can bet that, even if the access was free, I'd do a lot of
small scale studying so as not to waste whatever opportunity I got.
Got a simulator? I'll take that.
Don't want to trust me with a simulator? I'll take access to a
simulator.
RM
Iain> And the memory system is dominated by packaging.
Iain> [better packaging...]
Nick> No, that addresses only the bandwidth issue.
It addresses latency, too. Getting through northbridges is a
waste of time. The CPU has to directly signal the DRAMs to
get latencies in the <70ns range. Once that's done, the next
steps to improving latency are:
1 speculative execution on data return. You want to start
up the CPU core as soon as the data comes back. Graduate
only after ECC and ownership issues have been resolved.
It looks as if K8 falls down here, probably because the
legacy K7 core doesn't really do data speculation.
2 reduce the capacitance on the wires between CPU and DRAM.
Get rid of the CPU socket and motherboard traces and vias.
Get rid of the pitch-matching area required for large
wire density changes.
3 get a strobe signal per DRAM to the CPU. This permits
improvement in the signalling between the two, allowing
lower latency physical-level protocols.
4 higher speed, lower density DRAM. Once processors
connect directly to DRAM and the DRAM core speed actually
matters, you'll see DRAM manufacturers start to tweak this
tradeoff towards better latency. It's already happening
in the graphics and network processor world, but mostly to
improve bandwidth.
2 and 3 are directly about packaging... when the CPU vendor has
tighter control over the memory interface, they can cut down
the margins required to get the thing working regardless of
who supplies the components. Then you get smaller, faster,
denser chip I/O.
Then there's the stuff that doesn't have to do with packaging,
but just good implementation: How long does it really have to
take to get through the TLB, fetch the L2 tags, find the miss,
and get to the pins? 5ns? Right now I bet it takes a lot
longer than that.
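To make that concrete, here is a guessed-at budget for a load that
misses all the way to DRAM on a northbridge-style system. Every number
is illustrative, not a measurement of any shipping part:

/* Rough latency budget for a load miss to DRAM -- all numbers are
 * guesses for illustration.  The point is how much of the total is
 * "getting to the pins" and crossing a northbridge rather than the
 * DRAM core itself. */
#include <stdio.h>

int main(void)
{
    struct { const char *what; double ns; } leg[] = {
        { "TLB + L1/L2 lookup, declare the miss", 10.0 },
        { "cross CPU pins, traces, northbridge",  25.0 },
        { "DRAM row activate + CAS",              40.0 },
        { "return trip, ECC, fill, wakeup",       25.0 },
    };
    double total = 0.0;
    for (int i = 0; i < 4; i++) {
        printf("%-40s %5.1f ns\n", leg[i].what, leg[i].ns);
        total += leg[i].ns;
    }
    printf("%-40s %5.1f ns\n", "total", total);
    /* Direct CPU<->DRAM signalling attacks the middle legs; data
     * speculation on return attacks the last one; the DRAM core is
     * only part of the story. */
    return 0;
}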
I'm curious, what kind of simulation would you be looking
at here ? Simulating MPPs always struck me as a bit of a
mission impossible... Simulating the comms networks, useful,
but probably hideously difficult to get right... Simulating
the build/execution environment - certainly doable, but
would it be *really* useful beyond porting your code ?
When folks say Simulator in this NG I tend to think of
people counting CPU cycles. That doesn't sound like what I
*think* you would be asking for here.
Cheers,
Rupert
<snip>
>
>I'm curious, what kind of simulation would you be looking
>at here ? Simulating MPPs always struck me as a bit of a
>mission impossible... Simulating the comms networks, useful,
>but probably hideously difficult to get right... Simulating
>the build/execution environment - certainly doable, but
>would it be *really* useful beyond porting your code ?
>
>When folks say Simulator in this NG I tend to think of
>people counting CPU cycles. That doesn't sound like what I
>*think* you would be asking for here.
>
I'm doing work on Itanium. I don't own an Itanium processor. I could
purchase or probably get access to one, but there is *so* much of what
I am doing that you can do with a real simulator and can't do with
a real processor, and I anticipate such drastic changes in the Itanium
architecture that development work around an actual, physical Itanium
seems inappropriate for someone with my long-range goals and limited
resources.
The simulators that are available to me aren't cycle-accurate. I wish
they were, but even that's not critical. If it were, I'd do whatever
cozying-up was necessary to get access to one that was.
The Blue Gene processor has a unique architecture (dual processor, one
serving as a network processor, four wide DP vector unit you can use
only when the network processor isn't busy. Weird stuff.). I have no
idea what could be done with the architecture, but with any kind of
simulator in my hands I might begin to get an idea.
The world beyond the single CPU? Somebody's got simulators for those,
too. I wouldn't be too ambitious in hoping that IBM would lift its
corporate skirts, but maybe the honchos at LLNL can make available
whatever they've been spending my tax dollars on.
Absent any help from anybody, give me a simulator that executes the
ISA in anywhere near the right order and I'll make something out of
it.
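And "executes the ISA in anywhere near the right order" isn't asking
for much; a functional simulator is basically a decode and dispatch
loop. The toy ISA below is invented for illustration and has nothing
to do with the PPC 440:

/* Minimal functional-simulator skeleton, of the "roughly the right
 * order" kind rather than anything cycle-accurate.  A real model
 * would decode real opcodes and wrap the SoC pieces around the core. */
#include <stdint.h>
#include <stdio.h>

enum { OP_HALT, OP_LOADI, OP_ADD, OP_STORE };

typedef struct { uint8_t op, rd, ra, rb; int32_t imm; } insn_t;

int main(void)
{
    int32_t reg[8] = { 0 };
    int32_t mem[16] = { 0 };
    insn_t prog[] = {
        { OP_LOADI, 1, 0, 0, 40 },
        { OP_LOADI, 2, 0, 0,  2 },
        { OP_ADD,   3, 1, 2,  0 },
        { OP_STORE, 3, 0, 0,  0 },   /* mem[0] = r3 */
        { OP_HALT,  0, 0, 0,  0 },
    };
    unsigned long icount = 0;

    for (int pc = 0; ; pc++, icount++) {
        insn_t in = prog[pc];
        if (in.op == OP_HALT) break;
        switch (in.op) {
        case OP_LOADI: reg[in.rd] = in.imm;                  break;
        case OP_ADD:   reg[in.rd] = reg[in.ra] + reg[in.rb]; break;
        case OP_STORE: mem[in.imm] = reg[in.rd];             break;
        }
    }
    printf("ran %lu instructions, mem[0] = %d\n", icount, mem[0]);
    return 0;
}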
RM
Briefly: I felt (and still feel) that if you are building a 1024-CPU
MPP then you should build a 1024-CPU MPP. Which means that
scalability is irrelevant.
Robert convinced me that he believed that it is not permitted to
design a 1024-CPU MPP. Instead, one _must_ repeat _must_ design the
system so that it could be efficiently expanded to 10K or 100K CPUs.
Which means that scalability is essential.
I dropped the thread ("Scalability is a Boojum") once I realized that
both viewpoints were valid, depending on how you were squinting and
holding your tongue. ;-)
It also depends on what you're building the computer for.
If you're a corporation doing financial analysis or computation to
support in-house research or to fulfill a contractual requirement,
then scalability is, indeed, someone else's problem.
If you're a public or nonprofit institution relying on taxpayer
dollars or taxpayer-subsidized donated dollars building a computer for
research on computation, then I will call upon the gods of computation
to make your life miserable if you build a research computer without
giving a thought to scalability.
If you're a computer manufacturer supplying either of the above
classes of clients, then whether to build in scalability and what
price to pay for it is a business decision. I've noticed that most
computer manufacturers are keen on selling their hardware as scalable.
RM
Yes, but it is a much lesser point than for bandwidth. Only half
of those points are packaging (at least directly), and you are
talking about a factor of two improvement at most, more probably
only 20%.
>Then there's the stuff that doesn't have to do with packaging,
>but just good implementation: How long does it really have to
>take to get through the TLB, fetch the L2 tags, find the miss,
>and get to the pins? 5ns? Right now I bet it takes a lot
>longer than that.
Yes. My belief is that you could do a lot better by rethinking
this area. With a fairly modest redesign, I believe that you could
effectively eliminate all TLB delays and make some pretty impressive
improvement on some other aspects. This wouldn't make much difference
to most programs, but would essentially solve the problem of ones
that crawl to a halt (or even die) because of TLB mishandling.
But, for REAL improvements, you need to think a lot more radically.
I can't think of any way of reducing memory latency by more than a
small factor, so the only approach is to make a design where it is
less of a bottleneck. And that is back to programming paradigms.
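To give one small-scale example of what "less of a bottleneck" can mean
at the program level: block the work and get the next data in flight
while computing on the current block. The sketch below leans on GCC's
__builtin_prefetch purely as a stand-in; a serious design would rely on
many outstanding misses, DMA engines or extra threads rather than hints:

/* Latency hiding in miniature: ask for data a fixed distance ahead of
 * where we are computing, so the miss overlaps useful work.  The
 * prefetch distance is an assumption; __builtin_prefetch is GCC-
 * specific. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    enum { AHEAD = 64 };   /* prefetch distance in elements, assumed */
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    for (long i = 0; i < N; i++) {
        if (i + AHEAD < N)
            __builtin_prefetch(&a[i + AHEAD], 0, 0);  /* read, no reuse */
        sum += a[i];
    }
    printf("sum = %.1f\n", sum);
    free(a);
    return 0;
}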
Regards,
Nick Maclaren.
That just doesn't ring true. As you scale, seemingly
small problems or issues can make their presence felt
in an exponentially bad kind of way. As we've both
commented the *likelihood* of a failure increases as
you scale... The consequences of failure can be far
harder to predict and contain too. MPP ain't trivial
unfortunately.
> argument in another forum with someone who is probably silently
> following this exchange. You do that so that the small-scale system
> is scalable. He didn't seem to think that was important. I did.
> With the small-scale system, especially with many small scale systems
> in many hands, you can can get a very good handle on failure rates,
> especially if you recognize it as a task requiring attention.
You can get an *inkling* but it's just not the same
as actually living with a large system. The cunning
folks at SGI who developed the Origin line must have
a huge number of stories to tell about this kind of
thing. Personally I don't think for one minute that
from day 1 they had a 4 -> 256CPU box that scaled
linearly and maintained the same MTTF. ;)
Cheers,
Rupert
OK ... Starting point might be a PPC 440 Simulator, there are
a few around. You can probably google, deduce and ask about the
rest of the ASIC's bits and pieces and probably tack them onto
the side of the simulator (remember, this is an *embedded core*
so the tools are probably designed to cover some kind of
simulation of the stuff bolted onto the side of the core).
That leaves the network itself, which I think is where the real
problems will start... :)
> The world beyond the single CPU? Somebody's got simulators for those,
> too. I wouldn't be too ambitious in hoping that IBM would lift its
> corporate skirts, but maybe the honchos at LLNL can make available
> whatever they've been spending my tax dollars on.
I must admit I'm curious as to what those folks have done in
that area.
> Absent any help from anybody, give me a simulator that excecutes the
> ISA in anywhere near the right order and I'll make something out of
> it.
Try www.ibm.com. Look for PowerPC 440, you'll find lots of
hits.
This hit grabbed my attention :
http://www-306.ibm.com/chips/products/powerpc/newsletter/aug2001/new-prod3.html
Looks like it might give you some hints as to what they are
doing with the FP side of things. Best of luck with your
simulator... :)
Cheers,
Rupert
That's a good link. Who knows how far I might be able to get with the
right prodding. ;-).
Given the right tools, I might learn some humility. With a toy to
play with (and barring help from anybody else, I can build a network
with toys), maybe The Force can be with me, too, and what is so
obvious to everybody else will become obvious to me.
At least I'll be too busy trying to come up with something to say
publicly to do much ranting and raving. If that isn't an incentive
for somebody, I don't know what is.
>
>Looks like it might give you some hints as to what they are
>doing with the FP side of things. Best of luck with your
>simulator... :)
>
I've gotten whacked pretty hard here. That's all right. After all, I
started it.
I don't think anybody gets the main point. What's happening on
*today's* architectures isn't all that important. Where we're headed
is.
Whether the Cray-I (vector or scalar, COMA) architecture was the best
architecture for many problems or not didn't matter. If you were going
to be doing certain kinds of problems, you were going to wind up on a
Cray architecture, so it behooved you to be thinking about it when you
were writing code.
What architecture to think about now? Whatever Aaron Spink nodded yes
to in a secret meeting at LLNL?
Beowulf clusters are nice, but you're never going to do the kinds of
problems I've been talking about on a beowulf cluster. I think Va
Tech has other ideas, but I've already done my ranting and raving
about that.
So, when I get down to writing code in whatever language I choose to
write in, what language do I use?
If a real machine existed that was friendly to Occam, maybe I'd be
willing to invest my time in it. If you'd had a chance to play with
one of these boxes and told me that Occam was a good thing to try,
that would cut a lot of ice. Just knowing what your experiences were
would be useful, whether I wound up using Occam or not.
As it is, we're going to learn what's on the DoE roadmap and whatever
they want us to know and no more. And this bizarre enterprise is
being done with taxpayer dollars (and, I might add, a significant IR&D
contribution from IBM. IBM wants its IR&D work classified and
massaged by the DoE? They must be desperate).
What architecture do I envision so I don't have to rewrite the code a
thousand times?
The hell with it. Just do what the national labs are doing: write C
or Fortran using some more or less accepted methodology and hope that
when the time comes it can be jiggered into whatever MPP architecture
is available at the time.
That's a lousy approach. Anybody trying to do something really
serious knows better.
One final cynical remark: it's not a lousy approach for the people at
the bomb labs who are going to hang on until retirement moving the
software furniture from one building to another while the rest of the
world has to scramble to do real innovation just to eat.
RM
Indeed, some of which still can't be told for NDA reasons. ;-}
+---------------
| Personally I don't think for one minute that from day 1 they had a
| 4 -> 256CPU box that scaled linearly and maintained the same MTTF. ;)
+---------------
Well, I think it's probably safe to say that quite a number of unexpected
scaling effects did show up, some of them quite significant, and that most
of the weird ones were at the higher end of the scale (with more than, say,
32P). Given the published architecture [see Mashey's classic news posting
on NUMAlink], the magic steps tended to be at >2, >8, >64, and >256 [yes,
a few 512P O2000's were built, IIRC]. For the Origin 3000 [which had more
CPUs/node & more ports per router chip], those steps were at 4, 16, 256,
and ~512 or so. [Not sure about Altix...]
Note that scaling effects also tend to be *extremely* application-dependent.
You can be running along on a given configuration with app #1 for months or
years on end, happy as a clam, having no problems at all, and then one day
the user decides to run app #2 instead, which makes a different pattern of
memory references: *Whammo!!*
-Rob
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
I am not constrained by NDA in this respect, and can confirm your
statement. I can also say that such problems are not unique to SGI,
and have hit every vendor of close-coupled systems that I know of,
over 35 years, at virtually every increase in size.
Rupert Pigott's doubts are fully justified :-)
>Note that scaling effects also tend to be *extremely* application-dependent.
>You can be running along on a given configuration with app #1 for months or
>years on end, happy as a clam, having no problems at all, and then one day
>the user decide to run app #2 instead, which makes a different pattern of
>memory references: *Whammo!!*
I can confirm that one, too, and not just on SGIs.
Regards,
Nick Maclaren.
So, let me get this straight. I'm new to this field, so I'm not
completely familiar with the, um, engineering practices. Until this
thread, I thought the kind of lunacy I'm following was confined to the
software side of the house.
We have no theory of scaling. In fact, we have no theory. Not for
hardware, not for software. The only way to find out if a design is
right is to build it, full scale, and run the actual software you want
to run on it, full up. I guess software and hardware of all sizes
must come with the same disclaimer about merchantability and fitness for
any purpose whatsoever.
Have I got this straight, or was my mind wandering off to wondering if
I had stumbled into a high school shop class when somebody said
something that made sense?
RM
>So, let me get this straight. I'm new to this field, so I'm not
>completely familiar with the, um, engineering practices. Until this
[...]
>We have no theory of scaling. In fact, we have no theory. Not for
>hardware, not for software. The only way to find out if a design is
>right is to build it, full scale, and run the actual software you want
>to run on it, full up. I guess software and hardware of all sizes
When you say "In fact, we have no theory", that's not a
bad starting point. It helps to explain why there's the
million dollar question: does P equal NP?
Though I think that question can be somewhat side-stepped,
it's a reminder that we're still lacking some insights...
d
Well, yes, and I wouldn't be so nervy except that I have a pretty good
idea of the classes of problems the DoE would like to attack. Some of
the problems they'd like to attack are theoretically unsolvable, and
in those cases they've already chosen or are merely tweaking and
testing heuristic quasi-solutions they've already pretty much settled
on.
There are no perfect theories, but there are much better theories than
anecdotal evidence presented in comp.arch. One of my big beefs here
is that I would bet my last nickel that the DoE hasn't budgeted a
nickel to test and expand what theoretical knowledge of parallel
systems does exist. Because not everyone is bought into the corporate
mentality of the money-burning factory at Livermore, some of that work
might get done, anyway, but it will be done on the sly.
As to the scalability of hardware and failure rates and making
allowances for the unexpected, I would probably rate IBM as the
world's largest repository of knowledge both theoretical and practical
as it applies to computers, and in that sense, they are the ideal
incumbent contractor. There *might* be some money in the budget for
it, but I doubt if the DoE is funding any of it or assigning anyone to
get smart about it.
RM
After all there is a lot, no, make that A Lot Of Money floating around
out there in bio and genomics land. And some of that money could flow
to purveyors of tools to the trade. And who might that be? Satan...
no wait, just flashing back to a Church Lady episode. It could be
someone with a million processor petaflop machine eh?
Perhaps there is an architectural silver bullet out there. Be happy to
build it. Just break it down into chunks of less than 50 or 100 million
gates and 1000 signals or so. Bring money. In the meantime folks will
be using the tools they have to attack the problem.
del cecchi
I can't speak to the current round, but from a five year old
perspective, this is on target. I asked Gil Weigand once, when he was
heading the ASCI program, whether he would consider spending 1% of the
program budget on computer architecture and software research, and was
told that it was impossible, that every penny was desperately required
to solve critical immediate problems. I see no sign that this
attitude has changed. As long as this is true, we will see no
significant innovation. David Tennenhouse periodically gives a great
speech about how essential it is for the universities to perform
groundbreaking research in computer architecture, telling us how Intel
is building on the past university research portfolio. Then you try
to convince him that Intel might support some of that research, and
nothing happens. The innovative computer architecture "community" has
been intentionally or unintentionally completely trashed over the past
decade.
>I can't speak to the current round, but from a five year old
>perspective, this is on target. I asked Gil Weigand once, when he was
>heading the ASCI program, whether he would consider spending 1% of the
>program budget on computer architecture and software research, and was
>told that it was impossible, that every penny was desparately required
>to solve critical immediate problems.
ASCI Path Forward spends money on both hardware and software. It's
probably not what you would call "computer architecture and software
research", but it's not nothing.
As an example of what they're interested in, see:
http://www.llnl.gov/asci/pathforward_trilab/OSSODA_PF_RFI_V9e.pdf
-- greg
>Robert Myers <rmy...@rustuck.com> writes:
>> There are no perfect theories, but there are much better theories than
>> anecdotal evidence presented in comp.arch. One of my big beefs here
>> is that I would bet my last nickel that the DoE hasn't budgeted a
>> nickel to test and expand what theoretical knowledge of parallel
>> systems does exist. Because not everyone is bought into the corporate
>> mentality of the money-burning factory at Livermore, some of that work
>> might get done, anyway, but it will be done on the sly.
>
>I can't speak to the current round, but from a five year old
>perspective, this is on target. I asked Gil Weigand once, when he was
>heading the ASCI program, whether he would consider spending 1% of the
>program budget on computer architecture and software research, and was
>told that it was impossible, that every penny was desparately required
>to solve critical immediate problems.
At the risk of igniting discussion about side issues, the answer that
was given to me as a part of a group when it was brought up that we
really needed a more solid theoretical basis for what we were doing (I
might even have been seated in a national laboratory at the time. In
fact, I think I was): "You're asking to spend money on research when
the military is scrimping on buying ammunition." I wouldn't be
surprised if someone who needs to deliver the line often practices
it in front of a mirror to make sure it sounds like the "There
will be no more discussion" line it is intended to be.
>I see no sign that this
>attitude has changed. As long as this is true, we will see no
>significant innovation. David Tennenhouse periodically gives a great
>speech about how essential it is for the universities to perform
>ground breaking research in computer architecure, telling us how Intel
>is building on the past university research portfolio. Then you try
>to convince him that Intel might support some of that research, and
>nothing happens. The innovative computer architecture "community" has
>been intentionally or unintentionally completely trashed over the past
>decade.
>
Unintentionally, I really do think. I've been around for a while,
and, from my perspective, the last decade was a helluva ride. It's
just that all the money for computer architecture came from PC and
small server applications, and that's where, for the most part, all
the money has been spent.
I don't get points here for it, but that's why I'm enthusiastic about
Itanium. I really don't care what Intel's motives are, the
architecture raises really good and very basic issues and Intel has
funded some good research that otherwise wouldn't have been done.
And, for that matter, why should the government put money into
computer architecture when kids playing computer games are funding
really smart people to come up with really amazing stuff? Cannot help
mentioning that I believe the current highest single processor Spec
CPU2000 score for a single processor is held no longer by Itanium, but
by the P4EE, a special edition built to keep PC gamers from defecting
to Opteron.
The answer, of course, is that we've all known for a very long time
that we'd have to figure out how to get lots of processors to work
together, and nobody that matters understands that we really don't
know how to do that correctly.
Is the CEO of IBM or Intel going to sit down with the Secretary of
Energy and patiently explain to him that, when you get right down to
it, it's a matter of luck that computers with more than one processor
don't fail at critical moments, because the little bit of theory we do
have to address the issue is rarely used?
Somebody, somehow, needs to convince somebody that matters that the
COTS (Commercial-off-the-shelf) strategy and asking industry to pick
up the tab for basic research is a recipe for national technological
suicide. I found the documentation surrounding the DoE's
reaction to the Earth Simulator--and the fact that they didn't even
realize it was embarrassing enough to hide--just incredibly
discouraging. It served the needs of the point I was trying to make
here well, but g*d help us!
I'll get down off the soapbox now. I'm not a very devious person. If
I had an answer, I'd have proposed it straightaway. I've done the
only thing I know how to do, which is to address the audience I have
access to with what I believe are legitimate concerns in the hope that
a way can be found to address them.
RM
>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:cdh7tvolnis2k14es...@4ax.com...
>>
>> As to the scalability of hardware and failure rates and making
>> allowances for the unexpected, I would probably rate IBM as the
>> world's largest repository of knowledge both theoretical and practical
>> as it applies to computers, and in that sense, they are the ideal
>> incumbent contractor. There *might* be some money in the budget for
>> it, but I doubt if the DoE is funding any of it or assigning anyone to
>> get smart about it.
>>
>> RM
>>
>Much as you might not believe it, there is in fact a possibility that
>the PetaFlop version of Blue Gene was in fact conceived by the folks in
>IBM Research to solve the problem that they say it was conceived for,
>i.e. protein folding. And there is a possibility that this is not
>merely a cover story to hide a conspiracy of Neanderthal programmers and
>physicists in California hell bent on maintaining an empire of
>inappropriate computer architectures.
>
>After all there is a lot, no, make that A Lot Of Money floating around
>out there in bio and genomics land. And some of that money could flow
>to purveyors of tools to the trade. And who might that be? Satan...
>no wait, just flashing back to a Church Lady episode. It could be
>someone with a million processor petaflop machine eh?
>
You bet it could be, and I hope you win. How many times do I have to
say it? I have no beef with IBM.
If IBM really wants to grab the brass ring, there are some steps IBM
could take to smooth things over, but whether IBM takes those steps or
not is a business decision because that's what IBM is, a business.
The paragraph you were kind enough to quote expresses my true opinion:
IBM will pay attention to those kinds of details whether the DoE has
the wits to or not, and, as you can imagine, I'm skeptical how much
the DoE would do if left to its own devices. Gotta get those color
plots out. The Director is briefing a subcommittee tomorrow...
>Perhaps there is an architectural silver bullet out there. Be happy to
>build it. Just break it down into chunks of less than 50 or 100 million
>gates and 1000 signals or so. Bring money. In the meantime folks will
>be using the tools they have to attack the problem.
>
Ayup. When I left the place with the big limestone dome and had no
idea what I wanted to do with myself, I did what all young physicists
who have lost their way do: I took a job carpentering.
Norm Abrams obviously never met the people I worked with. There was a
hand-held circle saw and I had my own drill, but other than the really
big stuff (jack hammers, etc.), those were the only power tools on the
site, and it was not a small job. Five stories of nineteenth century
brick, moldings everywhere, sagging floors, everything someone who
wanted to learn the trade could possibly ask for.
I can still cut and fit a miter by hand. Not even one of those
cheesy little wooden miter boxes you see at Home Depot. Just lay it
out and cut it--carefully.
RM
How much has been spent, say, on the ASCI program these past years -
I'd guess less than 1 part in 10,000 of the annual DoD budget. You don't
get many GDPs at the few-hundred-million-USD level, although I'm sure
there are some.
> And look at the sublime mastery of MPP that has resulted!
That, of course, is another matter...
I understand you to complain that "black" programs have, in the past,
often shown themselves not to be cost-effective (from what little _is_
known about them), and that secrecy in general has been used to hide
mediocrity or worse - and I would clearly support that view. Of course,
"intelligence" (in the CIA/NSA sense of the word), weapons programs
and computing are particularly susceptible because very often, there
is no way to measure or objectify success.
Whether the current crop of DoE-/DoD-funded supercomputing falls in
this category, I just don't know (insufficient data).
Jan
It is MUCH, MUCH worse on that side, true, but is not restricted to
it. It even affects mechanical engineering, incidentally.
>We have no theory of scaling. In fact, we have no theory. Not for
>hardware, not for software. The only way to find out if a design is
>right is to build it, full scale, and run the actual software you want
>to run on it, full up. I guess software and hardware of all sizes
>must come with same disclaimer about merchantability and fitness for
>any purpose whatsoever.
NO, that is NOT what I am saying!
The existence of a theory does not mean that you can answer all useful
questions (P = NP is a poor example, but will do). Modern hardware
is being pushed to the limit on performance, and that means that some
questions of consistency are unanswerable. The designers attempt to
make things consistent (with a certain level of confidence), but don't
always spot every issue. Hence the problem.
There would be no difficulty in designing a scalable architecture, at
the cost of significantly worse (3x? 5x? log(N)x? Nx?) performance.
Regards,
Nick Maclaren.
A similar fate befell the transputer, although the detailed reasons for its
failure to succeed commercially are quite complex and time-dependent.
> >This has been repeated often enough that it should be obvious. Until there
> >are the software tools available that will make it possible to extract the
> >performance of some new fangled architecture, any production quality
> >hardware will flop and kill the branch. The tools can be developed without
> >the hardware. The hardware cannot function without the tools. There is no
> >circle. From algorithms and software, hardware is designed.
> >
> Historically, it just hasn't happened that way.
Historically, I have seen numerous projects that invested a lot - possibly
even enough - on the hardware, and quite definitely not enough on the
software. Historically, a lot of such projects have been funded as a hardware
project (with a little money for software), and some as a software project
(with a little money for hardware) - but I can remember only one that
came even somewhat close to be funded as a _systems_ project: that was
Philips TriMedia (or whatever - there is an embedded family from Siemens
that sounds similar and which I always confuse with the Philips stuff) -
and look where that thing went.
Jan
Repository? Well, maybe, but think early retirement. For example,
are there any of the Santa Teresa people who understood language
run-time systems issues left?
Regards,
Nick Maclaren.
[SNIP]
> >Note that scaling effects also tend to be *extremely* application-dependent.
> >You can be running along on a given configuration with app #1 for months or
> >years on end, happy as a clam, having no problems at all, and then one day
> >the user decide to run app #2 instead, which makes a different pattern of
> >memory references: *Whammo!!*
>
> I can confirm that one, too, and not just on SGIs.
Hmm, I'm curious about how one manages such a large
shared-memory box... My experience of lots of processors
is strictly limited to Transputers which had relatively
well defined behaviour, and of course that was entirely
'single-user' with no OS... An OOO CPU taking data from
memory across the other side of the box occasionally
must be difficult to keep a lid on.
Anyone able to contribute some pearls on running such
a box ? I've seen Nick's frequent pleas for better/rants
about error handling and debugging tools... Email is
fine, no hyphens should remain in my email address if
you've decoded it right, I should warn you that I have
a big yap, so don't share anything sensitive. :)
Taking a wild stab in the dark I *guess* that partitioning
must figure quite highly in the tools of the trade. In
fact I believe that something like that must be because
you've got to contain failure somehow... Hot-plug everything
*seems* like a good idea too, as you probably don't
want to spend a few hours rebooting those big systems
every five minutes.
Cheers,
Rupert
[SNIP]
> I don't get points here for it, but that's why I'm enthusiastic about
> Itanium. I really don't care what Intel's motives are, the
> architecture raises really good and very basic issues and Intel has
> funded some good research that otherwise wouldn't have been done.
Erm, but IA-64's roadmap points towards it becoming yet
another OOO multi-core beastie. Not only that but I got
the impression it was built largely on research done by
HP folks.
[SNIP]
> The answer, of course, is that we've all known for a very long time
> that we'd have to figure out how to get lots of processors to work
> together, and nobody that matters understands that we really don't
> know how to do that correctly.
Plenty of folks have some great ideas out there, it's
a question of reading & listening. There is also a
fair amount of work that has happened *outside* of the
USA that is worthy of study (as your example of the
Earth Simulator shows).
> Is the CEO of IBM or Intel going to sit down with the Secretary of
> Energy and patiently explain to him that, when you get right down to
> it, it's a matter of luck that computers with more than one processor
> don't fail at critical moments, because the little bit of theory we do
> have to address the issue is rarely used?
There is theory though if you go out and look for it,
plenty of it in fact. I recall wading through piles
of CSP papers when I was trying to deepen my knowledge
about how to apply OCCAM effectively. Some of those
papers helped me develop models of networks and
applications. I'd use those models to identify weak
points and knotty problems in a design. They were
largely informal, but they were very thoroughly based
on my limited grasp of theory.
CSP is a relatively small player in that space too,
I'm sure there are lots of other papers out there for
the more popular styles of MPP.
Cheers,
Rupert
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:jso7tvo6olbsssech...@4ax.com...
>
>[SNIP]
>
>> I don't get points here for it, but that's why I'm enthusiastic about
>> Itanium. I really don't care what Intel's motives are, the
>> architecture raises really good and very basic issues and Intel has
>> funded some good research that otherwise wouldn't have been done.
>
>Erm, but IA-64's roadmap points towards it becoming yet
>another OOO multi-core beastie.
Jan Vorbrüggen elsewhere commented that hardware and software were
usually developed independently of one another. If Itanium has done
nothing else, it has made the compiler part of Intel's processor
design. That strategy has already paid off where it counts the most:
with P4/Xeon performance.
Itanium will not wind up looking like just another CISC/RISC
processor, no matter what happens with OOO.
>Not only that but I got
>the impression it was built largely on research done by
>HP folks.
>
Dunno about the responsibility for the original design (and nobody
seems eager to take responsibility for that at this moment) but the
names on the interesting papers that are coming out are from Intel.
>[SNIP]
>
>> The answer, of course, is that we've all known for a very long time
>> that we'd have to figure out how to get lots of processors to work
>> together, and nobody that matters understands that we really don't
>> know how to do that correctly.
>
>Plenty of folks have some great ideas out there, it's
>a question of reading & listening. There is also a
>fair amount of work that has happened *outside* of the
>USA that is worthy of study (as your example of the
>Earth Simulator shows).
>
I was wondering when someone was going to take exception to the fact
that I was taking a fairly parochial (US) point of view. The
consequences of the US losing its position of world leadership in
technology would be far-reaching, and I don't think that Europe in
particular would be happy with some of them. It will happen sooner or
later, but I don't think the world, never mind the US, is quite ready
for such a development.
<snip>
>
>There is theory though if you go out and look for it,
>plenty of it in fact. I recall wading through piles
>of CSP papers when I was trying to deepen by knowledge
>about how to apply OCCAM effectively. Some of those
>papers helped me develop models of networks and
>applications. I'd use those models to identify weak
>points and knotty problems in a design. They were
>largely informal, but they were very throughly based
>on my limited grasp of theory.
>
>CSP is a relatively small player in that space too,
>I'm sure there are lots of other papers out there for
>the more popular styles of MPP.
>
Right. We need to get this theory out into the field. People write
papers and go to conferences, and the people out in the field just go
on writing in C and Fortran, oblivious to what's available. It's
irresponsible for the people with the purse-strings not to put at
least some money into trying to wrestle all this stuff into shape so
that it can be and actually is used.
RM
The SGI boxes are cache-coherent (ccNUMA), with the user seeing
the same "sequential consistency" model as classic SMP, so that
to a first approximation the user programmer deals with the system
the way you would with any SMP multiprocessor.
It's the 2nd- and 3rd-order effects that get you: access patterns
that stress the hardware, the operating system's locks, the I/O, etc.
So when scaling a problem to more & more processors, you also have to
be on the lookout for the *next* problem that's going to bite you once
you've fixed up your current scalability issue.
+---------------
| An OOO CPU taking data from memory across the other side of
| the box occasionally must be difficult to keep a lid on.
+---------------
Not sure exactly what you're suggesting by "keep a lid on",
since a ccNUMA system is SMP-like cache-coherent, but yes.
Speculative loads can cause excessive memory network traffic
(cache-coherency protocol traffic, basically) if data isn't
placed "close" to the CPUs that are most-frequently accessing
them. This can cause both non-fatal but unexpected performance
bottlenecks, as well as perhaps stress the hardware (the coherency
system) past some as-yet-untested boundary and cause crashes.
+---------------
| Taking a wild stab in the dark I *guess* that partitioning
| must figure quite highly in the tools of the trade. In
| fact I believe that something like that must be because
| you've got to contain failure somehow...
+---------------
Unfortunately, in the case of ccNUMA, you *can't* simply re-partition
on failure -- at least, not without forcing a reboot in the process --
since the cache-coherency is distributed throughout the system. That is,
the failing CPU module you're trying to partition out of the system may
very well be holding in a "write exclusive" state cache lines the rest
of the system needs to continue functioning. (Oops!)
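A toy illustration of why (the states and fields here are invented, not
SGI's actual directory format):

/* Toy directory-protocol sketch: why you can't just drop a dead node
 * out of a ccNUMA box.  If the dead node owns a line EXCLUSIVE, the
 * directory has no valid copy left to hand to anyone else. */
#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };

struct dir_entry {
    enum state st;
    int        owner;        /* node holding the line if EXCLUSIVE */
};

static int can_partition_out(int dead_node,
                             const struct dir_entry *dir, int nlines)
{
    for (int i = 0; i < nlines; i++)
        if (dir[i].st == EXCLUSIVE && dir[i].owner == dead_node)
            return 0;        /* only valid copy is behind the dead node */
    return 1;
}

int main(void)
{
    struct dir_entry dir[3] = {
        { SHARED,    -1 },
        { EXCLUSIVE,  7 },   /* node 7 has dirtied this line */
        { INVALID,   -1 },
    };
    printf("can drop node 7 without a reboot? %s\n",
           can_partition_out(7, dir, 3) ? "yes" : "no");
    return 0;
}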
+---------------
| Hot-plug everything *seems* like a good idea too...
+---------------
Yup. But hot-UNplug (and thus, replacement after failure) only works
if the device is alive enough to successfully quiesce it -- and with
devices based on commodity ICs, there are, sadly, many *MANY* device
failure modes that leave one no choice but to reboot.
+---------------
| ...as you probably don't want to spend a few hours rebooting
| those big systems every five minutes.
+---------------
Except that with a (semi-)log-structured filesystem such as XFS
[or others with the same crash/reboot properties], and with an O/S
[such as Irix] that, after some small initial single-CPU checking,
boots itself in parallel, it only *takes* about five minutes or so
to reboot the system.
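The arithmetic behind that, with made-up numbers: recovery on a
journaled or log-structured filesystem is bounded by the size of the
log, not the size of the filesystem:

/* Why the journal matters for reboot time -- all numbers invented
 * for illustration. */
#include <stdio.h>

int main(void)
{
    double fs_bytes      = 4e12;     /* a few TB of spinning dust        */
    double scan_rate     = 30e6;     /* full metadata walk, bytes/s, say */
    double meta_fraction = 0.01;     /* fraction of fs that is metadata  */
    double log_bytes     = 128e6;    /* journal size, say                */
    double log_rate      = 20e6;     /* sequential-ish replay, bytes/s   */

    printf("full fsck-style scan : ~%.0f minutes\n",
           fs_bytes * meta_fraction / scan_rate / 60.0);
    printf("journal replay       : ~%.0f seconds\n",
           log_bytes / log_rate);
    return 0;
}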
[SNIP]
> +---------------
> | An OOO CPU taking data from memory across the other side of
> | the box occasionally must be difficult to keep a lid on.
> +---------------
>
> Not sure exactly what you're suggesting by "keep a lid on",
I was thinking that it's a very complex problem that demands
attention to detail. The potential failure modes and contention
issues strike me as being a big challenge to predict, diagnose
and resolve. One of the things I liked about CSP was that a
tiny little brain like mine could - at some level - understand
the likely performance hits & failure modes (eg: livelock/deadlocks).
Doing that kind of thing for current SMP boxes gives me
the willies. :)
[SNIP]
> +---------------
> | Taking a wild stab in the dark I *guess* that partitioning
> | must figure quite highly in the tools of the trade. In
> | fact I believe that something like that must be because
> | you've got to contain failure somehow...
> +---------------
>
> Unfortunately, in the case of ccNUMA, you *can't* simply re-partition
> on failure -- at least, not without forcing a reboot in the process --
:(
I guess what I was imagining was some way of slicing the machine
up so it isn't one big cache-coherent box, but rather a box with
several cache-coherency domains... If that makes sense... Hence
if a fault occurs in a domain you only have to reboot that one
domain, not the whole shebang... If you have that facility I
guess a "nice to have" would be adding & removing CPUs from CC
domains.
> since the cache-coherency is distributed throughout the system. That is,
> the failing CPU module you're trying to partition out of the system may
> very well be holding in a "write exclusive" state cache lines the rest
> of the system needs to continue functioning. (Oops!)
Yipe, but not entirely unexpected ! :)
> +---------------
> | Hot-plug everything *seems* like a good idea too...
> +---------------
>
> Yup. But hot-UNplug (and thus, replacement after failure) only works
> if the device is alive enough to successfuly quiesce it -- and with
> devices based on commodity ICs, there are, sadly, many *MANY* device
> failure modes that leave one no choice but to reboot.
I see this on desktops... It's distressing to say the least. :)
> +---------------
> | ...as you probably don't want to spend a few hours rebooting
> | those big systems every five minutes.
> +---------------
>
> Except that with a (semi-)log-structured filesystem such as XFS
> [or others with the same crash/reboot properties], and with an O/S
> [such as Irix] that, after some small initial single-CPU checking,
> boots itself in parallel, it only *takes* about five minutes or so
> to reboot the system.
Yeah, I figured that would be the case. Is that O(1) or O(n) ?
Thanks for your feedback. :)
Cheers,
Rupert
The word "only" suggests you think 5 minutes is good! It's this
kind of thinking that makes people accept horrible boot times. I
keep hearing "we must get/keep boot times to single digit numbers",
and I agree -- but I see seconds where most people see minutes.
You got one thing right, though: the importance of a safe file system.
But even there I don't understand why few systems take advantage of
this to avoid shutdown completely (except for courtesy reminders to
users in case of a planned shutdown of course, to give them a chance
to commit what they are prepared to commit at that point).
I find it positively perverse that boot times have been increasing
as machines have gotten faster...
Michel.
P.S. I think the trend has finally been reversed. For one thing,
hardware improvements have been such that even the grossest
software bloat has been unable to keep up with it.
Oh, yeah? You underestimate the ingenuity of the software
community.
Regards,
Nick Maclaren.
I think it would be a great boot time for a 512 node machine
with a few TB of spinning dust. Whether it's *fast enough*
that is another question... :)
Cheers,
Rupert
or the non-ingenuity ... it seems as if the view is that you don't
have to worry about performance issues because the machines are so
fast ... so they default to non-linear (much greater than linear)
solutions. then you just have to come up with explanations ... like it
is really good that it takes so long ... because it is doing so much
for you.
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Internet trivia 20th anv http://www.garlic.com/~lynn/rfcietff.htm
Definitely.
One example:
Some people tout faster startup times as a good reason for switching to
WinXP from Win2K.
It turns out that XP does a little bit more than 2K, probably to keep
those boot times significant; one example is that it tries to check the
certificate of any AD server it is logging into.
Sounds reasonable, right?
So, what happens when you add Symantec's PC Firewall to the mixture:
The FW sw notes that you're trying to use a certificate, and decides
that it would be a good idea to make sure that said certificate hasn't
been revoked.
Still reasonable, right?
So, how do you check for revocation? By contacting Verisign's
Certificate Revocation List server (crl.verisign.com)!
Which port should you use to do this? Let's use port 80, since everyone
uses that one already, it is probably open!
Except that in any corporation large enough to care about these things,
web access is normally funneled through a proxy server, and any attempts
to bypass it are simply dropped.
Networking people can probably smell where this is going to end:
The CRL check tries and retries to set up a direct connection, finally
timing out for good after 1.5 to 2 minutes, before going on.
OTOH, when the same PC is started at home, on some ADSL broadband
connection, the CRL server _is_ available, but since the AD server
isn't, the login process never tries to use/verify said certificate anyway.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
with a little ingenuity ... there are possibly two different issues,
the time that it takes to respond to the person waiting ... and the
time it takes to finalize stuff with a few TB of spinning dust.
not because of filesystem issues ... but there was an extremely
painful and highly visible issue long ago and far away when a major
portion of the credit card authorization infrastructure was out for 18
minutes ... just about when lunch was ending on the east coast (in
this case it was due to a phone company burp, and a large majority of
all those little POS terminals weren't able to complete their call).
old archeological reference about the motivation to improve filesystem
recovery from several hours (somewhat) because there was a demonstration
of an implementation that didn't take several hours. minor multics
lore reference:
http://www.garlic.com/~lynn/99.html#53 Internet and/or ARPANET?
http://www.garlic.com/~lynn/2001g.html#52 Compaq kills Alpha
http://www.garlic.com/~lynn/2002b.html#62 TOPS-10 logins (Was Re: HP-2000F - want to know more about it)
http://www.garlic.com/~lynn/2003l.html#17 how long does (or did) it take to boot a timesharing system?
... this whole thread is somewhat deja vu for some
You seem to be implying that boot time is in some sense dependent on the
number of processors and the amount of disk space. Let's examine that.
For the number of processors, I don't get this at all. Virtually all of the
hardware integrity checking can be done in parallel on multiple processors,
as can any processor specific initialization. Yes, there is a need to start
on one then "fan out", but the fan out number should be pretty high.
For the disk space, I think what you are really saying is not the number of
disks, for which the same parallel arguments as above apply, but the number
of files and directories. So we get down to why it should take 5 minutes to
integrity check several TB of files. ISTM that this is a file system design
problem, and there is no reason why it has to take that long. That is,
systems could be designed to be recovered in much less than 5 minutes if
that were stated as an explicit goal. I think it just hasn't been done for
"typical" systems, though as several people have pointed out, it has been
for some.
Am I wrong about any of this?
> Whether it's *fast enough*
> that is another question... :)
Precisely. My point is that it is a marketing question, not a technical
one. Such systems could be designed to recover well under five minutes if
there were a sufficient requirement for it.
--
- Stephen Fuld
e-mail address disguised to prevent spam
When I see a digital camera or a camcorder booting, I want single digits -
after the decimal point, i.e. tenths of a second. I can understand that it
takes some time to spin up some hardware (disk, tape head), but I don't
understand that it takes so much time to boot the software. Here, I do know
better - my analog camera does not have a boot time, it's just on or off.
It's also quicker to take a picture, even though it has to lift a
mechanical mirror before it can expose the film.
People usually accept inconveniences because they don't know better. I
wonder why this works here. Analog cameras don't have boot times - why do
people accept them from digital cameras? Home computers didn't have boot
times - your Atari ST or your C64 were up and running within a second or
two after pressing the switch.
A hard disk is spinning up within a few seconds (it usually is already up
and running when the BIOS beeps). A freshly booted OS on a modern PC is
transferred within a second or two - most of the RAM is free at that point.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
OK, let's do some math:
If I want to boot into a fully functional environment, then I pretty
much have to use some form of checkpointing (i.e. 'hibernation'), so that
I can stream everything back in from a single sequential file:
With my current laptop I have 512 MB of RAM, and a 2.5" IDE hard disk
with a transfer rate of about 40 MB/s, which translates into about 13
seconds to roll everything back in.
For my next machine I'm going to have 2 GB of RAM, and about the same
speed disk: This means that there's no way to restart a full environment
in much less than about a minute, unless checkpoint file compression can
increase the transfer rate.
Getting the time much below this requires some way of reducing the
amount of state that must be recovered.
OTOH, simply booting a working OS from scratch _should_ be doable in a
couple of seconds, five at the outside. :-(
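As a quick sanity check on those numbers, a back-of-the-envelope sketch
in Python (it assumes the 40 MB/s sequential rate above and ignores seek
time and any compression):
# Time to stream a hibernation image back in from one sequential file.
# Assumes a purely sequential read at 40 MB/s; no compression, no seeks.
def restore_seconds(ram_mb, disk_mb_per_s=40):
    return ram_mb / disk_mb_per_s
print(restore_seconds(512))    # ~12.8 s for the 512 MB laptop
print(restore_seconds(2048))   # ~51.2 s for 2 GB, i.e. close to a minute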
If you have a single system image there are *bound* to be
issues with synchronising everything up. This good stuff
does not happen for free in the real world. 512 nodes with
a ton of wiring and "things" hanging off busses will take
time, parallel or not. Hence my question about whether it
was O(1) or O(n) etc...
> For the number of processors, I don't get this at all. Virtually all of the
> hardware integrity checking can be done in parallel on multiple processors,
> as can any processor specific initialization. Yes, there is a need to start
> on one then "fan out", but the fan out number should be pretty high.
> For the disk space, I think what you are really saying is not the number of
> disks, for which the same parallel arguments as above apply, but the number
> of files and directories. So we get down to why it should take 5 minutes to
> integrity check several TB of files. ISTM that this is a file system design
> problem, and there is no reason why it has to take that long. That is,
[SNIP]
> Am I wrong about any of this?
No idea mate. The real world tends to throw up difficult
and unexpected gotchas, as Rob confirmed. That's one of the
reasons why I wished that I was a fly on the wall at the
BG/L design meetings. :)
Cheers,
Rupert
> With my current laptop I have 512 MB of RAM, and a 2.5" IDE hard disk
> with a transfer rate of about 40 MB/s, which translates into about 13
> seconds to roll everything back in.
> For my next machine I'm going to have 2 GB of RAM, and about the same
> speed disk: This means that there's no way to restart a full
> environment in much less than about a minute, unless checkpoint file
> compression can increase the transfer rate.
You probably don't need to load all 2Gb to be up and running -- the
system could page in whatever's necessary to get the working sets of
your applications, and page in the rest later -- perhaps just using
normal VM functions. (Disk cache would be a special case of this,
which in all probability wouldn't be stored at all when hibernating.
With 2Gb, I think the normal case would be to have quite a bit of
cache.)
But yeah, disks are too slow, and/or memory is growing too large. And
I rather hope for faster disk, rather than less memory :-)
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
I don't see why a REboot of 512 nodes should take longer than a reboot
of one -- each node does its own, and I would hope they can signal each
other to do their thing in milliseconds, not minutes. This being reboot,
platters are spinning and hardware is warm.
Of course, one of the problems is often that warm hardware has to "cool
down" to be properly reset for a fresh boot -- when the only way to reach
a known state from an unknown state is a power-cycle, for example.
Btw, I have seen a machine that does a COLD boot in 14 seconds, of which
the 1st 10 seconds are indeed SCSI spin-up. (Reboot time is 3 seconds
up to the point where the network is connected and the user is looking
at new mail in his favourite editor. That was in 1997.)
Michel.
>If I want to boot into a fully functional environment, then I pretty
>much have to use some form of checkpointing (i.e. 'hibernation'), so that
>I can stream everything back in from a single sequential file:
Right -- that's Resume, not Reboot. You are trying to restore an active
environment, with several open applications, each having a large state.
Btw, I do interpret Reboot as reaching a "fully functional environment" --
an initially clear environment, however. That's why I described the
initial state I had in mind: looking at recent mail, headers already
downloaded, but contents still at the remote mail server, so I can
select what I want to do, and do it right away.
>
>OTOH, simply booting a working OS from scratch _should_ be doable in a
>couple of seconds, five at the outside. :-(
Exactly my point. And (for Bernd's environment) with the OS and primary
application in flash memory, a small fraction of a second should be enough
to reach operational state.
Btw, hibernate and resume are fine -- but I want reboot too, to restore a
clean environment from a hopelessly messed-up one.
Michel.
You better have smart block-paging then -- page-faulting through one at
a time could be a painfully slow wakeup... I guess that's what you had
in mind with "get the working sets". This is critically important!
Michel.
I do. But there is no justification for it taking MUCH longer than
for one! There are two good reasons why it should take longer (though
most of the actual ones are bad):
1) Checking the interconnect is an O(N^2) problem, and therefore
will take O(N) time, where N is the number of nodes. You are going
to check every path, aren't you?
2) You want one node to boot up to a stable state for producing
diagnostics before starting the others, or else you have the problem
of trying to diagnose issues using an unreliable base.
I was speaking about this to a colleague, and we agreed that the best
approach was the Linux one, taken further. This is close to the way
IRIX does it, too, and it was probably invented in the 1950s or 1960s.
The initial and recovery boot processes are the same, and are taken
from PROM, floppy or a fixed location on a fixed disk. You get up to
a state that allows line mode root login (and I mean line mode, not
curses capable), at least from the console and via the preferred
networking (SSH, in the case of sane sites). This uses just enough
memory for such use, a single CPU, no disks (unless for loading),
but allows access to such peripherals.
That SHOULD be fast, as the environment can be preconfigured, and
there is very little of it. 1-5 MB of image etc. and (say) 16 MB of
memory, a very simple keyboard and screen driver (no mouse), and a
networking driver. Drivers for other peripherals CAN be loaded, but
could be loaded on demand.
A full boot uses that as a basis to start up everything else. CPUs,
disks etc. can be started in parallel, and you have a proper system
to control their serialisation. With multiple CPUs and secondary
disks (usually the large filesystems), you can check them in parallel
with doing the actual boot, and attach them as they become ready.
Yes, that needs a slight extension to the facilities in /etc/init.d/.
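Something like this toy sketch, in Python rather than real init scripts,
with made-up resource names (it only illustrates the staging, not any
actual facility): a small serial core comes up first, then the remaining
CPUs and secondary filesystems are checked and attached in parallel as
they become ready.
# Sketch of a two-stage boot: tiny serial core, then parallel bring-up.
# The resource names and checks below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import time
def minimal_boot():
    # stage 1: preconfigured, serial, small: one CPU, console, networking
    print("minimal environment up: line-mode root login available")
def check_and_attach(resource):
    time.sleep(0.1)   # stands in for fsck / CPU init / device probe
    print(resource, "checked and attached")
def full_boot(resources):
    minimal_boot()
    # stage 2: everything else checked and attached in parallel
    with ThreadPoolExecutor() as pool:
        list(pool.map(check_and_attach, resources))
    print("full system up")
full_boot(["cpu1", "cpu2", "cpu3", "/dev/sdb", "/dev/sdc", "/bigfs"])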
Regards,
Nick Maclaren.
>I do. But there is no justification for it taking MUCH longer than
>for one!
Oh, I agree. In large systems there may also be a lot of genuine work to
be done before the system can be used in full production: mounting lots of
file systems, warming up database caches, etc. Even then it is useful to
reach a minimally useful state quickly (as you described).
What I object to is the large amount of *unnecessary* work that is often
done at boot time, namely a complete "sysgen", which is really a very large
constant propagation. Configuration checking is ok (and necessary, and ought
to be fast), but configuration exploration should not be. That should be
reserved for an explicit configuration boot ("install", or "configuration was
changed"). On top of that one often sees page-at-a-time bringup (hence the
excessive disk rattling heard on most booting PCs). I think Microsoft learned
this lesson and now groups bringup files for noticeably faster boot times.
Michel.
No. Using the paging functions would totally destroy the streaming
performance of the disk. As a first estimate, I'd say you can afford to
read at least 10X the amount of data as a single streaming read, vs.
paging in just what's needed.
> your applications, and page in the rest later -- perhaps just using
> normal VM functions. (Disk cache would be a special case of this,
> which in all probablility wouldn't be stored at all when hibernating.
> With 2Gb, I think the normal case would be to have quite a bit of
> cache.)
Normal, yes. Always? Far from it!
I often edit big to huge multilayer panoramas in PhotoShop, in which
case all available ram is indeed in use for editing.
>
> But yeah, disks are too slow, and/or memory is growing too large. And
> I rather hope for faster disk, rather than less memory :-)
Yeah! :-)
> ke...@ii.uib.no wrote:
>> You probably don't need to load all 2Gb to be up and running -- the
>> system could page in whatever's necessary to get the working sets of
> No. Using the paging functions would totally destroy the streaming
> performance of the disk. As a first estimate, I'd say you can afford
> at least 10X the amount of data as a single streaming read, vs. paging
> in just what's needed.
Right. 10 ms seek at 40 MB/s is 400 KB. A factor of 100 if you do 4 KB
pages one seek at a time :-)
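Spelling that out as a sketch (it assumes a 10 ms average seek and
40 MB/s streaming, nothing more):
# Effective throughput when paging 4 KB per seek vs. one streaming read.
seek_s, stream_mb_s = 0.010, 40.0
page_mb = 4.0 / 1024
paged_mb_s = page_mb / (seek_s + page_mb / stream_mb_s)
print(f"paged: {paged_mb_s:.2f} MB/s, streaming: {stream_mb_s} MB/s, "
      f"ratio ~{stream_mb_s / paged_mb_s:.0f}x")   # roughly the factor of 100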
But I was thinking that the system has some idea about the working
sets of applications anyway (after all, it has to be able to choose
pages to page out when necessary), and that it could virtually page
out as much as possible to one chunk, and store the remaining (active
working set) in a separate chunk. I've no idea how well this would
work, or how practical it would be. Given the state of Linux' paging
performance on my laptop, I'm not overly optimistic. And you're
probably most interested in getting on with your 2Gb PS edit, which
means you have no alternative but to wait for it all to load :-(
Faster disks it is, then. When can we have them?
I very much doubt that 4k pages are adequate today. How fast would Linux
swap if it had half-megabyte pages (as the rule of thumb indicates)? I
don't mean half a megabyte for the smallest VM allocation granularity
(there, 4k is nice to have), but as the typical swap write granularity.
Photoshop and the GIMP have larger tiles, and write them out into files -
because this is faster than using the host OS paging system. This is a
perverted situation, since the host OS should have less overhead to swap
pages into a swap partition than a user program swapping tiles into a file.
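One way to get that half-megabyte figure, as a sketch: take the rule of
thumb to be the chunk size at which transfer time matches positioning
time, and assume a 10 ms seek and the 40 MB/s streaming rate quoted
earlier.
# Break-even swap granularity: chunk size where transfer time ~= seek time.
seek_s, stream_mb_s = 0.010, 40.0
breakeven_kb = seek_s * stream_mb_s * 1024
print(f"break-even chunk ~{breakeven_kb:.0f} KB")   # ~410 KB, about half a MB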
> When you say "In fact, we have no theory", that's not a
> bad starting point. It helps to explain why there's the
> million dollar question: does P equal NP?
Yes, it does, for sufficiently small values of N. ;)
> I very much doubt that the 4k pages are adequate today. How fast would Linux
> swap if it had half megabyte pages (as the rule of thumb indicates)? I
> don't mean half a megabyte for the smallest VM allocation granularity
> (there, 4k is nice to have), but as typical swap write granularity.
> Photoshop and the GIMP have larger tiles, and write them out into files -
> because this is faster than using the host OS paging system. This is a
> perverted situation, since the host OS should have less overhead to swap
> pages into a swap partition than a user program swapping tiles into a file.
I'd expect most OSs today page by doing fairly large sequential reads
(prefetch) and writes (gather multiple pages for one write).
Larger pages may help. But let's face it, nobody uses paging today. All
paging is good for is to tell your performance monitor that you need
more RAM, without actually killing systems. Larger pages could still
relieve TLB related performance hits though.
IMHO paging needs to die.
Cheers
Martin
Yes, but whoever tried producing that probably got shot or went out
of business a long time ago. Can you think of any real system that
doesn't use page groups for paging?
[snip]
>
> Faster disks it is, then. When can we have them?
>
You can buy solid state if you want lots of speed.
> -kzm
--
Sander
+++ Out of cheese error +++
This is what I meant by good block-paging in a previous post. The technique
was introduced (afaik) in VM/HPO in the mid-1980s, and is what gives IBM's
VM timesharing system subsecond response times for thousands of concurrent
interactive users on one mainframe.
Michel.
It was reinvented by IBM then, true, but the technique was at least
15 years older. IBM was VERY late learning about virtual memory.
Regards,
Nick Maclaren.
"big pages" for MVS & then VM ... early 1980s.
basically, real storage was 4k pages .... but a 3380 track-size (40k,
10 pages) was defined for transfers. ten pages from a virtual address
space were aggregated into a big page for writing. as the program
reference pattern changed ... the membership in a big page could also
change (i.e. membership in the same big page tended to be ten virtual
pages that were all referenced in the same recent interval). a fault
on any 4k page in any big page ... would fetch all members of the big
page.
no home location on disk was preserved ... so any selection of pages
for replacement involved forcing a write of the pages (even if they
hadn't been changed during the most recent stay in memory)
basically 3380 (compared to 3330) increased transfer rate by about a
factor of ten, but only increased arm access rate by possibly a factor
of three. big pages tended to trade off the extra transfer rate
resource against the number of transfers aka big pages might tend to
double the overall number of pages transferred, but significantly
decreased the number of transfers. note that some of the really
impressive (4k) paging rates for the big page systems are because of
the increase in transfers caused by the big page methodology
(including the increase in writes caused by not keeping a home
location for non-changed pages).
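A toy illustration of the aggregation idea (this is only a sketch of
grouping the most recently referenced 4k pages ten at a time, not the
actual MVS/VM code):
# Toy model of "big page" aggregation: the ten most recently referenced
# 4k pages go out together as one 40k transfer, and a fault on any member
# would fetch the whole group back.  Purely illustrative.
from collections import OrderedDict
PAGES_PER_BIG_PAGE = 10
def build_big_pages(reference_trace):
    recency = OrderedDict()
    for vpn in reference_trace:      # the most recent reference wins
        recency.pop(vpn, None)
        recency[vpn] = True
    recent_first = list(reversed(recency))
    return [recent_first[i:i + PAGES_PER_BIG_PAGE]
            for i in range(0, len(recent_first), PAGES_PER_BIG_PAGE)]
trace = [1, 2, 3, 7, 8, 2, 9, 15, 16, 3, 42, 43, 44, 7, 8]
for bp in build_big_pages(trace):
    print("big page:", bp)   # one disk transfer per group, not per page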
misc. past discussions of big pages ... include some references to the
observation that over a 15-year period relative system disk
thruput had declined by possibly a factor of ten aka
processor & memory performance increased by a factor of fifty, while
disk access thruput only increased by possibly a factor of five:
http://www.garlic.com/~lynn/2002c.html#29 Page size (was: VAX, M68K complex instructions)
http://www.garlic.com/~lynn/2002c.html#48 Swapper was Re: History of Login Names
http://www.garlic.com/~lynn/2002e.html#8 What are some impressive page rates?
http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures
http://www.garlic.com/~lynn/2002l.html#36 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002m.html#4 Handling variable page sizes?
http://www.garlic.com/~lynn/2003b.html#69 Disk drives as commodities. Was Re: Yamhill
http://www.garlic.com/~lynn/2003d.html#21 PDP10 and RISC
http://www.garlic.com/~lynn/2003f.html#5 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#9 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#16 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#48 Alpha performance, why?
http://www.garlic.com/~lynn/2003g.html#12 Page Table - per OS/Process
Look at the following piece of creative use of hibernation:
http://www.vci.com/products/instant_on/ReadyOn_Overview.pdf
Could be a step in the direction you're pursuing.
snip
> Faster disks it is, then. When can we have them?
Two answers. First, disk transfer rate gets faster every year due to
increased areal density, so you get some increase by just going to a newer
disk, and of course, if the highest transfer rate is your main concern, you
could choose a disk with that parameter optimized, at the cost of, say, higher
power usage or more dollars outlay.
Second, you can get as high a transfer rate as you want by striping the data
across N disks. Yes, cost increases linearly with the number of disks (but
so does capacity), but if you want high transfer rate, you can easily get it.
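As a rough illustration, a sketch that assumes ideal scaling, no
controller or bus bottleneck, and the 40 MB/s per spindle figure quoted
earlier in the thread:
# Idealized effect of striping a 2 GB hibernation image across N disks.
image_mb, per_disk_mb_s = 2048, 40
for n in (1, 2, 4, 8):
    print(f"{n} disk(s): {image_mb / (n * per_disk_mb_s):.0f} s to read it back")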
> <ke...@ii.uib.no> wrote in message news:egn0a0s...@havengel.ii.uib.no...
>
> snip
>
>> Faster disks it is, then. When can we have them?
[snip]
> Second, You can get as high a transfer rate as you want by striping the data
> across N disks. Yes, cost increases linearly with the number of disks (but
> so does capacity), but if you want high transfer rate, you can easily get
> it.
I was under the same impression about striping until I read the recent Sun
BluePrint "Solaris Volume Manager Performance Best Practices" by Glen
P. Fawcett (http://www.sun.com/solutions/blueprints/1103/817-4368.html).
OK, it addresses a particular software product (Solaris Volume Manager) and
some people say it is an advertisement for hardware RAID products, but here
is what it says on page 3:
"The main problem with Solaris Volume Manager striping is split I/O.
Splitting of an individual I/O operation to more than one disk degrades
performance. There are several ways of avoiding this:
* Use hardware striping whenever possible. This prevents SVM software or
the OS from having to split an I/O operation.
* Increase the stripe width to lessen the frequency of splitting an I/O
operation.
etc etc."
My impression is that striping is done in order to split I/O across
several disks, increasing the transfer rate.
Bye, Dragan
--
Dragan Cvetkovic,
To be or not to be is true. G. Boole No it isn't. L. E. J. Brouwer
!!! Sender/From address is bogus. Use reply-to one !!!
(snip)
> "big pages" for MVS & then VM ... early 1980s.
> basically, real storage was 4k pages .... but a 3380 track-size (40k,
> 10 pages) was defined for transfers. ten pages from a virtual address
> space were aggregated into a big page for writing. as the program
> reference pattern changed ... the membership in a big page could also
> change (i.e. membership in the same big page tended to be ten virtual
> pages that were all referenced in the same recent interval). a fault
> on any 4k page in any big page ... would fetch all members of the big
> page.
If I understand it, this is what MVS calls swapping, instead of paging?
MVS 3.8j has both swap disks and paging disks, and I wasn't completely
sure what to do with each. I believe that both are required.
> no home location on disk was preserved ... so any selection of pages
> for replacement involved forcing a write of the pages (even if they
> hadn't been changed during the most recent stay in memory)
>
> basically 3380 (compared to 3330) increased transfer rate by about a
> factor of ten, but only increased arm access rate by possibly factor
> of three. big pages tended to trade-off the extra transfer rate
> resource against number of transfers aka big pages might tend to
> double the overall number of pages transferred, but significantly
> decreased the number of transfers. note that some of the really
> impressive (4k) paging rates for the big page systems is because of
> the increases in transfers caused by the big page methodology
> (including the increase in writes caused by not keeping a home
> location for non-changed pages).
Does it use RPS to start the transfer at the next block on the disk,
even if it is not at the beginning of the track?
Also, how does that compare to the 2305?
-- glen
I believe that's just a response to people being silly and using
stripe sizes smaller than the logical block size of the Solaris
UFS filesystem (8k). You just need to
match or exceed the filesystem block size with the RAID striping
(so, 8k, 16k, 32k, 64k etc) or else you're wasting time.
Ideally, 1 IO at the filesystem level equals one IO in the
volume manager... If the volume manager has to do 2-8 IOs per
filesystem IO then things slow down.
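Roughly like this (a sketch that ignores alignment luck and just counts
how many volume-manager I/Os one 8k filesystem I/O becomes):
# Volume-manager I/Os generated per 8 KB filesystem I/O, by stripe unit size.
import math
fs_block_kb = 8
for stripe_kb in (2, 4, 8, 16, 64):
    ios = math.ceil(fs_block_kb / stripe_kb)
    print(f"stripe unit {stripe_kb:>2} KB -> {ios} I/O(s) per filesystem I/O")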
-george william herbert
gher...@retro.com
> Yes, but whoever tried producing that probably got shot or went out
> of business long time ago. Can you think of any real system that
> doesn't use page groups for paging?
I don't really know any system on that level of detail. My Red Hat
laptop is incredibly slow when memory gets exhausted (which happens a
lot when I run some of my own stuff). After my app dies, switching to
another application takes a long time (many seconds, maybe tens of).
I've ascribed it to fragmented paging since ten seconds should be
close to enough to page in the complete memory (512Mb), but if anybody
has a better explanation, I'm all ears.
(Another annoyance is that even after a long period of idleness,
applications aren't paged in. It would be neat to have the memory
repopulated stealthily when the computer is idle, but I guess it's
hard to predict what would be the most useful targets. An inverse
LRU, perhaps? Too much bookkeeping?)
>> Faster disks it is, then. When can we have them?
> You can buy solid state if you want lots of speed.
Hmm...2Gb flash sounds rather expensive. I thought modern RAM could
be kept sleeping with very low power consumption. Perhaps a laptop
could keep (a part of) its memory sleeping while hibernating?
I believe that it always used RPS to start at the same record position
for transfers. There had been some discussion of using CKD search
non-equal (aka low or high rather than equal) which has been used for
database logging doing full track writes ... which would start
transfer at the first record that came under the head. then on read,
it would do a similar search ... and then be able to reconstruct the
ordering by additional information (in logging, it could be included
in the data itself, for paging, it might have to be a read key
operation to figure out what record position it was at).
There may have been some later implementation that used the trick of
starting the transfer at the first encountered record.
as to MVS paging & swapping. Swapping has tended to be an operation
that transferred all pages for a process at some scheduling event
(either all pages in or all pages out). This is a distinct type of
scheduling event independent of the page fault paradigm. The swapping
logic can be independent of the disk transfer paradigm.
I believe that MVS paging was single 4k transfers. I believe that
swapping could be both the logical scheduling operation ... as well as
the disk area for big pages.
The disk area for big pages (if it is referred to as swapping) has
typically been defined as a disk space area ten times larger than would
actually be occupied by allocated pages. Allocation tended to be a
(slowly) moving cursor where the disk space in front of the cursor is
(mostly) empty. That guarantees minimum arm movement for writes.
Faults tended to be for big pages in the trailing area behind the cursor.
When a big page was read, the corresponding disk space was always
deallocated (which helps keep future activity in the region of the
cursor position ... and causes some increase in the number of pages that
have to be written back to disk).
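A toy sketch of that moving-cursor allocation (purely illustrative,
nothing like the real implementation):
# Toy "moving cursor" allocator: writes land just ahead of a cursor sweeping
# a mostly-empty region (little arm motion); a slot is freed when its big
# page is read back in, since no home location is kept.
class CursorAllocator:
    def __init__(self, nslots):
        self.nslots, self.cursor, self.used = nslots, 0, set()
    def write(self):
        while self.cursor in self.used:              # skip occupied slots
            self.cursor = (self.cursor + 1) % self.nslots
        slot = self.cursor
        self.used.add(slot)
        self.cursor = (self.cursor + 1) % self.nslots
        return slot
    def read(self, slot):
        self.used.discard(slot)                      # deallocate on read
alloc = CursorAllocator(100)
print([alloc.write() for _ in range(5)])   # [0, 1, 2, 3, 4]: sequential slots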
Other, more traditional swapping and/or other large chunk transfers
have tended to use strictly contiguous virtual memory locations. Big
pages were slightly more adaptive than a strictly contiguous virtual
memory paradigm .... i.e. the membership in a big page tended to be an
aggregation of 4k pages (not necessarily contiguous) that were being
used together (as opposed to a strictly 40k contiguous section of
virtual memory).
Big pages tended to transfer a larger number of pages than strictly 4k
oriented operations (possibly double the number of pages, but the
increase in pages transferred was more than offset by the reduction in
the number of unique transfer operations). However, big pages would
tend to have much less transfer than a paradigm purely based on
contiguous virtual memory location.
2305 is a fixed-head disk ... so all records are logically contiguous w/o
requiring arm motion ... whereas big pages attempted to optimize the
arm access efficiency of 3380s. 3380s offered a lot larger space
at a much lower price/byte than 2305.
A single 4k page fault paradigm mapped to 2305 could get multi-record
transfers when servicing some number of independent processes (think
cache misses in a multi-threaded cpu architecture). However, any
single process would still encounter the overhead and latency of
moving one page at a time (it is at the system level that you see the
efficiency of large transfers).
A single 4k page fault paradigm might have a process going thru eight
distinct (4k) page faults to bring in 32k bytes of virtual memory in
eight distinct transfer operations. Mapped to a big page paradigm, the
process might bring in ten 4k pages on the first fault. In this
sense, the big page paradigm is somewhat trading off both real storage
resources as well as disk transfer rate resources to optimize arm
access resources.
Theoretically, there is some possibility of mapping big page operation
to 2305 ... but there again, you may be trading off space resources on
2305 against accesses. Since space is much more limited on 2305, and
accesses are less of an issue, it might not be a good trade-off. On the
other hand, big pages allow 3380s to approach (and/or possibly
exceed in some areas) the thruput of 2305 and be able to take advantage
of much better price/byte as well as much larger space.
possibly more than you wanted to know. previous posting had some
references to other detailed description of big pages:
http://www.garlic.com/~lynn/2003o.html#61
some past discussions regarding "dup/no-dup" management of page space
... i.e. when a page is read from disk, is the location kept allocated
and therefore can save a subsequent write if the page in memory is
replaced but never changed ... or is the space occupied by a read page
always de-allocated ... which then always requires rewriting the page when
it is subsequently replaced (even if not changed). Most of these
discussions were with respect to optimizing use of limited disk space
(not having a duplicate of a page both on disk and in real storage).
The 3380 big page scenario was part of a strategy of trying to help
improve disk arm locality:
http://www.garlic.com/~lynn/93.html#12 managing large amounts of vm
http://www.garlic.com/~lynn/93.html#13 managing large amounts of vm
http://www.garlic.com/~lynn/94.html#9 talk to your I/O cache
http://www.garlic.com/~lynn/2000d.html#13 4341 was "Is a VAX a mainframe?"
http://www.garlic.com/~lynn/2001i.html#42 Question re: Size of Swap File
http://www.garlic.com/~lynn/2001l.html#55 mainframe question
http://www.garlic.com/~lynn/2001n.html#78 Swap partition no bigger than 128MB?????
http://www.garlic.com/~lynn/2002b.html#10 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#16 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#19 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#20 index searching
http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures
http://www.garlic.com/~lynn/2002f.html#26 Blade architectures
http://www.garlic.com/~lynn/2003f.html#5 Alpha performance, why?
--
>Sander Vesik <san...@haldjas.folklore.ee> writes:
>
>> Yes, but whoever tried producing that probably got shot or went out
>> of business long time ago. Can you think of any real system that
>> doesn't use page groups for paging?
>
>I don't really know any system on that level of detail. My Red Hat
>laptop is incredibly slow when memory gets exhausted (which happens a
>lot when I run some of my own stuff). After my app dies, switching to
>another application takes a long time (many seconds, maybe tens of).
>I've ascribed it to fragmented paging since ten seconds should be
>close to enough to page in the complete memory (512Mb), but if anybody
>has a better explanation, I'm all ears.
>
>(Another annoyance is that even after a long period of idleness,
>applications aren't paged in. It would be neat to have the memory
>repopulated stealthily when the computer is idle, but I guess it's
>hard to predict what would be the most useful targets. An inverse
>LRU, perhaps? Too much bookkeeping?)
>
>>> Faster disks it is, then. When can we have them?
>
No explanations, but you might get a partial explanation by going for
a light weight window manager. Usually makes a big difference. Easy
to do running a remote x-terminal (obviously).
The sense in which it is a partial explanation is that I think a lot of
the shuffling that is going on has to do with CORBA and other
intermediate layers having to do with the user interface and not your
application itself.
What I *have* been through is dumping Nautilus to draw the desktop and
switching to gmc to draw the desktop on slower machines when Nautilus
first came out. Nautilus has gotten a lot better since then, gmc is no
longer supported, and so I can't tell you what details to fiddle
with or what to replace with what, but the default RH user interface
is a memory and cycle hog.
When I log into one of my boxes, even over 100Mbit ethernet, I'm much
better off using a local x-server, twm, and xnc as a file manager.
All the functionality without the eye candy and without the overhead.
You can do the same thing on a display at the console (probably better
to define a new display and leave the default :0.0 display intact).
I've been through that, but I couldn't walk you through the details.
If you go to the trouble, it will probably fix your perceived
problems.
RM
And cost effective is often a terrible metric. To use a ridiculous
american example, the engineers designing the ford pinto station wagon
calculated that the cost of settling a few wrongful death lawsuits was
less than the manufacturing cost of the modification to all those cars
to prevent fires in a small number of high speed rear end collisions.
How do you calculate the cost effectiveness of a program to assure (or
attempt to assure) the reliability of a bunch of nuclear weapons that
there is a high probability will never be used?
What was the cost effectiveness of the U2? SR71? B2? How about the
Glomar Explorer? Or the project to tap the Soviet undersea cable by
sneaking in by sub? How about non-black programs? What was the cost
effectiveness of keeping all those troops in Germany for 50 years?
Didn't do a thing for me, cost me money. Guess that makes it negative
infinity. :-)
del cecchi
AS400/i series has/had a feature called CPM, continuously powered
mainstore. the memory was kept alive even in battery backup, although
the batteries were somewhat larger than laptop batteries. :-)
And are you sure that laptops don't keep memory alive while suspended?
Mine comes up awful fast with fnF4.
del cecchi
"suspended", yes - that state basically turns off peripherals, but keeps
memory and processor powered. You'll still run out of power after some
time - possibly 12 hours or so.
"hibernate" dumps state on disk, and turns everything off. Thus, when
awakening the machine, it needs to read that state back in - the discussion
revolved around how much of that state needs to be read initially, and
how much can be transferred to later to cut down the press-switch-to-
press-CTRL-ALT-DEL latency.
On my laptop, the hibernate file is almost contiguous (two chunks), and
I'd guess that reading state takes about half of above-defined latency,
perhaps somewhat less. It's fast enough, and certainly much better than
rebooting and logging in over and over again on each use.
Jan
VMS did similar things at a similar time. I do believe that it usually
keeps page-file allocation, though, instead of throwing it away as you
describe for VM. I can see that this could be a performance advantage.
Jan
Although I'm sure parts of that story are urban legend or apocryphal,
it is a legitimate example of an engineering trade-off - hey, such things
happen all the time, even if the public - particularly the always-queasy
US public - raises an outcry when such issues are pointed out to them.
> How do you calculate the cost effectiveness of a program to assure (or
> attempt to assure) the reliability of a bunch of nuclear weapons that
> there is a high probability will never be used?
It's difficult, of course...and the real problem is in your parenthetical
"attempt to assure", i.e., we really can't check whether any of those
programs actually achieved that more narrowly defined goal.
But I'm sure that even within this scope, one can guesstimate whether the
approach taken by DoE/DoD for this purpose was appropriate or not. The more
secrecy involved, the less ability you have to do that.
> What was the cost effectiveness of keeping all those troops in Germany
> for 50 years? Didn't do a thing for me, cost me money. Guess that makes
> it negative infinity. :-)
Oh sure it did a lot of things for you - Germany's a very good market for
your employer, and we lent (the collective) you the money you need to finance
your double deficit 8-).
Jan
No - EARLIER! Fancy positioning was introduced for 3330 support,
and was more-or-less eliminated in the massive MVT 21.6 "performance
enhancement", which was the same one that broke chained scheduling
and more-or-less abolished PCI. There were good reasons for many
of the changes, but chained scheduling need not have been broken as
badly as it was.
Regards,
Nick Maclaren.
<snip>
>
>> How do you calculate the cost effectiveness of a program to assure (or
>> attempt to assure) the reliability of a bunch of nuclear weapons that
>> there is a high probability will never be used?
>
>It's difficult, of course...and the real problem is in your parenthetical
>"attempt to assure", i.e., we really can't check whether any of those
>programs actually achieved that more narrowly defined goal.
>
>But I'm sure that even within this scope, one can guesstimate whether the
>approach taken by DoE/DoD for this purpose was appropriate or not. The more
>secrecy involved, the less ability you have to do that.
>
I am sure that there must be at least one person inside the stockpile
stewardship program who must be as hysterical as I would be if anybody
actually imagined that such a mission could be accomplished with any
finite expenditure of resources and without detonating warheads.
In that sense, the comment that was made to Tom Knight that they need
every penny for critical program needs makes perfect sense to me. All
the resources of the galaxy wouldn't accomplish what they have set out
to accomplish.
By the same token, skimming one percent for basic research wouldn't
change the probability that the stockpile is effective at an
acceptable level, because the one sigma error bars of that estimate
have to skim somewhere near zero ... if the estimate isn't made by
cooking the books.
When the Challenger blew up--bad project management, faulty reasoning
about probabilities--the military had to go to a backup program of
expendable launch vehicles, and there followed a string of failures
with *them*. There was a brief period when we couldn't seem to get
anything into orbit. No special access required for this knowledge.
Just follow Aviation Week.
A thousand times a petaflop wouldn't make the stockpile stewardship
program believable, but it might well accomplish some other goals that
would make the United States and its allies safer and more powerful if
those computing resources were put to a mission that is actually
achievable.
>> What was the cost effectiveness of keeping all those troops in Germany
>> for 50 years? Didn't do a thing for me, cost me money. Guess that makes
>> it negative infinity. :-)
>
>Oh sure it did a lot of things for you - Germany's a very good market for
>your employer, and we lent (the collective) you the money you need to finance
>your double deficit 8-).
>
The US actually used atomic weapons that brought to a close a war that
was truly global in scope. It left the victors who had brought that
awesome might to bear: the US and the Soviet Union, face to face
with teeth bared, and nowhere so much so as in Germany.
blame, what should have been done, who cares...it's over. People
everywhere paid a high price.
Instead of reaping a peace dividend as a result of *finally* bringing
that conflict and its after-effects to a close, we now seem to be
facing a terrorism dividend. I don't imagine but that the US is using
whatever infrastructure it possesses to address those issues. For all
I know, the powers that be have come to the same conclusion as I have
about the stockpile stewardship program and are planning to use their
petaflop to address bioterrorism.
In any case, it seems that the awful distortions of US society and of
scientific research in the US that WWII and the cold war begat are
destined to live on.
I wanted to make it clear, and I have done so, that the biggest
computer in history is to be deployed in a manner that perpetuates
that distortion.
RM
RPS wasn't used for picking first record rotating under the head,
CKD data was.
lots of CKD disk channel programs would loop on SEARCH-ID equal ...
to start transfer at a specific record. This paradigm was originally
from early 360 when memory was really expensive and nearly
non-existent in outboard boxes. As a result, SEARCH-ID equal would
loop in the controller as each record passed under the head ... but use
the id parameter in processor memory. That meant that the controller
was tied up for the duration of the loop (until the record passed under
the head), as well as the channel interface. This was an inefficiency
in a system where there might be 16 drives per controller (having the
controller dedicated/busy to a specific drive operation for an extended
period) and multiple controllers per channel (having the channel
dedicated/busy to a specific drive operation for an extended period).
This was during a period when (excess) transfer (i/o) resources were
traded off for limited memory resources. By the middle 70s the
resource constraints had flipped .... memory was much more abundant
than i/o resources ... and therefore CKD represented a resource
trade-off that was exactly the opposite of what the environment called for.
RPS was introduced to sort of mitigate the problem. The system could
have a good idea of the sector position (on the track) for start of
each specific record. A channel program was provided that would start
the operation and then disconnect the controller and channel until the
drive sensed that specific sector position and then attempt to
reconnect & resume the channel program. If all things went well, the
search-id equal would be executing exactly when the correct record
start was passing under the head.
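A rough comparison of how long the control path stays busy per record
under the two schemes (a sketch; the 3600 rpm rotation and 1 ms record
transfer are just assumed round numbers):
# Controller/channel busy time per record: SEARCH-ID busy loop vs. RPS.
rev_ms, xfer_ms = 1000 / 60.0, 1.0      # assume 3600 rpm, 1 ms transfer
avg_wait_ms = rev_ms / 2                # average rotational latency
busy_search = avg_wait_ms + xfer_ms     # path held through the whole wait
busy_rps = xfer_ms                      # path held only for the transfer
print(f"search-id loop ~{busy_search:.1f} ms busy, RPS ~{busy_rps:.1f} ms busy")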
The logging optimization is just the opposite situation, it wanted to
start transfer (read or write) as soon as the start of any record
rotated under the head (and would write a full tracks worth of
records). It would use a search-id parameter that would always be
valid for all records. RPS was used to disconnect to avoid dedicating
resources for an extended search-id loop until a specific record
(sector position) rotated under the head. The logging optimization
didn't care what record rotated under the head (and therefore didn't
care what the sector position was) ... it just wanted to start
transfer at the start of any record.
Now, the '60s (and system/360) use of CKD led to some advantage being taken
of the fact that the search-id field was always fetched from main
memory. One was writing a self-modifying channel program that
possibly read the search-id field (for a subsequent channel
instruction) from the disk. As a result, it precluded prefetching
instructions and parameters (as an optimization method to avoid the
tie-up of the resources).
For a little drift from the
http://www.garlic.com/~lynn/2003o.html#54 An entirely new proprietary hardeware strategy
thread
The mainframe ESCON (fiber-optic) implementation had been knocking
around POK since the 70s. However, it exactly emulated the half-duplex
synchronous, bus&tag copper cable operation ... in part because of the
non-prefetching rules that were needed to support the channel program
modification on the fly capability.
As referenced in the above, SLA for the RS/6000 was sort of an ESCON
derivative; in addition to having about ten percent higher transfer
rate and cheaper optical drivers ... it was also full-duplex and
asynchronous (in that respect it had much more in common with FCS than
ESCON).
As mentioned in previous posts about that era ... the HiPPI standard was
somewhat driven by LANL as a standardization of the Cray half-duplex,
parallel copper channel, and the FCS standard was somewhat driven by LLNL
as a fiber-optic standardization of the Ancor non-blocking switch
installation they had (adding a little more drift to another thread).
The other extreme optimization of the 360s paradigm that even carries
over until today was the use of multi-track search for finding things
on disk (as a means of minimizing real-storage caching of index
information). This was implemented in two os/360 facilities, the drive
VTOC (i.e. directory of datasets/files on the disk) and the library PDS
(directory of members in a library dataset/file). Multi-track search
extended the search paradigm to scanning all records on all tracks on
a cylinder for a matching entry. For 3330, this meant that an
unsuccessful search could take 19 revolutions (@ 60/second) during
which time the (shared) channel and (shared) controller were tied up
and dedicated to the operation (aka 1/3rd second per search). In
pathological situations this could have a severe performance penalty.
All of this to avoid tying up (1960s) "scarce" real storage for caching
of highly used directory information. Note that RPS didn't do anything
for multi-track search operations because they kept no pre-knowledge
about where the record position (that they were searching for) might be.
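Just to put a number on that worst case (a sketch using the figures above):
# Unsuccessful full-cylinder multi-track search on a 3330-class drive:
# 19 tracks at 60 revolutions/second, channel and controller dedicated.
tracks, revs_per_s = 19, 60
print(f"~{tracks / revs_per_s:.2f} s of channel+controller busy per miss")  # ~0.32 s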
some past drifts about HiPPI, LANL, FCS, LLNL, ancor, escon, sla,
cdrom, etc.
http://www.garlic.com/~lynn/2001f.html#66 commodity storage servers
http://www.garlic.com/~lynn/2001m.html#25 ESCON Data Transfer Rate
misc. past threads involving multi-track searches
http://www.garlic.com/~lynn/93.html#29 Log Structured filesystems -- think twice
http://www.garlic.com/~lynn/94.html#35 mainframe CKD disks & PDS files (looong... warning)
http://www.garlic.com/~lynn/97.html#16 Why Mainframes?
http://www.garlic.com/~lynn/97.html#29 IA64 Self Virtualizable?
http://www.garlic.com/~lynn/99.html#75 Read if over 40 and have Mainframe background
http://www.garlic.com/~lynn/2000f.html#18 OT?
http://www.garlic.com/~lynn/2000f.html#19 OT?
http://www.garlic.com/~lynn/2000f.html#42 IBM 3340 help
http://www.garlic.com/~lynn/2000g.html#51 > 512 byte disk blocks (was: 4M pages are a bad idea)
http://www.garlic.com/~lynn/2000g.html#52 > 512 byte disk blocks (was: 4M pages are a bad idea)
http://www.garlic.com/~lynn/2001c.html#17 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001d.html#60 VTOC/VTOC INDEX/VVDS and performance (expansion of VTOC position)
http://www.garlic.com/~lynn/2001d.html#64 VTOC/VTOC INDEX/VVDS and performance (expansion of VTOC position)
http://www.garlic.com/~lynn/2001l.html#40 MVS History (all parts)
http://www.garlic.com/~lynn/2002.html#5 index searching
http://www.garlic.com/~lynn/2002.html#6 index searching
http://www.garlic.com/~lynn/2002.html#10 index searching
http://www.garlic.com/~lynn/2002d.html#22 DASD response times
http://www.garlic.com/~lynn/2002f.html#8 Is AMD doing an Intel?
http://www.garlic.com/~lynn/2002g.html#13 Secure Device Drivers
http://www.garlic.com/~lynn/2002l.html#47 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002l.html#49 Do any architectures use instruction count instead of timer
http://www.garlic.com/~lynn/2002n.html#50 EXCP
http://www.garlic.com/~lynn/2002o.html#46 Question about hard disk scheduling algorithms
http://www.garlic.com/~lynn/2003.html#15 vax6k.openecs.org rebirth
http://www.garlic.com/~lynn/2003b.html#22 360/370 disk drives
http://www.garlic.com/~lynn/2003c.html#48 "average" DASD Blocksize
http://www.garlic.com/~lynn/2003f.html#51 inter-block gaps on DASD tracks
http://www.garlic.com/~lynn/2003k.html#28 Microkernels are not "all or nothing". Re: Multics Concepts For
http://www.garlic.com/~lynn/2003k.html#37 Microkernels are not "all or nothing". Re: Multics Concepts For
http://www.garlic.com/~lynn/2003m.html#56 model 91/CRJE and IKJLEW
> In article <br4okn$dfs$1...@news.btv.ibm.com>,
> ha...@watson.ibm.com (hack) writes:
> |>
> |> I find it positively perverse that boot times have been increasing
> |> as machines have gotten faster...
> |>
> |> P.S. I think the trend has finally been reversed. For one thing,
> |> hardware improvements have been such that even the grossest
> |> software bloat has been unable to keep up with it.
>
> Oh, yeah? You underestimate the ingenuity of the software
> community.
Definitely. Parsed any XML recently?
--
Greg Pfister
Oops. TLA recall failure. Thanks.
Regards,
Nick Maclaren.
> In article <br5b21$e9q$1...@news.btv.ibm.com>, hack <ha...@watson.ibm.com> wrote:
>
>>In article <10709849...@saucer.planet.gong>,
>>Rupert Pigott <r...@dark-try-removing-this-boong.demon.co.uk> wrote:
>>
>>>"hack" <ha...@watson.ibm.com> wrote in message
>>>
>>>>The word "only" suggests you think 5 minutes is good!
>>>
>>>I think it would be a great boot time for a 512 node machine
>>>with a few TB of spinning dust. Whether it's *fast enough*
>>>that is another question... :)
>>
>>I don't see why a REboot of 512 nodes should take longer than a reboot
>>of one -- each node does its own, and I would hope they can signal each
>>other to do their thing in milliseconds, not minutes. This being reboot,
>>platters are spinning and hardware is warm.
>
>
> I do. But there is no justification for it taking MUCH longer than
> for one! There are two good reasons why it should take longer (though
> most of the actual ones are bad):
>
> 1) Checking the interconnect is an O(N^2) problem, and therefore
> will take O(N) time, where N is the number of nodes. You are going
> to check every path, aren't you?
[snip]
Careful here. "Every path" is a rather broad statement, and it doesn't
have to be limited to O(N^2). Many network topologies have lots of
redundancy, with many different paths between nodes. Think hypercubes,
or torus networks as easy-to-visualize examples; other networks are
often configured with less obvious redundancy.
What this means is that you don't check paths unless you're completely
paranoid; and it's structly not necessary, anyway. You check that all
the links work, and each switch's routing individually works, and
you're done. Those checks are usually around O(N) or O(NlogN) each,
and doing it in parallel with N processors reduces to constant time
(or with a logN multiplier).
--
Greg Pfister
Del, fnF4 is suspend, not hibernate. Suspend is continuously powered
memory; it still eats the battery, but very slowly. fnF12 is
hibernate; it puts memory contents on disk (if you've enabled it,
including allocating the file on disk); it eats no battery at all.
--
Greg Pfister
Not yet. I'm still waiting for it to finish. :-)
I've seen a server spend 70% to 90% of its time
in XSLT processing.
Bob S
That is a fair point, though I was obviously thinking about only the
combinations of endpoints! Certainly, you may have to check multiple
paths between a pair of nodes in many cases.
>What this means is that you don't check paths unless you're completely
>paranoid; and it's strictly not necessary, anyway. You check that all
>the links work, and each switch's routing individually works, and
>you're done. Those checks are usually around O(N) or O(NlogN) each,
>and doing it in parallel with N processors reduces to constant time
>(or with a logN multiplier).
And why do you think that we have so much trouble with systems passing
boot and then failing when we try to use them?
Yes, I agree that you can't check every combination, but you do at
least need to exercise all of the major paths, and to check that the
switching is actually sending packets to where they are intended to go.
For example, consider the classic error where one input port is losing
one bit of the destination address. I sincerely hope that your tests
would pick that up, and that makes the testing O(N^2).
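A toy illustration of why (the port numbering and fault are hypothetical):
a port that drops one destination-address bit misroutes only those packets
whose destination has that bit set, so you only catch it by exercising
enough source/destination pairs.
# All-pairs probe: port 3 loses one bit of the destination address, so only
# certain (src, dst) combinations misroute.  Entirely illustrative.
FAULTY_PORT, LOST_BIT = 3, 0b100
def route(src, dst):
    # where a packet actually ends up; the faulty input port drops LOST_BIT
    return dst & ~LOST_BIT if src == FAULTY_PORT else dst
def all_pairs_check(n):
    return [(s, d) for s in range(n) for d in range(n)
            if s != d and route(s, d) != d]
print(all_pairs_check(8))   # [(3, 4), (3, 5), (3, 6), (3, 7)]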
Regards,
Nick Maclaren.
Yes. And for the case we have been talking about, reading back the whole
memory from disk where it was written in a previous hibernate operation, the
I/Os would be very long anyway, so we aren't doing any additional read
operations, just doing multiple of them in parallel.
Note that since striping primarily helps transfer rate (1), it only really
helps where transfer time is a significant component of I/O time. This is
only on long transfers. So you want the stripe size to be pretty big to avoid
having to do "extra" I/Os on short transfers. This is essentially restating
what you said above.
(1) Striping also has some advantage in tending to equalize the workload
across multiple drives and thus reducing queuing delay, but that isn't
relevant to the situation we are discussing here.
(snip)
> RPS wasn't used for picking first record rotating under the head,
> CKD data was.
The one I was thinking about, though I am not sure how it would
work, would be to do something like READ SECTOR to find the
current position, and then you would know which block to start
writing at. It seems, though, that READ SECTOR only works after
another READ operation.
If you do READ COUNT you can find the number of the next block,
but too late to write that one.
-- glen
cheat, start writing at the first record ... write a track full of
records and then do the read after writing the last record ... which
should return the information of the first record.
Yes ... look at what they're simulating in the colour plots from the
Blue Gene conference at
http://www.lanl.gov/asci/platforms/bluegene/agenda.html - shock-fronts
hitting voids in explosive, grain boundaries in explosive, how exactly
TNT behaves when it cooks off, how things tend to shatter from internal
explosions.
I can't see what they gain from that level of simulation that they
can't gain from lab tests, and even less can I see how "simulate
exactly how the shock-front will propagate through this explosive that
has gone bubbly" is preferable in terms of trusting the result to "oh
bugger, the explosive has gone bubbly; starting with warheads AA00
through AA99, start shipping them back to Sandia and replace the
explosive with one that hasn't gone bubbly".
I suppose that supercomputers are not very expensive by Sandia's
standards; if a simulation taking a month on a $50M computer tells you
that you *don't* have to spend ten thousand man-years and create
implausible volumes of radioactive waste dismantling, repairing and
reassembling every warhead in the fleet, you're well ahead.
Tom
What I don't want to do is pay the cost of a diagnostic boot -- nay, a
full configuration-exploring boot -- each and every time, because the
alternative is not available.
Michel.
Here's why it's important.
One, lab tests of explosives are in fact really hard.
*particularly* when you're talking about trying to test
specific defects and aging behaviour of the material.
Introducing a given sized defect into the middle of a
solid block of plastic-bonded explosive which has been
sitting on the shelf for 25 years is impossible,
much less telling what differences there will be
due to random environmental effects and radiation
in the deployed weapons.
Two, lab tests of explosives are in fact really hard.
You can't instrument or image explosions in progress
all that well; you can see what's happening on an
exposed explosive surface, or embed wires for some
detonation wave position/time studies, but nuclear weapons
involve surfaces which are tamped with solid metal layers and
shaped in thin 3-D curves, so not only is nothing exposed
but in a lot of cases the best flash X-ray imaging is
about as precise as the thickness of some of the parts
involved, completely missing any internal details,
and embedding wires will disrupt the symmetry enough
that it changes the results from that of a live weapon.
Three, lab tests of plutonium pit behaviour are in fact
really hard. The problem here being that it's radioactive,
the isotopes that aren't fissile being far more radioactive
than the isotopes that are, so if you're doing 1:1 testing
and need to avoid actually exploding something you're talking
about a godawful radioactive mess in the test chamber.
And it's hard to simulate the aging effects that they
are particularly concerned about modeling in new test pits.
Four, they need to know what levels of defect are likely to
cause misbehaviour and therefore need to be detected in
nondestructive testing they can do. Do they have to ultrasound
for 0.1mm defects, or 0.3mm, or 1mm? How much phase change
in the plutonium layers will cause warhead malfunction?
Note that in a lot of these cases, the stockpile stewardship
program really isn't substituting for nuclear testing.
Nuclear testing is really a very gross approach to validating
that the stockpile is valid. It shows that particular units
aren't way out of spec and that your modeling hasn't missed
any gross effects. But that's about all. Figuring out what
the predicted lifetime of parts and what the failure mechanisms
and decay mechanisms are will still need to be done regardless.
Stockpile stewardship is an attempt to minimize having to regularly
rebuild our weapons (junk all the non-nuclear components and build new,
and recast all the fissile materials) every 20 or so years for each
warhead. And given that the newest US warheads are about 12 years
old now, and that some of the ones in service date back to the 1970s,
this is sorta important...
-george william herbert
gher...@retro.com
<snip>
>
>Note that in a lot of these cases, the stockpile stewardship
>program really isn't substituting for nuclear testing.
>Nuclear testing is really a very gross approach to validating
>the stockpile. It shows that particular units aren't way out
>of spec and that your modeling hasn't missed any gross effects.
>But that's about all. Figuring out what the predicted lifetime
>of parts is, and what the failure and decay mechanisms are,
>will still need to be done regardless.
>
<snip>
Had NASA never put anything in orbit, they would be writing textbooks
about their faulty methodologies, handing out prizes to the managers
who constantly screwed up, giving themselves promotions and big
budgets, and otherwise patting themselves on the back. The same could
be said for any number of aerospace programs that existed only on
paper until the moment of truth.
It's really convenient to have you here, so I don't have to explain
how things can get so bad in a program like the Space Shuttle.
They had a model for the wing damage to the shuttle from insulating
foam, too. It wasn't a "what if" scenario. Real, live human beings
were circling the earth, and it wasn't the first time they made a life
or death decision based on faulty analysis and guessed wrong.
There are people on Wall Street playing a similar game, and there are
people dumb enough to buy their advice. Nice gig if you can get it:
advancing (at very high cost) predictions that can't really be tested.
RM
But they did manage to put the first Shuttle stack into orbit and get it
back in one piece - without much incremental testing to speak of. Sure,
that involved some close calls, but it did work the first time.
Jan
You are showing your age :-)
It is really quite common for new kernels or system crashes to leave
the firmware in an inconsistent state, or even corrupt it. So that
is NOT a case where you want a minimal amount of checking! And it
is actually THAT problem that has caused us more wasted time than
the simple one of hardware going AWOL and not being detected by the
automatic checks.
Remember that the above includes almost every case where the solution
to the problem is to power cycle the component, rather than simply
restart it. And there are a lot of those.
|> What I don't want to do is pay the cost of a diagnostic boot -- nay, a
|> full configuration-exploring boot -- each and every time, because the
|> alternative is not available.
No dissension there. At the very least, there should be several
levels (a rough selection sketch follows the list):
Boot after a clean shutdown
Boot after a crash, known to be software
Boot after a messy crash [ if the failing component is known,
checking can be concentrated on that ]
Normal field diagnostic boot
Engineering boot 1
Engineering boot 2
. . .
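
To make that concrete, here is a minimal C sketch of how firmware
might pick a level from the recorded shutdown state. Every name in
it (the enum, the last_shutdown record, pick_boot_level) is invented
for illustration; no real vendor's firmware exposes this interface.

#include <stdio.h>

enum boot_level {
    BOOT_CLEAN,          /* after a clean shutdown: minimal checks     */
    BOOT_SOFT_CRASH,     /* after a crash known to be software         */
    BOOT_MESSY_CRASH,    /* after a messy crash: focus on suspect part */
    BOOT_FIELD_DIAG,     /* normal field diagnostic boot               */
    BOOT_ENGINEERING     /* full engineering diagnostics               */
};

struct last_shutdown {
    int clean;           /* previous shutdown was orderly              */
    int hw_suspected;    /* hardware implicated in the crash           */
    int operator_diag;   /* operator explicitly asked for diagnostics  */
};

static enum boot_level pick_boot_level(const struct last_shutdown *s)
{
    if (s->operator_diag)
        return BOOT_FIELD_DIAG;
    if (s->clean)
        return BOOT_CLEAN;
    if (!s->hw_suspected)
        return BOOT_SOFT_CRASH;
    return BOOT_MESSY_CRASH;  /* concentrate checks on the suspect unit */
}

int main(void)
{
    struct last_shutdown s = { 0, 1, 0 };  /* messy crash, hw suspected */
    printf("boot level = %d\n", (int)pick_boot_level(&s));
    return 0;
}

The point is only that the selection logic is trivial; the hard part
is getting the firmware to record the shutdown state reliably in the
first place.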
Regards,
Nick Maclaren.
Yes, and the flag-waving and cheering was premature.
Had there been an appropriate level of integrity in the system, they
would have caught and fixed problems much sooner than they did.
My expectation was that, when people could no longer test nuclear
weapons, they would no longer have the expectation that they could
actually use them. I guessed wrong.
I certainly would not base a war-fighting strategy on computer
simulations of equipment that had not actually ever been used for over
a decade.
There are two classes of people keeping this enterprise going:
1. Those with no meaningful technical skills.
2. Those who profit from it.
The ones with no meaningful technical skills are calling the shots.
RM
Well, *that* you can certainly do with Irix on SGI Origins, see
<URL:http://www.sgi.com/developers/technology/irix/partitions.html>
for an overview (as well as the impact of partitioned environments
on software licensing & keys).
+---------------
| If you have that facility I guess a "nice to have" would be adding &
| removing CPUs from CC domains.
+---------------
In Irix you can certainly do that, but AFAIK you have to reboot a
partition ("CC domain") to add/remove a CPU to/from it.
Actually, you can't just add or remove a single CPU; you have to
add/remove an entire NUMAlink "node", containing some memory and
several CPUs. Depending on the current NUMAlink topology and the
exact change you're trying to accomplish, you might even have to
add/remove other nodes and perhaps even routers.
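
A toy C sketch of the granularity, if it helps; CPUS_PER_NODE is a
number I've assumed for illustration, and the real topology rules
(extra nodes, routers) aren't modelled at all:

#include <stdio.h>

#define CPUS_PER_NODE 4   /* assumed; varies with the actual brick */

/* Smallest legal addition that yields at least want_cpus more CPUs:
   whole nodes only, each dragging its memory along with it. */
static int nodes_needed(int want_cpus)
{
    return (want_cpus + CPUS_PER_NODE - 1) / CPUS_PER_NODE;
}

int main(void)
{
    int want  = 1;                  /* "just one more CPU, please" */
    int nodes = nodes_needed(want);

    printf("adding %d CPU(s) really means adding %d node(s): "
           "%d CPUs plus their memory, and possibly extra routers\n",
           want, nodes, nodes * CPUS_PER_NODE);
    return 0;
}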
-Rob
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
Well, booting *is* typically "seconds" for a single CPU with only
a GB of RAM and a single disk of a dozen GB that wasn't too busy
writing when the system went down. Why, I've seen a small MIPS/Irix
machine do a complete reboot from a clean shutdown in as little as
9 seconds!!
But when you've got 512 CPUs, ~1 TB of RAM, literally *hundreds*
of SCSI or FCS controllers, and a petabyte or so of disks [striped
across all those busses] that was being actively written at ~1 GB/s
when the system crashed, well... that takes just a little bit longer
to reboot, o.k.?!? ;-} ;-}
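
To put rough numbers on that, here's a back-of-the-envelope in C.
Every figure in it is an assumption picked for illustration, not a
measurement of any real machine:

#include <stdio.h>

int main(void)
{
    /* Small machine: 1 CPU, 1 GB of RAM, one disk. */
    double small_ram_gb  = 1.0;
    double small_disks   = 1.0;

    /* Big machine: ~1 TB of RAM, hundreds of controllers, and
       thousands of spindles behind them (all figures assumed). */
    double big_ram_gb    = 1024.0;
    double big_ctrls     = 300.0;
    double big_disks     = 6000.0;

    double ram_test_gbps = 1.0;    /* assumed memory-test rate per node */
    double nodes         = 128.0;  /* memory tested in parallel by node */
    double ctrl_probe_s  = 2.0;    /* assumed per-controller probe time,
                                      controllers probed one at a time  */
    double disk_check_s  = 0.5;    /* assumed per-disk label/fs check   */
    double disk_parallel = 100.0;  /* checks overlapped across busses   */

    double small = small_ram_gb / ram_test_gbps
                 + small_disks * disk_check_s;
    double big   = big_ram_gb / (ram_test_gbps * nodes)
                 + big_ctrls * ctrl_probe_s
                 + big_disks * disk_check_s / disk_parallel;

    printf("small machine: ~%.0f s of checking\n", small);
    printf("big machine:   ~%.0f s (~%.0f min), even with heavy overlap\n",
           big, big / 60.0);
    return 0;
}

Even with the memory test and the disk checks heavily overlapped,
the serialised controller probes alone push the big configuration
well past the "seconds" regime.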
On the subject of this post, I just read an interesting piece. Take it for
what it is worth.
http://www.crichton-official.com/speeches/speeches_quote04.html
What a horrible thought! I have to admit that my primary exposure is to
S/370 -> zSeries, where firmware and microcode are definitely off-limits
to the operating systems. Yes, there may be pinholes where special
instructions (diagnose and friends) can in principle screw up the machine
in ways that re-IPL might not clean up, but these are available only in a
mode where one would only run production-level systems anyway. Accidental
misuse is extremely unlikely unless one was in fact messing with that very
interface, which one would never do in a production partition.
I've also been involved with pSeries firmware. In that environment good
isolation of the OS can be achieved in a logical partition, and that seems
to be the direction in which big systems are evolving. Interfaces that
could be messed up (e.g. direct access to memory controllers and I/O bridges)
would be fenced off, but it bothered me that in basic mode (non-LPAR) an OS
could in principle mess up basic facilities like those (e.g. via errant store
instructions when address translation is disabled -- but modern OSes have very
limited use of DAT off, if any, so accidental missteps are again unlikely).
There are ways in which configurable components could at least be checked to
see if they have been messed with since the last boot, so they don't have to
be re-initialised. But first one needs to believe that it is worthwhile
trying to cut out unnecessary work at boot time. As long as people think
minutes instead of seconds, this may not happen. That's why my hackles are
raised whenever I hear "takes ONLY n minutes to boot".
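
For what it's worth, here is the sort of thing I have in mind, as a
minimal C sketch: record a checksum of each component's configuration
at shutdown, and at the next boot re-initialise only the components
whose checksums no longer match. Everything in it (the component
table, full_init(), the saved checksums) is hypothetical.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* FNV-1a hash over a component's configuration words. */
static uint32_t config_hash(const uint32_t *regs, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= regs[i];
        h *= 16777619u;
    }
    return h;
}

struct component {
    const char     *name;
    const uint32_t *regs;   /* current config register contents        */
    size_t          nregs;
    uint32_t        saved;  /* checksum recorded at last clean shutdown */
};

/* Stand-in for the expensive path: reprogramming a memory controller,
   retraining links, and so on. */
static void full_init(struct component *c)
{
    printf("%s: config changed since last boot, re-initialising\n", c->name);
}

static void boot_check(struct component *comps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (config_hash(comps[i].regs, comps[i].nregs) != comps[i].saved)
            full_init(&comps[i]);
        else
            printf("%s: unchanged, skipping re-init\n", comps[i].name);
    }
}

int main(void)
{
    static const uint32_t memctl[] = { 0x10, 0x20, 0x30 };
    static const uint32_t bridge[] = { 0xA0, 0xB0 };

    struct component comps[] = {
        { "memory controller", memctl, 3, config_hash(memctl, 3) }, /* intact  */
        { "I/O bridge",        bridge, 2, 0xDEADBEEFu },            /* knotted */
    };
    boot_check(comps, 2);
    return 0;
}

Of course the saved checksum has to live somewhere the crash can't
reach, or the scheme checks nothing.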
> Boot after a clean shutdown
> Boot after a crash, known to be software
If a system (like the one I'm running) doesn't have the concept of shutdown,
this distinction is not needed. It's a real time saver when playing with
experimental kernels. But I agree that the distinction should be made in
the general case. (With my OS I could always simply pick "clean".)
Michel.
I am TALKING about production-level systems :-(
Let me just say that I don't know of any current vendor of large,
"high availability" Unix systems that doesn't suffer from the problem
of firmware getting knotted when the system crashes suitably horribly.
It is rare for it to need reinstalling, but it is not rare for it to
need a power cycle.
The zSeries is an old mainframe system, and starts from a different
viewpoint.
Regards,
Nick Maclaren.
Folks base strategies on things they have only simulated all the
time. And since we agreed to stop testing but still need a nuclear
deterrent (at least in my opinion), we have no choice but to rely on
simulation to understand the state of the stockpile and perhaps to
design revisions to the weapons. The folks calling the shots are
hopefully relying on people who do understand the problem.
For example, large companies project revenues and commit resources
and money to build machines based only on simulation. If the
simulation isn't pretty close, the prototypes won't work and won't
be fixable. The revenue won't come in and we will all starve.
Your skepticism would have been better addressed to some of the
other junk-science issues floating around, which rely on far cruder
computer modeling and simulation techniques, or perhaps just
wild-assed guesses.
del cecchi
Discussion of specific examples would bring more heat than light and be
off topic as well.
del cecchi.
On current Alpha systems, VMS allows you to move hardware resources
between software partitions without rebooting.
Jan