One of two things has been happening: either lots of other people have
had similar thoughts for similar reasons (as discussed in the
subthread on patents), or my comp.arch posts are extremely
influential.
If I had more time to spare than I actually do (or, actually, if I
could bring more focus to bear than I actually can), I'd have a web
page tracking something like an interconnect/flops index, indicating
the number of times the word interconnect appears in articles
discussing supercomputers, which articles *always* discuss flops. My
sense of the situation is that, in the time I have been annoying
people on this subject, the ratio of interconnect mentions has gone
from essentially zero to something that is noticeably closer to one.
In fact, I might propose something like a cluelessness index, which
would somehow measure the extent to which someone goes on and on about
flops without mentioning *anything* else, and certainly not the
interconnect problem.
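A rough sketch of how such an index might be computed, assuming each article is available as plain text. The function name `interconnect_flops_index` and the sample sentence are made up for illustration; the index here is simply the number of "interconnect" mentions divided by the number of "flops" mentions.

```python
import re

def interconnect_flops_index(text):
    """Ratio of 'interconnect' mentions to 'flops' mentions in one article.

    Returns None when the article never mentions flops, since the ratio
    is undefined in that case.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Count "interconnect", "interconnects", and so on.
    interconnect = sum(1 for w in words if w.startswith("interconnect"))
    # Count "flops", "petaflops", "teraflops", and the singular "flop".
    flops = sum(1 for w in words if w.endswith("flops") or w == "flop")
    if flops == 0:
        return None
    return interconnect / flops

# Hypothetical press-release fragment, not taken from any real article.
sample = ("The new machine delivers two petaflops. Peak flops are up, "
          "and the interconnect is a 3D torus.")
print(interconnect_flops_index(sample))
```

A value near zero would mark an article as a candidate for the cluelessness index; a value near one would mean the interconnect gets roughly as much attention as flops.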
Slashdot firehose pointed me to a candidate for the clueless
supercomputing literature index:
a press-release by-product of the usual players (no need for me to
invite lawsuits by naming names; the relevant players do not seem to
be able to get themselves named too often) saying the usual things,
except that the article addresses a second annoying question I've
started to ask, which is: "Why do we build these things AT ALL?"
The answer, not entirely clearly stated, is that if we gather enough
processors together in one warehouse, we will be able to save the
Pacific Coast from an apocalyptic seismic ending. The article never
*says* anything so explicit, of course.
"The point of building a supercomputer is to improve research
capabilities," the article intones, and the clear implication is that,
if we just do enough of it, we'll get better at it.
It would, as I estimate the matter, be hard for us to do worse. When
I pressed a computational geophysicist in private correspondence for
evidence that computation had delivered anything at all in terms of
improved prediction capability, I never did get a response. That
exchange took place when I started to go after Blue Gene hammer and
tongs, and, apparently, the same line, with the same lack of evidence,
is still in play.
I know a little bit of fluid mechanics, so I don't have to do a lot of
digging to savage our national weather and climate prediction
efforts. Here's an astonishingly candid bit about seismic prediction:
http://www.pnsn.org/INFO_GENERAL/eq_prediction.html
which begins:
"Because of their devastating potential, there is great interest in
predicting the location and time of large earthquakes. Although a
great deal is known about where earthquakes are likely, there is
currently no reliable way to predict the days or months when an event
will occur in any specific location."
The equivalent in disastrous weather prediction would be to say that
we know there will be things like tornadoes and hurricanes--we just
don't know where or when, and all the computing we are doing doesn't
give us a reliable methodology.
Nevertheless, we are continually reassured, more is better. I should
add that the subtext of my exchange with the computational
seismologist was that while *I* may not be satisfied with nearest-
neighbor interconnect machines, they work well in his problem area.
Well, I should clarify, they work just about as well as anything else,
which is to say, to a not entirely useless approximation, not at all.
Robert.
> Well, I should clarify, they work just about as well as anything else,
> which is to say, to a not entirely useless approximation, not at all.
Perhaps we could make another ranking: which are the most effective
"large" computers operating today? (are any also in the top500?) Make
notes on how their successors fell short, and/or which architecture
decisions could have been changed so that they were more effective,
rather than less.
Cheers,
--
Andrew
> If I had more time to spare than I actually do (or, actually, if I
> could bring more focus to bear than I actually can), I'd have a web
> page tracking something like an interconnect/flops index, indicating
> the number of times the word interconnect appears in articles
> discussing supercomputers, which articles *always* discuss flops. My
> sense of the situation is that, in the time I have been annoying
> people on this subject, the ratio of interconnect mentions has gone
> from essentially zero to something that is noticeably closer to one.
My impression is that the interest in interconnect comes and goes. It
has been very heavy in the past; the last few years have been
exceptional in that interconnect was underemphasized, partly in a "make
lemonade" manner (we have microprocessors that deliver large flops,
let's use them even if not balanced), but partly, even mainly, because
the last decade has seen a number of applications that can use more
compute than would normally be considered balanced with communications.
This is not just true of supercomputing. It is also true of mainstream
computing, viz multimedia (mainly video) and graphics.
It would be great if attention could somehow be shifted from
capability, like flops, to impact. That is to say, to focus more on
those circumstances in which a huge machine did things that
contributed to understanding in a way that a smaller machine could
not.
The impact of muscular compute farms on feature-length films is
obvious. Very focused computing efforts have played a big role in
contributing to molecular biology, but I don't think that flops have
had much of a role there at all.
Eugene made the case that one big machine can be more cost-effective
to manage than many smaller machines, but, then, why keep a list at
all? If you need the capability, buy it, manage it, and leave off the
press release. Who cares who has gathered the most flops under one
roof?
Do any of the machines in the Top500 run Matlab? That would be an
interesting question to ask, because more work now gets done in
science and technology using Matlab than using Fortran or C.
Robert.
>Do any of the machines in the Top500 run Matlab? That would be an
>interesting question to ask, because more work now gets done in
>science and technology using Matlab than using Fortran or C.
>
You might find this of interest:
http://www.eetimes.com/electronics-news/4210875/Sandia-upgrades-supercomputer-benchmarks
- Tim
Robert.
Whether it is justified or not, I shall always lay the blame for the
most recent tidal wave of stupidity at the feet of the Top 500 list
and the corresponding emphasis on a benchmark that rewarded design
that could be charitably described as lazy.
Of course, things wouldn't have shaken out quite the way they have if
the floating point throughput problem hadn't been so thoroughly,
cleverly, and completely worked over that it became almost impossible
to imagine keeping the fastest hardware possible fed. That is to say,
I could always switch to complaining that the processor architects
have done their part of the problem entirely too well. ;-)
Robert.
I haven't looked at the detail of the initial results, but I was a little
surprised to see an 8k-node/32k-core BlueGene/P on top. Most of the
other places are taken by much smaller-node-count Cray systems. I guess
that these results are very preliminary, and not particularly
representative, yet:
http://www.graph500.org/
Cheers,
--
Andrew
> Whether it is justified or not, I shall always lay the blame for the
> most recent tidal wave of stupidity at the feet of the Top 500 list
> and the corresponding emphasis on a benchmark that rewarded design
> that could be charitably described as lazy.
Here I wholeheartedly agree. Focusing on the Top 500 list distorts our
vision.
> Of course, things wouldn't have shaken out quite the way they have if
> the floating point throughput problem hadn't been so thoroughly,
> cleverly, and completely worked over that it became almost impossible
> to imagine keeping the fastest hardware possible fed. That is to say,
> I could always switch to complaining that the processor architects
> have done their part of the problem entirely too well. ;-)
Close, but not exactly that. I say this based on my peripheral
experience with supercomputing:
1) People say "the problem is interconnect, not the processor"
2) People say "we therefore don't need to make any changes or
investment in the processor. we can use an unmodified mass market PC
processor".
3) Then the interconnect guy does his work, and realizes the shortcomings
of the mass market PC processor. (Hi Keith U.!) He swears at the
processor architect saying "Why didn't you do this the way
supercomputers want?"
A: because PCs don't need it that way. Which isn't terribly interesting,
because it is unlikely the processor will be changed just for HPC.
More interesting is when what the HPC interconnect guy wants is a good
idea also for PCs - but a somewhat bleeding edge idea for PCs, that you
can't justify the design effort for. But if the HPC guys paid for it,
it might get used by the mass market, and all would end up happy.
4) But in step 2) they have already decided that they aren't going to
change the processor at all. So nothing gets done.
The problem IS interconnect, not usually the processor. But oftentimes
the interconnect cannot solve the problem without a little bit of
processor help.
Which doesn't arrive, because of the over-emphasis on interconnect in
isolation.
> More interesting is when what the HPC interconnect guy wants is a good
> idea also for PCs - but a somewhat bleeding edge idea for PCs, that you
> can't justify the design effort for. But if the HPC guys paid for it,
> it might get used by the mass market, and all would end up happy.
Think of my incessant whining as hoping that I, or someone else, or
perhaps a street mob of us, will embarrass the national security
establishment into abandoning a strategy for processors it adopted
almost two decades ago.
Instead of spending money strategically, the DoE shovels money at its
national labs, which in turn shovel money at established players whose
commitment does not include designing an entire computer top to
bottom, the way Seymour did.
The message is "give us good press," and that's what the industry
delivers. Everyone who colludes in this charade needs to own up to
their share of the blame: computational hacks, bureaucrats, companies
looking for easy publicity, and self-promoting self-aggrandizers
trying to find something to replace the TVA and centrifuge tubes in
Iraq as a justification for their continued presence on the national
payroll.
I know that you want to distance yourself from my heavy-handed
polemics, and I understand why. I can take the risk of saying things
the way I do. I want to emphasize that these are entirely my own
thoughts and my own words.
I also want to emphasize that I'm interested in solving problems, not
promoting the relative importance of some technology, like
interconnect.
Robert.
> I haven't looked at the detail of the initial results, but I was a little
> surprised to see an 8k-node/32k-core BlueGene/P on top. Most of the
> other places are taken by much smaller-node-count Cray systems. I guess
> that these results are very preliminary, and not particularly
> representative, yet:
>
> http://www.graph500.org/
>
It's entirely possible, of course, that someone had some other mission
for Blue Gene in mind all along. Remember the "weather satellites"
with which the National Reconnaissance Office started? As these
things go, of course, IBM would have been working the problem long
before the benchmark was announced.
Robert.
That I'm not sure of; the IBM machine I've encountered with the
weirdly unclear mission is the one described at
www.capsl.udel.edu/pub/doc/papers/mobs-isca05.ps.gz and the benchmark
you're looking at in these papers is rather like the RandomAccess part
of the HPCC benchmark mentioned at
http://icl.cs.utk.edu/hpcc/index.html and in the mobs-isca05 paper.
I think the 'must not chase pointers for more than one query at a
time' restriction in the benchmark being discussed here is a
problematic one, since that kind of batching is really quite often
applicable and is one of the better ways of dealing with latencies.
Tom
I think that trying to get the US administration to switch to an
industrial-policy approach for handling supercomputing is a doomed
quest, if only because the Japanese Fifth-Generation work failed so
resoundingly and so expensively.
You can't design an entire computer top to bottom profitably at
current tech level, it's going to be terribly expensive to try, and I
don't see this as a sensible thing to throw money at. Fund
mathematicians to figure out how to do what you want in some
transformed space where there's more locality, and accept that the
answer's quite likely to be 'no'.
Tom
> I think that trying to get the US administration to switch to an
> industrial-policy approach for handling supercomputing is a doomed
> quest, if only because the Japanese Fifth-Generation work failed so
> resoundingly and so expensively.
The US does have a high-tech industrial policy, and it works
incredibly well--so well, in fact, that it makes certain European
manufacturers absolutely crazy. The policy is to build and to fund
lavishly whatever the DoD thinks it needs.
Computer processors are a special case. I've been wanting for some
time to write a post taunting all those who thought that an 800MHz
Pentium III was all the world ever really needed in quantity. The
free hand of the market has worked almost unbelievably well--so well
that even DoD and/or DARPA money can't easily find a place at the
table.
> You can't design an entire computer top to bottom profitably at
> current tech level, it's going to be terribly expensive to try, and I
> don't see this as a sensible thing to throw money at.
To give credit where credit is due, the US government kept trying,
even after it formally decided (at least whenever possible) to get out
of the business of funding purpose-built military hardware that could
be bought off the shelf, and I don't think those efforts can be
regarded as complete failures.
In an ideal world, where there are no anti-trust considerations and no
conflict-of-interest considerations, the government could simply give
money to competent commercial players to adapt their products better
to the specialized needs of HPC, and, again to give credit where
credit is due, the government has (sort of) through DARPA tried to do
just exactly that.
What the government can't do is to do with Intel what I think Intel
more or less does with UC Berkeley: "Here's a bunch of money. Do
great things. Just make sure that some of your successes are on our
short list." Were such a thing possible (and it isn't by any route
that I can see), the government could subsidize those costs of
meeting some very specialized needs that the commercial market won't
justify.
My goal here is modest. I want the bureaucrats and public relations
departments that we the taxpayers pay for to stop putting their
efforts toward positioning themselves on the Top 500 list (and on the
front page of their local newspaper), and to put their significant
gamesmanship skills to work solving what I will agree is a very hard
bureaucratic problem, which is how to get government money to pay for
what private capital won't without killing the golden goose that
private capital has created.
> Fund
> mathematicians to figure out how to do what you want in some
> transformed space where there's more locality, and accept that the
> answer's quite likely to be 'no'.
Draw a red pencil through three quarters (or more) of the Top500 list
and give the money to real mathematicians.
I'll vote for that.
Robert.
> I think the 'must not chase pointers for more than one query at a
> time' restriction in the benchmark being discussed here is a
> problematic one, since that kind of batching is really quite often
> applicable and is one of the better ways of dealing with latencies.
I might have to dig around in my newly-opened box and find another
gold star. You can make *anything* look massively parallel if you
just get to do zillions of independent problems (like chasing zillions
of independent pointers) all at once. Kudos to the benchmark
designers if they saw that scam and ruled it out from the git-go.
Robert.
You think *you* were surprised? ;) The presentation began with Richard
Murphy eating his words.
The ranking was somewhat arbitrary this time, preferring size over speed.
Also, we didn't include the time to construct the representation given
the edge list. That would have significantly altered some results. I
believe future additions won't be as amenable to GUPS techniques.
One non-measured aspect: The BG/P result required multiple full-time
people working for a month. The Cray XMT results each required a single
person for an afternoon...
Jason
Any kind of WAG what it would cost, assuming one knew the kind of fabric
to build and the kind of moderately-tweaked cores/chips to plug into it,
to get performance that could at least be compared to the
chicken-factories that populate the current list?
paul
Well, as the DoD thinks what it needs is anything that is funded
lavishly, you have a slightly circular situation. Methinks you
are stretching the meaning of the word 'policy' beyond its yield
point.
>Computer processors are a special case. I've been wanting for some
>time to write a post taunting all those who thought that an 800MHz
>Pentium III was all the world ever really needed in quantity. The
>free hand of the market has worked almost unbelievably well--so well
>that even DoD and/or DARPA money can't easily find a place at the
>table.
Eh? It's assuredly all that we currently need for what we currently
do with desktops etc. All that the extra power of more modern CPUs
is used for is to run increasingly inefficient bloatware. The 'free
hand' of the market has ended up selling room heaters masquerading
as computers.
>> You can't design an entire computer top to bottom profitably at
>> current tech level, it's going to be terribly expensive to try, and I
>> don't see this as a sensible thing to throw money at.
>
>To give credit where credit is due, the US government kept trying,
>even after it formally decided (at least whenever possible) to get out
>of the business of funding purpose-built military hardware that could
>be bought off the shelf, and I don't think those efforts can be
>regarded as complete failures.
No - it has an extremely high standard when it comes to complete
failures - mere wastes of money don't even get to the short list.
Regards,
Nick Maclaren.
I had not noticed this part of the spec. It's mainly good, but there is
a downside.
I think that we need to report both the single thread pointer chasing,
and also the multithread. As the number of threads, N, varies from 1 to
large.
Reason for single thread pointer chasing: as Robert says, it can be
broken badly.
Reason for multithread: because it is pretty common to have multiple
threads traversing a graph at the same time. Perhaps not an infinitely
large number, but a reasonable number. And it is important to see if
the system scales as the number of such pointer chasing threads
increases, or if it saturates.
For that matter, it is also important to look at accessing data at the
pointer, and operating on it, and not just chasing pointers - since,
believe me, there are optimizations that apply just to chasing pointers
and not to real workloads. Usually one does something at the nodes one
visits.
Something like

    FOR each of #threads threads
        FOR each randomly chosen starting point p
            visit(p) - operating on data in the struct pointed to by p,
                       and also on stuff p points to,
                       to some moderate depth (1-4)
            p = p->next[random(i)]
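The loop above can be sketched in runnable form as follows. This is a single-walker illustration only: the outer loop over threads is left out, and the names `build_graph`, `visit`, and `chase`, the graph size, and the seed are all assumptions made for the example, not part of any benchmark specification.

```python
import random
import time

def build_graph(n_nodes, out_degree, rng):
    """Each node holds a payload and `out_degree` randomly chosen successors."""
    return [
        {"data": rng.random(),
         "next": [rng.randrange(n_nodes) for _ in range(out_degree)]}
        for _ in range(n_nodes)
    ]

def visit(graph, p, depth):
    """Touch the data at node p and at nodes reachable within `depth` hops."""
    total = graph[p]["data"]
    if depth > 0:
        for q in graph[p]["next"]:
            total += visit(graph, q, depth - 1)
    return total

def chase(graph, start, steps, depth, rng):
    """Walk `steps` hops from `start`, picking a random out-edge at each hop."""
    p, acc = start, 0.0
    for _ in range(steps):
        acc += visit(graph, p, depth)     # do some work at the node
        p = rng.choice(graph[p]["next"])  # p = p->next[random(i)]
    return p, acc

rng = random.Random(12345)  # fixed seed, so the walk is repeatable
graph = build_graph(n_nodes=10_000, out_degree=4, rng=rng)
t0 = time.perf_counter()
end, acc = chase(graph, start=0, steps=1_000, depth=1, rng=rng)
elapsed = time.perf_counter() - t0
print(f"ended at node {end} after {elapsed:.4f} s")
```

Running several walkers and varying their number N would be the next step; in an interpreted language this only shows the shape of the loop, not realistic memory-system behavior.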
The p->next[random(i)] can be subsumed by a random graph generator.
Then you're timing the traversal and not the *parallel* PRNG. See the
SSCA2 benchmark at http://www.graphanalysis.org/benchmark/index.html for
one example using a perturbed R-MAT generator.
The algorithm used (betweenness centrality) does reduce to sparse
matrix-sparse vector product. Others we work on (k-betweenness
centrality, agglomerative modularity maximization) don't.
There appears to be forward motion on these benchmarks. The criteria
for the DoD UHPC project include some non-trivial graph problems
(e.g. monitoring connected components, which isn't trivial with
deletions on graphs of small diameter).
Jason
I resent that statement; I am using price-performance-optimal
computers to run exquisitely well-written code for factorising large
numbers, letting me do in a week on machines that fit in my study and
use less than a kilowatt jobs that took a large collaboration (300
fast PCs at twelve sites for four calendar months, plus supercomputer
resources for six calendar weeks) in 1999.
At work, I use a machine a bit faster and twice the cost of the
price-performance-optimal one to be able to do useful statistics on
full reoptimisations of every deposited protein structure at a rate
eight times faster than they are being found.
And, yes, for jobs for which Atom boxes are adequate, we have Atom
boxes with SSDs which are silent and whose electricity price is
trivial.
Tom
For turning designs into actual things, first approximation would be
just the cost for the TSMC NRE - on their current GPU/FPGA process,
say the better part of ten million dollars per chip type. The chips
are then $5000 per wafer which gets lost in the noise provided your
yield is respectable; double that to package them; board fabrication
and population, even for something ludicrous with twenty layers, is at
worst comparable to the chips. RAM is on these scales effectively
free if you're using DDR3 or GDDR5; otherwise it's another full-custom
ASIC and you need it in quantities where you might no longer lose the
per-wafer costs in the noise.
So you'd get a fair way with a hundred million dollars given a full
set of designs done so as to use no more than three or four sorts of
chip. If you don't have a lot of people skilled in
design-for-manufacture (and there aren't that many such people and
ATI, nVidia, Altera and Xilinx probably are prepared to go to some
lengths to retain them), factor in more than one spin of the chips,
which might as much as double the price.
If you're doing this with US national-security funding they may stick
on obligations requiring you to use IBM rather than TSMC to fab the
chips, which probably also costs more and takes longer.
It's not completely infeasible; an order of magnitude cheaper than a
spy satellite.
Tom
I didn't mean to put the PRNG in the benchmark.
But, no, the random graph generator does not subsume the randomness of
the walk.
There are optimizations that work well if you always choose the same
path out of a node, or the same path out depending on the path in. (In
general, pointer chasing prediction is very much like branch
prediction). You don't want to make life too easy for those
optimizations - only as easy as it is in the real world, and no easier.
So, yes, you want to generate a random graph, with many paths out of any
given node.
But then, when traversing, you want also to select randomly which path
to take out. Not always take the same path out.
---
Worse comes to worse, you might read the random numbers out of a buffer,
pregenerated - just adding a bit of stream traffic to the benchmark.
But IMHO it might be better to actually be doing something - e.g.
traversing a random B-tree (embedding within your overall graph),
looking for a randomly chosen key. Should be good enough.
That will not fly. There are many strong voices demanding exactly
reproducible choices. I think it's absurd considering all the other
non-deterministic choices in most runtime environments, but I've lost
that argument many times. Too many people who "agree" still capitulate.
> Worse comes to worse, you might read the random numbers out of a
> buffer, pregenerated - just adding a bit of stream traffic to the
> benchmark.
The hardware and programming part are not the real issue. The
specification and adoption are.
> But IMHO it might be better to actually be doing something -
> e.g. traversing a random B-tree (embedding within your overall graph),
> looking for a randomly chosen key. Should be good enough.
Now you're dictating a data structure that's not necessarily appropriate
for some architectures. For example, the overhead of a B-tree is silly
on the XMT.
I'd much rather sample a few random graphs each with a few random vertex
permutations and let the algorithm choices fall as they may. But that's
also assuming you time the *construction* of data structures that may
attempt to canonicalize (or evenly spread) the graphs.
Jason
Bizarre. That you resent it, I mean :-)
Yes, I was over-stating, because it applies only to 'normal' desktop
use, not specialist activities like breaking RSA encryption. But I
don't believe that you have a desktop today that is 100 times as
powerful as an 800 MHz Pentium III, which is what you imply. Reread
what I said.
>At work, I use a machine a bit faster and twice the cost of the
>price-performance-optimal one to be able to do useful statistics on
>full reoptimisations of every deposited protein structure at a rate
>eight times faster than they are being found.
On a desktop?
>And, yes, for jobs for which Atom boxes are adequate, we have Atom
>boxes with SSDs which are silent and whose electricity price is
>trivial.
Small, not trivial. For trivial, try my hearing-aids - 5 cores at
20 MIPS and run for ages on a 13ZA silver battery :-)
However, overall, if you had remembered any of my previous posts,
you would have understood my posting better. I was not saying that
we should have stayed with an 800 MHz Pentium III, but that the
developments since have been in a wrong direction - one that is
non-sustainable, in most of its senses.
Regards,
Nick Maclaren.
> In article <QUu*NJ...@news.chiark.greenend.org.uk>,
> Thomas Womack <two...@chiark.greenend.org.uk> wrote:
> >
> >>Eh? It's assuredly all that we currently need for what we currently
> >>do with desktops etc. All that the extra power of more modern CPUs
> >>is used for is to run increasingly inefficient bloatware.
> >
> >I resent that statement; I am using price-performance-optimal
> >computers to run exquisitely well-written code for factorising large
> >numbers, letting me do in a week on machines that fit in my study and
> >use less than a kilowatt jobs that took a large collaboration (300
> >fast PCs at twelve sites for four calendar months, plus supercomputer
> >resources for six calendar weeks) in 1999.
>
> Bizarre. That you resent it, I mean :-)
>
> Yes, I was over-stating, because it applies only to 'normal' desktop
> use, not specialist activities like breaking RSA encryption. But I
> don't believe that you have a desktop today that is 100 times as
> powerful as an 800 MHz Pentium III, which is what you imply. Reread
> what I said.
An 800 MHz Pentium III would have trouble with most YouTube videos these
days.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
It would. Even ignoring the very dubious 'need' for such things,
my statement about increasingly inefficient bloatware stands.
Regards,
Nick Maclaren.
Oh for ***'* ******* sake. I try rarely to get annoyed, but what you say
is so silly that you should think about it.
When I suggest making a pseudo-random choice of which edge out to
follow, that is deterministic. Pseudo-random is deterministic. Start
with the same seed.
Yes, you want benchmarks that are precisely deterministic. Heck, I'm
the guy who advocates making as much of parallel programming
deterministic as possible (see
http://semipublic.comp-arch.net/wiki/Proposals_for_Increasing_Determinism_of_Parallel_Programs,
and many posts on this newsgroup).
> I'd much rather sample a few random graphs each with a few random vertex
> permutations and let the algorithm choices fall as they may. But that's
> also assuming you time the *construction* of data structures that may
> attempt to canonicalize (or evenly spread) the graphs.
If by "a few random vertex permutations" you mean "a few randomly chosen
paths through the graph", then I would agree with you. But I don't
think that's what you are saying.
Random does not necessarily mean uniform. You may want random
distributions that mimic workload statistics - e.g. Zipfian for long
tail distributions.
And if you are modeling real world stats, you have to deal with locality.
All I am saying is that you should not add more locality than reality
desires.
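To make the two points above concrete (a seeded pseudo-random choice is reproducible, and random need not mean uniform), here is a small sketch that picks out-edges with Zipfian weights. The helper names `zipf_weights` and `sample_edges`, the exponent `s`, and the seed are illustrative assumptions.

```python
import random
from collections import Counter

def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: rank k (1-based) gets weight 1 / k**s."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]

def sample_edges(out_degree, n_samples, seed, s=1.0):
    """Draw out-edge indices with Zipfian rather than uniform probability."""
    rng = random.Random(seed)  # same seed, same sequence of choices
    weights = zipf_weights(out_degree, s)
    return rng.choices(range(out_degree), weights=weights, k=n_samples)

picks = sample_edges(out_degree=8, n_samples=10_000, seed=42)
counts = Counter(picks)
for edge in range(8):
    print(edge, counts[edge])
```

Low-numbered edges are chosen far more often than high-numbered ones, which is the kind of long-tail skew a workload model might want, and rerunning with the same seed gives the same picks.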
And the rest of the software and hardware stack is non-deterministic.
We either can pretend we live in an artificial world or adopt
methodologies used in real sciences. Start off assuming there are
variabilities outside our control, and analyze what data we find. Real
sciences already have blazed the trail.
Which isn't to say that we should ignore determinism in the small. I
just don't think it's a reasonable target once we start combining pieces
from many different groups with different objectives. We need to limit
the effects of non-determinism, but a benchmark that considers and
reports those effects directly seems more valuable to me than one that
pretends they don't exist. But I already have lost that argument.
Don't discount the overhead in either processing, communication, or
memory for a deterministic PRNG coordinated across millions of
processing elements.
> Yes, you want benchmarks that are precisely deterministic.
Want is not the same as have. Perhaps I'm trying to tackle too many
practices at once.
> If by "a few random vertex permutations" you mean "a few randomly
> chosen paths through the graph", then I would agree with you. But I
> don't think that's what you are saying.
Mostly equivalent for the algorithms I have in mind (BC, MIS, etc.).
Not perfect, but sampling multiple vertex permutations of the same base
graph would smooth out the difference (short of someone implementing
graph isomorphism as a preprocessing step).
> All I am saying is that you should not add more locality than reality
> desires.
Desires? Reality in the form of current architectures desires a ton of
locality but reality in the form of problems to solve provides almost
none.
Jason
Not quite, since the default resolution and codec is still VGA and MPEG2:
We could decode 30 Hz DVDs with zero frame loss on a 233 MHz Pentium MMX
back around 1997.
As soon as you throw in h264 and higher resolution you're absolutely
right. :-)
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
And a lot of people want things that are impossible. Determinism
in sufficiently complex systems is one of them, and it really is
about time that computer people admitted that. In particular,
determinism is incompatible with efficiency, when using MOST
parallel paradigms.
Live with it. Learn from statistics. Controlled non-determinism
isn't a problem, if you have the right attitude.
Regards,
Nick Maclaren.
On one hand easily reproducible cases are nice to investigate. I would
argue that a non-deterministic benchmark gives better information to the
rest of us, providing a large enough sample of runs to be meaningful is
taken. This may mean reducing the benchmark problem size but
min,max,mean,median,standard deviation are more informative for people
looking at the benchmarks when making a decision.
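A minimal sketch of that kind of reporting, using only the Python standard library. The workload here is a stand-in and the names `time_runs`, `summarize`, and `toy_workload` are hypothetical; a real benchmark would time its own kernel.

```python
import statistics
import time

def time_runs(workload, n_runs):
    """Run `workload` n_runs times and return the wall-clock time of each run."""
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        workload()
        times.append(time.perf_counter() - t0)
    return times

def summarize(times):
    """Summary statistics for a list of run times (needs at least 2 runs)."""
    return {
        "min": min(times),
        "max": max(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),  # sample standard deviation
    }

def toy_workload():
    # Placeholder for the real benchmark kernel.
    sum(i * i for i in range(100_000))

print(summarize(time_runs(toy_workload, n_runs=20)))
```

Reporting the spread alongside the mean lets a reader judge whether two systems actually differ or merely had different luck on a single run.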
On 11/26/2010 08:43 AM, Andy "Krazy" Glew wrote:
> Heck, I'm the guy who advocates making as much of parallel programming deterministic
> as possible (see http://semipublic.comp-arch.net/wiki/Proposals_for_Increasing_Determinism_of_Parallel_Programs,
> and many posts on this newsgroup).
I'd need to refresh myself on your proposals but in general I can see it
being harmful, if it limits our choices too much in the future. Then we
have the case where extra determinism could actually do more harm than
good when it comes to parallel programming.
Cheers,
Ze
Have you any decent links to the semiconductor technology used
in hearing aids? Looking at it has been on my "bucket list"
for a long time simply because their optimisation criteria
are so different from those used by the rest of the semiconductor
industry.
thanks
Nope. Sorry. I asked out of curiosity, but am not deeply curious.
Actually, their criteria are quite similar to those of a lot of
other embedded fields - the sort of chips that go into smart cards,
for example.
Regards,
Nick Maclaren.
IBM has some of the requisite people. Don't forget you would probably
need a couple hundred folks for two or three years at 1/3 Mdollars per
person. So that is another couple hundred million there. I would say
you could get a couple of machines running for maybe a Billion, in
round numbers.
>
Edits above.
Jason
I'm kind of on hold to see what the experience with Blue Waters is
like.
In the meantime, I want to make the petaflop empire-builders out to be
the unimaginative self-aggrandizers they are.
It would be better to build lots of smaller experimental computers,
with or without graduate student slave labor, than to keep up the
current absurd petaflops race.
Robert.
Oh, there are acceptance tests *before* it's called Blue Waters. I'm on
hold w.r.t. those tests... and looking forward to finding out one way or
another for graph problems.
Jason
There is a substantial portion of the work involved that cannot be
efficiently done by grad students. They do not necessarily have the
practical skills, although I am sure that some do. But I have seen a
lot of new hires over the years and most of them didn't show up with
those types of skills. We are talking about an I7 or a Power7 class
processor, together with all that is required to test, package, power,
and cool it in a Supercomputer class machine.
I am skeptical.
What's an I7 class processor?
I think I know Power7.
Latest Intel x86 nomenclature?
i7, i5 and i3, with i7 the high end.
Well, yes and no. I agree with your points about the quality of
graduates, but doubt the number of people. For a processor like
those, yes, but for a completely new and much simpler and cleaner
design, a couple of dozen sounds more plausible. Still not cheap.
Of course, I am doing my usual, and assuming that the software gets
a kick where it does most good and is sorted out to make it easier
to get performance on simpler and cleaner designs. If we start by
assuming that the existing junk is sacrosanct, no major improvement
is possible, ever.
Regards,
Nick Maclaren.
Yeah, what he said is what I meant. I have been PC shopping. What a
maze of slippery slopes.
So there is some magic ISA/microarchitecture that will allow some
relatively small easy to design chip to magically surpass something
like power7 or the latest high end thing from Intel? I am even more
skeptical.
Del, please don't pretend to be a clot, because it doesn't work.
The reason that those chips take a small army of architects is
because they are fiendishly complicated, which in turn is caused
by the desire to make them perform well on the existing spaghetti
codes. There were many older, good, designs that took the number
of people I mentioned, and it wouldn't take much more to produce
a modern one along the same lines using much larger limits and
modern processes.
As we all know, such a design would have a native (raw) performance
many times that of the chips you mention, because it would have
vastly many more cores. It would, however, probably be infeasible
to deliver full coherence across the whole CPU - but, as I have
posted several times before, that requirement is an Article Of Faith
and not actually a software prerequisite.
And, of course, its serial and close-coupled performance on spaghetti
codes would be worse. HOWEVER, look at the sort of thing that Mitch
and others were proposing. Their performance on entangled spaghetti
will be dire, because of all of the backing off needed. My strong
assertion is that we CANNOT do a lot better, HOWEVER we do it, until
and unless we tackle that problem - and it's a software one, not a
hardware one.
But let's get back to the latest and greatest chips from the major
players. I haven't been closely involved with benchmarking them
for a few years, but have contacts with more recent experience (not
Power, though). ALMOST ALL of their performance gains come from
simple scaling up - which I accept is not as trivial as the naive
think - rather than their incredible ingenuity.
What we are doing is heading up a mountain valley - it is getting
harder and harder to proceed as we get further, and our progress
is getting slower and slower. Dogma says that this is the way
forward, but I believe that we are heading straight up an
unclimbable mountain, and need to back off and try another route.
Regards,
Nick Maclaren.
> What we are doing is heading up a mountain valley - it is getting
> harder and harder to proceed as we get further, and our progress
> is getting slower and slower. Dogma says that this is the way
> forward, but I believe that we are heading straight up an
> unclimbable mountain, and need to back off and try another route.
You may very well be right, but do we have the methods for writing new
software appropriate to these different valleys reasonably easily?
The certainty that it ought to be possible needs to be turned into
practical techniques before adoption (and probably, some evolution) is
possible.
--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.
No - quite the converse.
>The certainty that it ought to be possible needs to be turned into
>practical techniques before adoption (and probably, some evolution) is
>possible.
That is what everyone says, and it is a classic error. The one case
when it is correct to take a wild gamble on something that you have
no evidence is possible is when the alternative is to carry on doing
something that you know is certain to fail. And, regrettably, that is
the case here.
Unfortunately, human psychology militates against that - hope springs
infernal in the human breast :-(
Regards,
Nick Maclaren.
Architects? I wasn't even thinking about them. I was thinking about
layout, simulation, timing, circuit design etc. That is where my
experience lies. And the idea that somehow a few grad students can
knock out a system that will provide world class performance flies in
the face of my personal experience down next to the iron. Or silicon.
Whatever.
Can you explain the word "entangled" with respect to "spaghetti"?
Mitch
This is, in effect, the Heisenberg Uncertainty Principle applied to
<large scale> multiprocessing.
Mitch
No, but you can get within a factor of 2X with 10% of the effort.
Mitch
Oh, right, but I wasn't thinking about students, either - for heaven's
sake, I teach graduates at a top university and don't have too many
illusions.
I was thinking about a few dozen good, experienced engineers. The
point is that there have been a lot of leading chips produced with
less than that, admittedly starting from previous designs. The keys
to my thinking are scalability and orthogonality - where you don't
have interactions, you don't have problems with them.
For example, I would expect the logic to be no more complex than a
fairly decent mid-1980s design, though it would use later methods
where appropriate. But there would be good primitives to allow those
to be used in parallel far more than at present (on general CPUs).
Regards,
Nick Maclaren.
Do you have long spaghetti where you are? Anyway, it's not too
hard to deal with when it's fairly separate, but it becomes impossible
when all of the strands tie themselves into a knot. The Alexandrine
technique becomes the only feasible one :-)
A lot of modern software is like that. Seriously.
Regards,
Nick Maclaren.
Yes, with the proviso that you are talking about systems where the
complexity increases with size.
Regards,
Nick Maclaren.
And that is my point. A 'raw' performance of 0.5x, an increase in
parallelism of 16x, and a reduction in overall software efficiency
of 0.5x gives a factor of 4x increase. I believe that all of those
numbers are reasonable estimates, and feasible for a reasonable
effort.
A 'raw' performance of 0.25x, an increase in parallelism of 256x,
and a reduction in overall software efficiency of 0.25x is much
more interesting, but a LOT harder to achieve on the software front.
Note that I am NOT talking about a software efficiency of 50% or
25%, but a reduction relative to currently delivered efficiencies.
Regards,
Nick Maclaren.
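The arithmetic in the post above is just a product of three relative factors. A small sketch (the function name net_speedup is hypothetical; the numbers are the ones given in the post):

```python
def net_speedup(raw_per_core, parallelism, software_efficiency):
    """Multiply the three factors, each relative to today's delivered figures."""
    return raw_per_core * parallelism * software_efficiency


# First scenario: half the raw per-core speed, 16x the parallelism,
# half the currently delivered software efficiency.
modest = net_speedup(0.5, 16, 0.5)

# Second scenario: quarter speed, 256x parallelism, quarter efficiency.
ambitious = net_speedup(0.25, 256, 0.25)

print(f"modest:    {modest:.0f}x")
print(f"ambitious: {ambitious:.0f}x")
```

The first case reproduces the 4x stated in the post; the second works out to 16x, which is presumably why it is called "much more interesting".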
Well, there is actual real data on this - the recent SiCortex
experiment. With the first 20 million dollars, we delivered 5800 core
systems using an in-house designed ASIC containing an assembly of
commercially available chip IP and full custom interconnect. The cores
were MIPS and suffered from some memory bandwidth issues, but the
systems were performance competitive and used 1/10 the power.
The interconnect was so good that the large system held the #3 spot
worldwide for GUPS. (GUPS 4.6)
The chip design team for this was about 10, plus about 10 for
verification. (I was on the software side.)
The second generation had a new core design as well, but the team
wasn't any bigger.
These days, the tools are good enough that if you have the <right> 20
people you can build your own machine and get it to work first pass -
for about $20 million.
IIRC, the yields were so good that we were able to ship fast corner
chips and boost the clock by 40%.
-Larry
While I agree that a lot of modern SW is like that, I disagree that
the core I proposed would be at any useful disadvantage WRT great big
modern OoO machines on these kinds of codes, once you account for
targeting 0.5 I/c rather than 1.0 I/c. In many
situations where the predictability is "not so good" the thin light
core has a distinct advantage because the latency between instruction
issue and branch resolution is so much smaller.
Mitch
Alas, yes.
> I was thinking about a few dozen good, experienced engineers.
I was thinking a horde of grad students (~30) plus a handful of faculty
(~8) and experienced engineers dropping in and consulting every month or
so. There are many examples of "challenge problems" tackled by such
groups. This would need donations from some fab, however...
High-risk, high-reward hardware isn't flying in the commercial space
right now. All the focus is on existing codes, even if RFPs are written
pretending it's not. And some of the "semi-academic" projects I'm
seeing are "high-risk" in the wrong ways (economic risk, not idea risk).
Such a project may just flop, but start a few at different places and
one may work. The VC model does sometimes pay out.
(And I'm ignoring the patent issue. HW design is a minefield.)
Jason
I am not disputing that, with your qualification. The performance
is currently dire, too! What I was saying is that it would be no
better, relatively. That sort of code is beyond redemption, and
will knacker ANY design.
Regards,
Nick Maclaren.
> (And I'm ignoring the patent issue. HW design is a minefield.)
>
That's why the project needs to be at least nominally a DARPA project.
If you are National Security, you can snap your fingers at anything.
Robert.
This is presumably the reason several chips have gone in the
multi-threading direction:
With 4 threads/core and a maximum instruction latency of 4 cycles, you
can run most any kind of spaghetti code with zero bubbles.
The same goes for branching, a core like the 486 or Pentium only needed
3-5 cycles to recover from a missed branch, so with a round robin
scheduled multithreaded core you might not even need any branch prediction?
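A toy model of the round-robin argument above, under loudly stated assumptions: strict round-robin issue, one instruction per cycle, and every thread is a single fully dependent chain (the worst case for a single thread). The function name count_bubbles is made up for this sketch.

```python
def count_bubbles(threads, latency, cycles):
    """Count issue slots wasted by a strict round-robin (barrel) core.

    Assumption: each thread's next instruction depends on its previous
    one, so a thread cannot issue again until `latency` cycles have
    passed since its last issue.
    """
    ready_at = [0] * threads  # first cycle at which each thread may issue
    bubbles = 0
    for cycle in range(cycles):
        t = cycle % threads            # whose turn it is this cycle
        if ready_at[t] <= cycle:
            ready_at[t] = cycle + latency
        else:
            bubbles += 1               # slot wasted: result not ready yet
    return bubbles


for n_threads in (1, 2, 4):
    wasted = count_bubbles(n_threads, latency=4, cycles=1000)
    print(f"{n_threads} thread(s), latency 4: {wasted} bubbles in 1000 cycles")
```

With 4 threads and a 4-cycle latency every thread's result is ready by the time its turn comes round again, so no slots are lost; with fewer threads the same chain leaves holes. Reading `latency` as the branch resolution delay gives the same picture for mispredicts. The model says nothing about instruction fetch or cache misses.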
I confess to being somewhat confused by the diffuseness of your
discontent. I'd like to be able to run a certain class of bigger
highly nonlinear problems and, when asked why, I can give a specific
answer, even though I think almost no one understands the answer I
give.
Even if you believe in the integrity of a class of artificially
localized computations (and I don't), they have become intolerably
expensive to run even with *no* interconnect (or practically none) and
legions of civil-service type employees spending arbitrarily long
periods of time getting anything to run at all. That's the problem
Blue Gene "solved."
Others are concerned about performance on graph problems and are doing
something about it. What problem do you need solved that isn't being
addressed? Please be specific.
Robert.
Mitch
cf. Northstar/pulsar/sstar. Watched that. 2X and you are out of
business.
You still may need assistance prefetching the instructions into cache.
Jason
Maybe, and maybe not. There was and is a LOT more to success and
failure in that area than performance. I haven't been closely
involved with IBM since about 1994, but know that the politics and
general, er, management problems continued thereafter. Whether they
contributed to the POWER4 debacle, I can't say.
Also things have changed. There is now a BIG potential market for
energy-efficient computing (including in HPC), and dropping a factor
of two in serial performance isn't the killer it used to be. What
still matters, of course, is performance/whatsit, where whatsit may
be dollar, watt or cubic metre. I was one of the first people to
specify that in a HPC procurement, but now it is common.
Regards,
Nick Maclaren.
Yeah, but since the branch mispredict penalty is so low, you really
don't need to spend such a huge effort on branch prediction, right?
> Maybe, and maybe not. There was and is a LOT more to success and
> failure in that area than performance. I haven't been closely
> involved with IBM since about 1994, but know that the politics and
> general, er, management problems continued thereafter. Whether they
> contributed to the POWER4 debacle, I can't say.
POWER4 debacle? Is that like the iPhone 4 antennagate?
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Sounds like LRB advocacy.
I know that you were personally involved, but try to be objective. The
LRB story so far seems to prove once again that there are no simple
solutions for complex problems. Or, more precisely, there are no good
simple solutions as long as the complex problem is sufficiently
generic.
I was waiting for someone to notice this. :-)
It is actually much more Sun Niagara advocacy, Larrabee is just one of
the chips that have followed them in the same direction, i.e. throughput
instead of peak computing.
What do you think then of T4 reportedly dropping the core count to
boost frequency?
rick jones
--
It is not a question of half full or empty - the glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Reviving this sub-thread to add:
Consider 2 machines:
A SIO (small in-order) machine targeting 0.5 I/c from a max of 1.0 I/c
A GBOoO (Great Big OoO) machine targeting 1.0 I/c from a max of 4.0 I/c
Running spaghetti code with low predictability and with a large cache
miss footprint (like some DB codes) ends up causing the GBOoO machine
to run at 0.1 I/c.
It turns out that the SIO machine will be running very close to 0.09
I/c on such a code sequence.
Thus, the GBOoO machines are really stellar only when running codes
that fully exploit their immense talent at extracting ILP. When the ILP
is not present, they run only fractionally faster than the SIO
machines.
Mitch
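One way to see where the 0.1 versus 0.09 figures could come from is a first-order CPI model. This is a sketch under an assumption that is not stated in the post: both machines pay the same number of stall cycles per instruction on this code, with no overlap of misses. The function name ipc_with_stalls is hypothetical.

```python
def ipc_with_stalls(target_ipc, stall_cycles_per_instr):
    """IPC after adding a fixed per-instruction stall cost to the base CPI."""
    base_cpi = 1.0 / target_ipc
    return 1.0 / (base_cpi + stall_cycles_per_instr)


# Assumption: choose the stall cost that drags the big OoO core from its
# 1.0 I/c target down to the 0.1 I/c quoted in the post.
stall = 1.0 / 0.1 - 1.0 / 1.0      # 9 stall cycles per instruction

gbooo = ipc_with_stalls(1.0, stall)  # CPI 1 + 9 = 10
sio = ipc_with_stalls(0.5, stall)    # CPI 2 + 9 = 11

print(f"GBOoO: {gbooo:.3f} I/c")
print(f"SIO:   {sio:.3f} I/c")
print(f"ratio: {gbooo / sio:.2f}x")
```

When stalls dominate, the base CPI difference (1 versus 2 cycles) is small next to the 9 stall cycles, so the big core ends up only about 10% ahead. Any ability to overlap misses changes the stall term for one machine and not the other, which this sketch deliberately ignores.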
But you must also consider
SOO (small out-of-order machine), targeting 0.7 I/c from a max of 1.0 I/c.
On a lot of your database workloads there is quite a lot of MLP -
multiple cache misses in flight. Which would be lost when your SIO
blocks on the first cache miss, or use thereof.
If true, that means that they have a large percentage of customers who
do care about the speed of old single-thread programs. :-)
> On a lot of your database workloads there is quite a lot of MLP -
> multiple cache misses in flight. Which would be lost when your SIO
> blocks on the first cache miss, or use thereof.
Actually mine would detect up to 3 misses before noticing the first
one missed--and process accordingly.
But the argument has merit--depending on the footprint of the core and
the power consumed to get OoO activities.
Mitch
As apparently does Google:
“Brawny cores still beat wimpy cores, most of the time”, Urs Hoelzle,
IEEE MICRO (2010) (to appear).
===
My memory seems to recall that Urs Hoelzle was going around making the
pitch for lots of small cores for Google workloads. The pitch that may
have led to Niagara, and in part led to some non-graphics Larrabee
predecessor proposals.
Reference?
ISTM that the Niagara series should go superscalar.
If I understand correctly, current cores have four
issue slots--two isolated ALUs, a shared multiplier,
and a shared LSU--but can only issue one instruction
from each thread per cycle. It might be practical
to modestly tweak T2/T3 cores to enable 2-way
superscalar activity for a significant (>20%?)
single-thread performance boost. (Speculative MT
and/or runahead might also be somewhat practical as
a somewhat moderate extension to the existing design.)
Paul A. Clayton
just a technophile
Not a fair way to ask the question.
Consider N between 8 and 64:
Do N brawny cores beat 8*N wimpy cores? Equal area/equal power way of
asking the question.
Mitch
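The equal-area question can be poked at with an Amdahl-style sketch. Everything here is an assumption for illustration: the relative wimpy-core speed of 1/3 is borrowed from the "three times slower" figure quoted further down this thread, the serial fractions are made up, and job_time is a hypothetical helper, not anyone's published model.

```python
def job_time(cores, core_speed, serial_fraction):
    """Amdahl-style time for one unit of work.

    `core_speed` is per-core speed relative to a brawny core (1.0);
    the serial part runs on one core, the rest spreads over all cores.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction / cores) / core_speed


N = 8
WIMPY_SPEED = 1.0 / 3.0  # assumption: wimpy core is 3x slower per thread

for f in (0.0, 0.02, 0.1):
    brawny = job_time(N, 1.0, f)
    wimpy = job_time(8 * N, WIMPY_SPEED, f)
    winner = "wimpy" if wimpy < brawny else "brawny"
    print(f"serial fraction {f:.2f}: brawny {brawny:.4f}, "
          f"wimpy {wimpy:.4f} -> {winner}")
```

Under these assumptions the 8*N wimpy cores win when the work is almost perfectly parallel and lose once the serial fraction reaches even ten percent, so the answer depends on the workload, not just on area or power.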
I get a different takeaway message, and it's just another version of
what I've been harping on for years--the way we design and build
hardware has implications for choices made by everyone who uses the
hardware. In particular, it affects the cost of development, the
style of software, and even what people will regard as possible or
plausible. It *isn't* about old code. It's about new code that you
are writing with new hardware constraints and capabilities in mind:
"Cost numbers used by wimpy-core evangelists always exclude software
development costs. Unfortunately, wimpy-core systems can require
applications to be explicitly parallelized or otherwise optimized for
acceptable performance. For example, suppose a Web service runs with a
latency of one second per user request, half of it caused by serial CPU
time. If we switch to wimpy-core servers, whose single-threaded
performance is three times slower, the response time doubles to two
seconds and developers might have to spend a substantial amount of
effort to optimize the code to get back to the one-second latency.
Software development costs often dominate a company’s overall technical
expenses, so forcing programmers to parallelize more code can cost more
than we’d save on the hardware side. Most application programmers
prefer to think of an individual request as a single-threaded program,
leaving the more difficult parallelization problem to middleware that
exploits request-level parallelism (that is, it runs independent user
requests in separate threads or on separate machines). Once cores
become too slow to make this view practical for most applications, you
know you’re in trouble."
I have my own version of this complaint, which I have repeated ad
nauseam.
Robert.
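The latency example in the quoted passage is easy to check. A minimal sketch, assuming (as the quote implies) that only the serial CPU portion of the request slows down and the rest of the latency is unchanged; the function name response_time is hypothetical:

```python
def response_time(total_s, serial_cpu_fraction, cpu_slowdown):
    """Request latency after the serial CPU part becomes `cpu_slowdown`x slower."""
    cpu = total_s * serial_cpu_fraction
    other = total_s - cpu          # network, disk, etc.: assumed unchanged
    return other + cpu * cpu_slowdown


before = response_time(1.0, 0.5, 1.0)
after = response_time(1.0, 0.5, 3.0)

print(f"brawny: {before:.1f} s, wimpy: {after:.1f} s")
```

This reproduces the article's figure: 0.5 s of non-CPU time plus 3 x 0.5 s of CPU time gives two seconds. Getting back to one second would mean recovering the full 3x on the CPU part by parallelizing or otherwise optimizing it.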
A quote from Hoelzle's[*] article:
>" ... Software development costs often dominate
>a company’s overall technical expenses, so forcing programmers to
>parallelize more code can cost more than we’d save on the hardware
>side. Most application programmers prefer to think of an individual
>request as a single-threaded program, leaving the more difficult
>parallelization problem to middleware that exploits request-level
>parallelism (that is, it runs independent user requests in separate
>threads or on separate machines). Once cores become too slow to
>make this view practical for most applications, you know you’re in
>trouble."
>
>I have my own version of this complaint, which I have repeated ad
>nauseam.
This is all well known, but I see it as an argument for heterogeneous
cores with asymmetric instruction sets rather than a general promotion
of complex cores. There is a need in general software both for low
latency context switching and for fast serial execution.
I'm not a hardware person and I don't really know what is possible in
the way of intelligent hardware trace scheduling if the cores have
symmetric ISAs but offer different levels of performance. However,
with asymmetric ISAs a trace scheduler can dispatch according to what
instructions are used. On the software side I am aware that it is
difficult for a compiler to decide automatically to target certain
traces toward a particular ISA, but as a practical matter, it is
relatively simple for a programmer to flag sections of code for
special handling.
George
[*] Can't seem to get my reader to produce an umlaut.
Urs answers that: probably. I.e. probably N brawny cores beat 8*N wimpy
cores. Not for all workloads, but for many.
---
I was thinking about "Dark Silicon" today, as presented at HotChips:
* #transistors per chip is still increasing
* but transistor speed is not increasing
* and transistor capacitance is levelling off
which implies: you can have more transistors, but they aren't any
faster, and you can only turn on the same number of them in the next
generation as you did in the old generation. I.e. if you have 2X more
transistors, the overall activity factor must be half. And possibly
less, because of increased wiring to connect those extra transistors.
So, what do you do with these extra transistors?:
Multicore? Bzztt... two or four instances of the same core, running at
the same frequency, will use 2X or 4X the power. You have to run them at
1/2 to 1/4 the frequency. Especially if we are in the range where
voltage scaling no longer works.
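The power arithmetic above follows from the usual first-order dynamic power relation, P proportional to C * V^2 * f. A rough sketch, assuming identical cores, fixed capacitance per core, and no leakage; the function names are made up for illustration:

```python
def relative_power(cores, freq_scale, voltage_scale=1.0):
    """Dynamic power relative to one core at nominal f and V (P ~ C * V^2 * f)."""
    return cores * freq_scale * voltage_scale ** 2


def freq_scale_for_budget(cores, budget=1.0, voltage_scale=1.0):
    """Frequency scale that keeps `cores` cores inside the power budget."""
    return budget / (cores * voltage_scale ** 2)


for n in (2, 4):
    print(f"{n} cores at full frequency: {relative_power(n, 1.0):.0f}x power; "
          f"frequency must drop to {freq_scale_for_budget(n):.2f}x "
          f"to stay in budget")
```

With the voltage held fixed, frequency is the only knob, hence the 1/2 and 1/4 figures. If voltage could still be lowered along with frequency, the V^2 term would make the trade much kinder, which is the point of the remark about voltage scaling no longer working.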
You may be able to get away with more, simpler cores that switch fewer
transistors for the same instructions per second. Looks like we want a
metric like Cdyn/instruction, or, more abstractly, transistor switches
per instruction.
The guys presenting dark silicon were espousing lots of special purpose
hardware, that spends most of its time sitting turned off, power gated,
zero leakage, which is turned on only when you need that special
function. In their case they proposed reconfigurable; but reconfigurable
logic starts off heavily penalized in area, speed and power, by factors
approaching 16X at the high end. Methinks you want as much hardwired
special purpose logic as possible, augmented possibly by reconfigurable
for stuff you don't know about in advance.
But it also causes me to wonder whether we should not follow in the path
of IBM z-series, and make CISC instructions faster. Spend transistors,
e.g. on decimal logic. So long as those transistors can be power gated
when not in use.
Sun got burned by Rock wrt SpMT.
I'm still an SpMT fan, but AFAIK Haitham Akkary is the best hope for SpMT.
Overall, that two-page article seems fairly balanced. The first
half of the last sentence demonstrates this: "So go forth and
multiply your cores, but do it in moderation,"
There were some noticeable weaknesses in the arguments:
* The small paragraph on power efficiency did not consider the
benefit of _simpler_ cores--only the frequency-related benefit.
* The software development cost issue did not consider the side
benefit from refactoring the code. The greater refactoring
in part targeting many cores (ideally with the ability to
remerge divisions somewhat easily) would presumably cost much
more than a refactoring for just maintainability, extendability,
and the like; but in addition to allowing the new software to
work well on a broader range of systems the greater refactoring
may force certain questions to be asked which might not be
brought to mind by a simpler refactoring, potentially having
good side effects on modularity, conceptual clarity (recognizing
the desired semantics and not just one way of delivering the
appropriate result in a specific system), etc.
* There is a conflation of "wimpy-core" with weak nodes. Wimpy
core does not make necessary wimpy socket. (Wimpy core can
actually increase per-socket non-CPU resources: fewer power
connections allows for more I/O connections.) Wimpy core also
does not make necessary wimpy system.
* I also disagree with the implication of the statement "Once a
  chip's single-core performance lags by more than a factor of two
or so behind the higher end of current-generation commodity
processors, making a business case for switching to the wimpy
system becomes increasingly difficult because application
programmers will see it as a significant performance regression:
their single-threaded request handlers are no longer fast enough
to meet latency targets." It makes a double counting of costs:
refactoring costs are added to frustrations from lower performance
of non-refactored code.
Whether many-core beats multi-core depends considerably on the
workload and the specifics of the microarchitecture. It seems to
me--a mere technophile--that beyond heterogeneous cores, there
should be some effort to have flexibility in the configuration of
a core. Standard SMT addresses this somewhat (with much greater
emphasis on single-threaded performance), but generally fails to
exploit partitioning/clustering (Power7 does exploit clustering
to support four threads but I suspect does not support any data
cache partitioning, e.g.). Standard small cores seem to
de-emphasize fast and efficient inter-thread communication and/or
excessively penalize a lack of sharing (e.g., it might make
sense for a pair of cores to have their register files placed
back-to-back to allow register contents to be transferred quickly
between cores in a pair--it might even be practical, e.g., with
banking, to provide greater register bandwidth than can be or
will typically be consumed by the execution resources of the core
in order to accelerate inter-core communication). (I am a bit
disappointed that tag-serial cache access does not seem to be
provided for cases when latency can be hidden by multithreading.
Such could allow even greater efficiency. More intelligent
bank scheduling might also be more practical when latency can
be more easily hidden.)
(By the way, could width pipelining be used [productively] in
instruction fetch and decode? This seems particularly significant
for something like the Bulldozer microarchitecture if one could
provide twice the repeat rate for two threads as for one thread.
Some of the activity is directional--shifting to a jump target
within a fetch block [for block-fetch ICaches], parsing the
block of instructions [for variable length instructions],
checking dependencies--, but it seems likely that there would be
little actual benefit from such.)
Side note: It seems that a heavily pre-decoded block-fetch ICache
or trace cache could allow some interesting optimizations. One
might be able to accomplish what was sought with Itanium's
template bits--simplified routing. One might allow a RAT to have
many fewer ports by a smart arrangement of register IDs in the
block of instructions to avoid bank conflicts. Large fetch blocks
could be split such that the first half of a block is in the near
half of the cache, possibly with critical instructions moved to
the first half (width pipelining fetch). (Pre-renaming is
another obvious optimization. Replacing much of the L1 BTB with
portions of the instruction is yet another optimization--fetching,
e.g., 21 bits, with 16+ bits of information extracted from the
instruction stream, for every 64 bits of instructions.)
A side thought: if one has a ring interconnect and uses pipeline-
style parallelism in which different stages of the 'pipeline'
produce different localized hotspots, one might benefit from
shifting the pipeline around the ring. This might also apply
to imbalanced pipeline stages where one over-worked stage might
run in turbomode--generating localized-to-the-core excessive
heating (similar turbomode speed-up might be requested to
accelerate a task that has been enqueued longer than average;
cool cores might specifically seek such tasks, while hot cores
might prefer low-work/non-critical tasks [and work in a power-
saving mode] and typical warm cores might have
intermediate preferences). (This might apply more frequently
with specialized functionality which becomes more attractive
with greater constraints on power and fewer constraints on
area [dark silicon].)
It also seems that in the future more of the design effort will be
directed to the memory system (including L1 caches) and
interconnect.
>
> * There is a conflation of "wimpy-core" with weak nodes. Wimpy
> core does not make necessary wimpy socket. (Wimpy core can
> actually increase per-socket non-CPU resources: fewer power
> connections allows for more I/O connections.) Wimpy core also
> does not make necessary wimpy system.
>
That's true. Just put enough of them together and you can get any
Linpack score you want. Entire careers have been built on that
"insight." Careers? Entire massive boondoggles with huge org
charts, payrolls, and office acreage haven't needed to know any more.
Robert.
And Intel got burned by the Pentium 4 but that does not
mean that all the neat ideas (e.g., trace cache, width
pipelined ALUs, way prediction, SMT) should be tossed
aside. (At least SMT has made a comeback.)
I think Sun bet too much and boasted too much on a
microarchitecture that was a worse fit than alternatives
if certain new ideas were not well implemented (and
worth implementing well). Laying SpMT on top of a
T series processor would (I believe) have very little
downside; it might be comparable to the earlier
Pentium 4 SMT: turning it off is inexpensive even
relative to a non-SMT design, turning it on may
help some workloads. Of course, supporting SpMT would
substantially increase design and validation costs,
but even if the first version never got it to work
or be beneficial the experience could be used to
better understand the tradeoffs. Of course, Sun was
also under significant financial pressure; Oracle
might be able to "throw away" some money (I am not
optimistic about Sun-like innovation [much of which
was, unfortunately, just a financial drain] coming
out of Oracle. I would not be surprised if most of
the remaining Sun engineers are somewhat burned-out
and are likely to have significant culture clash--Sun
seemed to have an academic/engineering culture [like
the old Bell Labs?] while I receive the impression
that Oracle has a more profit-driven culture. It
would be nice if the best of both companies were
preserved, but I suspect Sun will just be swallowed
up.).
> I'm still an SpMT fan, but AFAIK Haitham Akkary is the best hope for SpMT.
I suspect some embedded system (with greater control
of the entire stack, ability to make specialized
optimization, and perhaps a greater willingness to
take risks) will implement such first, but the delay
may be several decades.
Processor by Victorinox?
rick jones
> But it also causes me to wonder whether we should not follow in the path
> of IBM z-series, and make CISC instructions faster. Spend transistors,
> e.g. on decimal logic. So long as those transistors can be power gated
> when not in use.
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
I think it is MORE than fair to say that Intel more than fully
recovered its costs in design and development of Pentium 4;
whereas it is not fair to say that Sun did likewise.
Mitch
Knowing the full story of Pentium 4 would be fascinating. For one
thing, I wonder to what extent Intel expected/hoped that people would
rewrite software targeting the architecture specifically.
Naive questions:
Would Netburst fly with current prediction/speculation tricks and
process tech? Or is that, in effect, what we're going to see with
Sandy Bridge?
Robert.
No.
Sandy Bridge is what P6 version 2.0 should have been.
It's the first new OOO core from Intel, designed by people who actually
believed in OOO. Pentium 4 was designed by people who really did not
want to be doing OOO, but were forced to because the simulators showed
what they really wanted to build did not work.
Although Sandy Bridge was marketed as a P6 family core, it really is a
step beyond.
That is true, but technical success and marketing success are often
unrelated.
Regards,
Nick Maclaren.
So, um, this would be a Web service that was so CPU bound that it managed
only "several" transactions per second even with the favoured beefy cores.
Isn't that a rather unusual case to pick as an example for one's argument?
~15 years is an _amazingly_ slow step in this business.
So (I have to guess because I don't know), the allusions to NetBurst
in the press must be the result of the inclusion of something with
the flavor of a trace cache in Sandy Bridge.
Robert.
If Intel had used a second (Pentium III based) Tick, it
could probably have recovered the cost of the shrink and
tweak in addition to the cost in design and development of
Pentium 4. With a more mature process than AMD (higher
yield and better device libraries?) and more capacity,
Intel could afford to produce something much worse than
the Pentium 4 and not be driven down below 50% of x86
revenue (I am guessing).
Sun was in a much less comfortable financial and market
share position. I do not think Sun had the design
resources of Intel to fix a weak microarchitecture (this
can be a good thing--I got the impression that there was
some throw-it-over-the-wall behavior in the Pentium 4
group which might have been less practical in a smaller
design group). Rock may also have been a more daring
microarchitecture than Pentium 4--it certainly seemed to
depend more on all the pieces fitting together at the
end (and seemingly not being able to recognize that
the puzzle was much more difficult than thought until
very near the end).
I do hope that Intel will be willing to try more
adventurous ideas, but I receive the impression that
the relative weakness of the Pentium 4 will poison
many of the ideas--which may be particularly bad
because the designers may actually have learned a
substantial amount about how to better use the ideas.
I admit my ignorance--and my ignorance is doubtless
vaster than I suspect--, but I think there might be
some bit of truth in the above.
But Intel didn't, and it didn't for sound business reasons. It was in
a megahertz race with AMD, and AMD had beaten Intel to 1GHz. The best
thing to sell to a customer is what the customer already wants to buy,
and what the customer wanted to buy was megahertz, and that's what
NetBurst delivered--for a while, at least.
Maybe someone can help me pin down the exact quote from an Intel
corner office: "It's the megahertz they buy,"--or something like that.
Robert.
In that environment, it becomes even more surprising that they never
tried to claim the internal (doubled) clock rate of the fast core as the
clock speed.
I.e. 6-7 GHz is a nice, big number.
In the P4 environment it mostly meant that the chip was amazingly fast at
getting from one stumbling block (slow core instruction, (missed)
branch, or those horribly slow shifts and muls) to the next.
"P4 - Waiting faster than any other cpu!"
>Robert Myers wrote:
>> But Intel didn't, and it didn't for sound business reasons. It was in
>> a megahertz race with AMD, and AMD had beaten Intel to 1GHz. The best
>> thing to sell to a customer is what the customer already wants to buy,
>> and what the customer wanted to buy was megahertz, and that's what
>> NetBurst delivered--for a while, at least.
>>
>> Maybe someone can help me pin down the exact quote from an Intel
>> corner office: "It's the megahertz they buy,"--or something like that.
>
>In that environment, it becomes even more surprising that they never
>tried to claim the internal (doubled) clock rate of the fast core as the
>clock speed.
They never designed it to sell it on clock-speed. That is a myth.
Sorry, but selling it on clock speed was making the rounds inside Intel
at the time.
It was not the only meme. And not the only rationale for Willamette, or
the fireball.
I suspect that selling the 2X clock speed would have made it so
embarrassingly obvious that clock speed is not the main determinant of
performance.
By the way, I once took a prominent IBM Power 4 engineer/architect out
to dinner. IIRC Power 4 was superpipelined, with the ALU taking 2
cycles. He was quite chagrined when I and Mike Haertel, the other host,
told him that according to our lights, and relative to Wmt, the Power4
really was running at only half the frequency he talked about.
I tend to use the fastest back to back dependent op as the
"microarchitectural" frequency. As compared to a circuit's clock speed.
I.e. IBM *was* selling the clock frequency with Power4.
Wmt's fireball was doing back to back dependent ops at 2X the advertized
frequency.
While Power4 was doing back-to-back dependent ops at 0.5X the advertized
frequency.
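To put rough numbers on that comparison, here is a minimal sketch of the arithmetic. The function name and the 1.5 GHz advertised clock are made up for illustration and are not actual Wmt or Power4 figures; only the 2X and 0.5X factors come from the post above.

```python
def microarchitectural_ghz(advertised_ghz, dependent_ops_per_cycle):
    """Rate of back-to-back dependent ops, in GHz.

    advertised_ghz: the marketed clock frequency.
    dependent_ops_per_cycle: how many dependent ALU ops complete per
    advertised clock (2.0 for a double-pumped ALU, 0.5 for a 2-cycle ALU).
    """
    return advertised_ghz * dependent_ops_per_cycle


# Hypothetical advertised clock, the same for both chips so that only
# the back-to-back factor differs.
ADVERTISED_GHZ = 1.5

wmt_ghz = microarchitectural_ghz(ADVERTISED_GHZ, 2.0)     # fireball: 2X
power4_ghz = microarchitectural_ghz(ADVERTISED_GHZ, 0.5)  # 2-cycle ALU: 0.5X

print(f"Wmt-like:    {wmt_ghz} GHz microarchitectural")
print(f"Power4-like: {power4_ghz} GHz microarchitectural")
print(f"ratio:       {wmt_ghz / power4_ghz}x")
```

At an equal advertised clock, the two factors put the chips a factor of four apart on this metric, which is why the advertised number alone says little about dependent-op throughput.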
>On 12/10/2010 6:24 AM, Bengt Larsson wrote:
>> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>
>>> Robert Myers wrote:
>>>> But Intel didn't, and it didn't for sound business reasons. It was in
>>>> a megahertz race with AMD, and AMD had beaten Intel to 1GHz. The best
>>>> thing to sell to a customer is what the customer already wants to buy,
>>>> and what the customer wanted to buy was megahertz, and that's what
>>>> NetBurst delivered--for a while, at least.
>>>>
>>>> Maybe someone can help me pin down the exact quote from an Intel
>>>> corner office: "It's the megahertz they buy,"--or something like that.
>>>
>>> In that environment, it becomes even more surprising that they never
>>> tried to claim the internal (doubled) clock rate of the fast core as the
>>> clock speed.
>>
>> They never designed it to sell it on clock-speed. That is a myth.
>
>Sorry, but selling it on clock speed was making the rounds inside Intel
>at the time.
Intel didn't spend billions of dollars on such a flimsy marketing
theory. The chip had to work.
>It was not the only meme. And not the only rationale for Willamette, or
>the fireball.
A meme is a rumor.