Thanks & cheers,
HK
--
Henrik Klagges - IT Analyst
hen...@strategypartners.com
PGPKey available on request
> Wasn't Samsung supposed to be mass-producing .25 21264s since December?
> Has that actually been happening? I'm having a kind of empty feeling in
> the stomach.
Compaq is apparently shipping the DS20 and XP1000 in volume -- both of
which use the 21264. I wouldn't know if Intel or Samsung is fabbing 'em.
-- g
So far, the MHz ratings I've seen for the CPUs shipped in the machines
all looked like they were from Intel.
Cheers,
Intel AND Samsung are fabbing EV6. The mix right now is about 50-50. I
would expect to see Samsung's share increase over time.
I do not know who's fabbing the 667MHz EV67 parts that are being
benchmarked right now.
EV67? I thought EV68 was just around the corner... does EV67 make that much
sense if EV68 is less than a year away?
--Andre
> >I do not know who's fabbing the 667MHz EV67 parts that are being
> >benchmarked right now.
>
> EV67? I thought EV68 was just around the corner... does EV67 make that much
> sense if EV68 is less than a year away?
Any public sources on what the differences between EV6/EV67/EV68 are? Just
process shrinks, or also slight internal improvements (such as 21164/21164A/
21164PC)?
Jan
I don't think any have been publicly disclosed yet but you know
how engineers like to tinker... ;-)
>
> Jan
--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.
On slides at DECUS, EV68 is shown at 60 SpecInt95. I don't
think you can get there by cranking the clock from 525/575 to
1 GHz. Seems to be larger caches, etc. After all, the 21364
slides show 21364 at 70 SpecInt95 and 100 SpecFp95. You know,
looking at Doug Siebert's comments regarding Merced. "HP
engineers downplaying Merced performance." "FP looks good."
Just how pitiful could Merced's integer performance be? Consider:
- At Intel Devel Forum, they talk up fp (see Register
slides)
- Linley "Merced" Gwennap talks up Merced fp in most recent
MPR newsletter.
Back to Doug's comments:
Merced performance was backhandedly downplayed -- "Intel has said that
Foster will be the same speed as Merced on some things, and PA-RISC is
usually around 2x faster than x86, so draw your own conclusions".
We know Merced will have decent fp performance ("world class,"
according to Intel's slides). But remember how Grandma would
always have us wait to snap the picture so she could
"show her good side"? Same with Merced.
Let's play a little game: suppose PA-RISC is only 1.7 times faster than
Merced. We know Foster will be the same speed [performance] on some
things. Well, it certainly isn't fp, given the IA-32 architecture it
is hobbled with. IA-32 integer is pretty good, still neck and neck
with some of the next-gen chips out there (R12000 for instance, ouch).
Doug mentions the 8500 at 32 integer. Let's crank that and add some
compiler improvement: call it 30% at best (probably
not, but let's be generous). That puts the 8600 at about 42. Okay.
Let's give a 40% improvement going to the 8700 (maybe not, but let's
be generous). What are we at? 58 SpecInt95? Pretty good, even
for 2001. Let's work backwards from 60 SpecInt95 for the 8700 and
say it is 1.7 times better at integer than Merced/Foster. At best
that leaves Merced in the 30 range. This may be an idle exercise,
but what do we have? Anyone know how many integer units Merced has
yet?
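The back-of-envelope projection above is easy to replay. A minimal sketch in Python, using only the figures guessed at in the post (32 SpecInt95 for the 8500, the 30% and 40% generational gains, and the 1.7x PA-RISC advantage); the function and variable names are made up for illustration:

```python
# Replay of the rough SpecInt95 projection; every input is the post's guess.
PA8500_SPECINT = 32.0      # figure attributed to Doug Siebert
GAIN_8600 = 0.30           # generous clock + compiler improvement
GAIN_8700 = 0.40           # generous improvement for the next generation
PA_RISC_ADVANTAGE = 1.7    # assumed PA-RISC / Merced integer ratio


def project(base, gain):
    """Scale a benchmark score by a fractional improvement."""
    return base * (1.0 + gain)


pa8600 = project(PA8500_SPECINT, GAIN_8600)
pa8700 = project(pa8600, GAIN_8700)
merced_implied = 60.0 / PA_RISC_ADVANTAGE   # working back from a round 60

print(f"PA-8600 ~ {pa8600:.1f}, PA-8700 ~ {pa8700:.1f}")
print(f"Implied Merced integer score ~ {merced_implied:.1f}")
```

The implied Merced number moves a lot with the assumed ratio, which is the weak link in the whole exercise.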
Other peripheral evidence is Register articles highlighting compiler
difficulties.
If Merced integer performance was decent, we would hear more about it.
The mythical "50 SpecInt95" on intro hasn't been bandied about in
over 7 months as far as I can tell, anyone know different?
One final point, 8700 would have to be at 100 SpecInt95 to be
twice as fast as Merced's projected 50 SpecInt95.
Rob
Don't get too complacent. From the "IA64 Architecture Innovations"
presentation by Crawford and Huck recently posted at "The Register"
they seem to imply Merced can issue at least three load operations
per cycle and three integer comparisons and a branch per cycle.
The IA64 enhancements like predication have real and substantial
benefits not fully replicable with CMOV. If the compiler delivers
and static scheduling doesn't hinder too much then Merced may
really surprise on integer performance.
>
> Back to Doug's comments:
>
> Merced performance was backhandedly downplayed -- "Intel has said that
> Foster will be the same speed as Merced on some things, and PA-RISC is
> usually around 2x faster than x86, so draw your own conclusions".
The huge psychological impact of the surprise arrival of the 0.35 um
200 MHz Pentium Pro on the RISC camps and computer market in general
probably hasn't been lost on Intel. Think of it like financial results:
some downplay expectations to pleasantly surprise the analysts. The
announcement of Merced will be the most important Intel has undertaken
in at least a decade and they definitely want positive spin out of it.
Hence the huge wailing Greek chorus of how terrible things are going
on Merced. Maybe Merced really does disappoint the early expectations
of top Intel and HP planners but I bet it isn't nearly the basket case
that some have made out. Intel would also have reason to downplay the
integer performance of Merced so as not to chill the sales of Xeon x86
servers.
As in warfare, woe to the planners who disdainfully underestimate
the capabilities and intentions of their opponent; they usually suffer
the biggest and worst surprises of all.
[snip]
All opinions strictly my own.
> The IA64 enhancements like predication have real and substantial
> benefits not fully replicable with CMOV.
Well, our friend Mayan doesn't buy it... perhaps he is risking
a bit of his professional reputation by making statements like
this to comp.arch:
5. The "code straightening" provided by predication does give VLIWs an
advantage by removing taken branches from the instruction stream,
and thereby reducing the number of fetch bubbles; it is, however,
minor. Further, Alpha cmove-like instructions will capture most of
that advantage.
Seems to be a difference of opinion there.
> If the compiler delivers
> and static scheduling doesn't hinder too much then Merced may
> really surprise on integer performance.
>
And that seems to be the whole point of Mayan's OOO vs. VLIW piece.
His "executive summary" intro states:
"The highest performance implementation of any general purpose ISA, including
VLIWs, will be an out-of-order implementation. IMHO, the "average" performance
of an out-of-order implementation of a VLIW ISA will be +/-5% of the
performance of an equivalent RISC ISA out-of-order implementation. Actually, my
bias is that an OOO VLIW will slightly under-perform an equivalent OOO RISC."
>>
>> Back to Doug's comments:
>>
>> Merced performance was backhandedly downplayed -- "Intel has said that
>> Foster will be the same speed as Merced on some things, and PA-RISC is
>> usually around 2x faster than x86, so draw your own conclusions".
>
> The huge psychological impact of the surprise arrival of the 0.35 um
> 200 MHz Pentium Pro on the RISC camps and computer market in general
> probably hasn't been lost on Intel. Think of it like financial results,
> some downplay expectations to pleasantly surprise the analysts.
And Intel is masterful at this. Care to comment on how PPro got
such fantastic results?
> The announcement of Merced will be the most important Intel has undertaken
> in at least a decade and they definitely want positive spin out of it.
Absolutely. So accentuate the positive, hence a whole host of
floating point blather. As if big business is interested in
floating point. Crank the bandwidth and a 21264 gets much better
bandwidth in a DS20 compared to the 8400.
Likewise, The Register slides talk of "world class" database
numbers.. sure, large high-speed, high-bandwidth caches.
> Hence the huge wailing Greek chorus of how terrible things are going
> on Merced. Maybe Merced really does disappoint the early expectations
> of top Intel and HP planners but I bet it isn't nearly the basket case
> that some have made out.
Where are the integer highlights? What if K7 comes close to Merced
on integer? I think the SmartReseller piece is a nice confirmation
of compiler difficulties. I bet we hear similar stories soon.
The comments by the HP people "downplaying" Merced performance
are just another example of circumstantial evidence... never enough
to convict. However, we edge closer and closer to reliable sources
B-).
> Intel would also have reason to downplay the integer performance of Merced so
> as not to chill the sales of Xeon x86 servers.
Not buying that line at all. Merced volume ships in a year and a half.
Boxes end of next year. Maybe if Merced was 6 months out, I would
buy into that argument.
> As in warfare, woe to the planners who disdainfully underestimate
> the capabilities and intentions of their opponent; they usually suffer
> the biggest and worst surprises of all.
>
Yeah. And if in battle you have an indefensible weakness that
can't be shored up... and your only hope is to keep the enemy
from finding out about it, you MIGHT be fortunate if you only
have to hold out for a short while. If however your weakness
is there for many months, your enemy may find out and exploit
that weakness.
Rob
Rob Young wrote:
> Where are the integer highlights? What if K7 comes close to Merced
> on integer?
My guess is that the K7, given its multiple integer units and a chance to
use them, will beat the Merced on integer. Of course, the K7 will still be
32bit, and AMD still has to *fab* the thing in volume and on time.
> The comments by the HP people "downplaying" Merced performance
> are just another example of circumstantial evidence... never enough
> to convict. However, we edge closer and closer to reliable sources
> B-).
The one interesting thing about their statements was that they were
saying that since they weren't allowed to talk about Merced performance,
they could only reference what Intel has already publicly stated. So if
the "Merced surprise" theory was correct, they could just be following
the agreed upon strategy.
The fact that HP has the PA-8600 through PA-8800 in various stages of
development says to me that either:
1) Merced won't be on its (already slipped) schedule (it's still due for
samples this summer, so I doubt this one is too likely)
2) Merced won't be up to snuff performance-wise (current belief of most)
3) Merced will be so expensive that HP wants some chips to maintain an
affordable and performance-competitive workstation line (i.e., the
statements that "Merced is for servers")
If 3) turns out to be correct, the only possible reason it would be too
expensive for workstations, would be if Intel wants to charge that much
for it. They can only get away with that for a completely new and untested
architecture if the performance advantages over all the existing chips
are compelling, for at least some worthwhile market segment where servers
rule. So either TPC or big FP.
HP built the PA-8500 the way they did because they wanted to bring the
expensive PA-8xxx chipset to the low end workstations (which they'll
resize real soon now with the release of a completely new workstation
line). HP did this despite the 476 sq mm die and 140 million+ transistors.
IMHO, Merced will have more transistors than this (probably same cache
size, but think of all the functional units, and all the registers and
read-write ports) since it'll be done in .18u. Perhaps part of the reason
Intel agreed to fab the 8500 was as practice for Merced?
--
Douglas Siebert Director of Computing Facilities
douglas...@uiowa.edu Division of Mathematical Sciences, U of Iowa
I plan to live forever, or die trying.
I would not describe the elimination of roughly half of conditional branches
as "minor". Besides predication, IA64 has a bunch of little tricks and
features that in isolation look minor and unimportant. But taken altogether
these things can work synergistically to open larger blocks and superblocks
to optimization and increase the opportunities to reschedule code. Two tricks
that give you 1% performance improvement each in isolation might give you 5%
when taken together, and so on.
The open literature on this is quite limited, but the paper "A Comparison of
Full and Partial Predicated Execution Support for ILP Processors" by Mahlke,
Hwu et al. does address predication vs CMOV directly. Their result for an
8-issue machine with one branch per cycle and a 2-cycle mispredict penalty was
that conditional move buys you 30% higher performance, but full predication
buys you another 30% on top of that. Mind you, their dynamic branch prediction
was rather lame (1K entry BTB, 2 bits per entry), but their 2-cycle penalty
is very short for modern CPUs and so understates the advantage of full
predication. Also, your mileage will vary with the extent to which your integer
code is dominated by frequent and hard-to-predict branches.
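Read literally, the two 30% gains compound rather than add. A small Python check of that arithmetic, using only the percentages quoted above (the variable names are made up for illustration, and whether the second 30% is really multiplicative is an assumption about how the results were stated):

```python
# Compounding the two gains quoted from the Mahlke/Hwu comparison.
cmov_gain = 0.30           # conditional move over the branching baseline
predication_gain = 0.30    # full predication on top of conditional move

cmov_speedup = 1.0 + cmov_gain
full_speedup = cmov_speedup * (1.0 + predication_gain)

print(f"cmov vs baseline:             {cmov_speedup:.2f}x")
print(f"full predication vs baseline: {full_speedup:.2f}x")
```

On that reading full predication ends up roughly 1.7x the baseline, for that machine model and that benchmark mix only.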
The question of out of order vs static scheduling is a big factor in
evaluating Merced but nothing stops future IA64s from using OOO (as I
suspect McKinley does). The other question is whether Merced has wide
enough issue and robust enough compilers to show off the architectural
benefit of predication at the time of product announcement.
[snip]
> > The huge psychological impact of the surprise arrival of the 0.35 um
> > 200 MHz Pentium Pro on the RISC camps and computer market in general
> > probably hasn't been lost on Intel. Think of it like financial results,
> > some downplay expectations to pleasantly surprise the analysts.
>
> And Intel is masterful at this. Care to comment on how PPro got
> such fantastic results?
I don't think the results were so much fantastic as unexpected. Most
observers didn't realize how effectively Intel could apply
superscalar and out-of-order implementation principles to x86, or that it
could deliver them so soon. The extent to which PPro relied on DEC IP, if
that is what you are implying, is quite contentious (not to mention
immaterial to this discussion), and there is no doubt at least one frequent
poster would take serious offense at that implication :-)
[snip]
> > Intel would also have reason to downplay the integer performance of Merced so
> > as not to chill the sales of Xeon x86 servers.
>
> Not buying that line at all. Merced volume ships in a year and a half.
> Boxes end of next year. Maybe if Merced was 6 months out, I would
> buy into that argument.
Good point. But execution of business strategies can have inertia/latency
and maybe the left hand didn't realize how late the right hand is.
>
> > As in warfare, woe to the planners who disdainfully underestimate
> > the capabilities and intentions of their opponent; they usually suffer
> > the biggest and worst surprises of all.
> >
>
> Yeah. And if in battle you have an indefensible weakness that
> can't be shored up... and your only hope is to keep the enemy
> from finding out about it, you MIGHT be fortunate if you only
> have to hold out for a short while. If however your weakness
> is there for many months, your enemy may find out and exploit
> that weakness.
Weaknesses are just like strengths in that you want to hide both of
them from your opponents. Intel's utter silence on the entire topic
of Merced integer resource/performance is so pervasive and "deafening" I
would say that it sticks out like a sore thumb. But the big emphasis
of the University of Illinois work with compiler innovation for EPIC
like architectures was to attack the problem of "locked-in" ILP in
branch dominated integer code. The only question is whether Intel has
seriously screwed up. I would caution Alpha bigots (a group in which
I include myself) to remember the lesson of the PPro intro.
I'd be most happy to be proven wrong on this point by poor Merced
integer performance come disclosure time, but I'd still consider
it surprising. And I'd consider it downright astonishing if McKinley
had any integer performance shortcomings at all.
>
> Rob
>In article <36EBF5F5...@igs.net>, Paul DeMone <pde...@igs.net> writes:
>> Rob Young wrote:
>> [snip]
>>> Just how pitiful could Merced's integer performance be? Consider:
>>>
>>> - At Intel Devel Forum, they talk up fp (see Register
>>> slides)
>>>
>>> - Linley "Merced" Gwennap talks up Merced fp in most recent
>>> MPR newsletter.
>>
>> Don't get too complacent. From the "IA64 Architecture Innovations"
>> presentation by Crawford and Huck recently posted at "The Register"
>> they seem to imply Merced can issue at least three load operations
>> per cycle and three integer comparisons and a branch per cycle.
>>
> Who's complacent? Just seems to be a whole host of grumbling
> about Merced compiler issues whether it be SmartReseller or The
> Register.
>
>> The IA64 enhancements like predication have real and substantial
>> benefits not fully replicable with CMOV.
>
> Well, our friend Mayan doesn't buy it... perhaps he is risking
> a bit of his professional reputation by making statements like
> this to comp.arch:
>
>5. The "code straightening" provided by predication does give VLIWs an
> advantage by removing taken branches from the instruction stream,
> and thereby reducing the number of fetch bubbles; it is, however,
> minor. Further, Alpha cmove-like instructions will capture most of
> that advantage.
>
> Seems to be a difference of opinion there.
>
>> If the compiler delivers
>> and static scheduling doesn't hinder too much then Merced may
>> really surprise on integer performance.
>>
> And that seems to be the whole point of Mayan's OOO vs. VLIW piece.
> His "executive summary" intro states:
Neither Mayan nor anyone else ever replied to my set of questions about
just what was causing the limited ILP their studies revealed, and the
extent to which such ILP might improve with such innovations as
predication on every op, large numbers of loads/stores per cycle, a trace
cache, data speculation etc.
It is quite possible that Mayan's numbers were correct, insofar as the
models they investigated, but that EPIC utilizes features that are not in
those models.
Maynard
> Neither Mayan nor anyone else ever replied to my set of questions about
> just what was causing the limited ILP their studies revealed, and the
> extent to which such ILP might improve with such innovations as
> predication on every op, large numbers of loads/stores per cycle, a trace
> cache, data speculation etc.
>
> It is quite possible that Mayan's numbers were correct, insofar as the
> models they investigated, but that EPIC utilizes features that are not in
> those models.
>
I have no doubt that a typical IA-64 chip could easily have higher ILP
than anything else out there. The big issue is what the UILP is,
i.e. how much of the ILP goes to overcoming bottlenecks and how much
actually does real work?
It's possible to get UILP from programs by using the techniques that
are described as VLIW (Intel calls it EPIC). It is pretty trivial to
do a 256-entry case statement in a few cycles, for instance. The
issue is how much hardware you will have to throw at it. There are
certain limits to what can feasibly be designed into a CPU. The goal
for the IA64 compilers will be to keep the cost of getting the increased
UILP low enough that they actually get a performance boost,
i.e. ILP = UILP + GILP (garbage ILP).
Performance increases iff ILP <= machine ILP in most cases.
In other words, it won't matter how wide Merced or McKinley are if it
takes so much GILP to reveal the UILP that the GILP + UILP is greater
than the width.
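One way to make the UILP/GILP bookkeeping concrete is a toy model. The sketch below is an assumption-laden illustration, not anything from Intel or Compaq: it supposes the machine sustains at most `width` operations per cycle and that real-work and garbage operations compete for those slots in proportion. The function name and the example numbers are hypothetical.

```python
def useful_throughput(uilp, gilp, width):
    """Real-work operations per cycle under a crude slot-sharing model.

    uilp:  operations per cycle that do real work
    gilp:  extra ("garbage") operations issued to expose that work
    width: maximum operations the machine can issue per cycle
    Assumption: if uilp + gilp exceeds width, every operation is slowed
    by the same factor, so real work scales by width / (uilp + gilp).
    """
    total = uilp + gilp
    if total <= width:
        return uilp
    return uilp * width / total


# Hypothetical 6-wide machine: 4 real ops plus 2 overhead ops just fits.
print(useful_throughput(4, 2, 6))
# The same real work needing 6 overhead ops no longer fits in the width.
print(useful_throughput(4, 6, 6))
```

Under this model the second case delivers less real work per cycle than the first even though its raw ILP is higher, which is the point of the ILP = UILP + GILP split.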
Aaron Spink
not speaking for Compaq
Papers on many of these studies are publicly available, to
anyone with access to a good library and/or the web. The papers
probably don't go into the detail you want (which papers do?),
but they are probably a good place to start.
Clearly, HP and Intel are extremely slow about publishing
the results of their studies, but there is a fair body of
publicly available info on ILP and VLIW compilation.
> and the extent to which such ILP might improve with such
> innovations as predication on every op, large numbers of
> loads/stores per cycle, a trace cache, data speculation etc.
I don't want to get into an unwinnable argument about the
value or otherwise of these features. I would, however, draw
your attention to the fairly large number of papers which
all have titles along the line of "Limits of parallelism".
One such paper is:
http://suif.stanford.edu/~bwilson/papers/isca92/paper.html
but there are lots of others.
The machines in these studies are not limited by architected
branch misprediction penalties, instruction fetch, numbers
of registers, number of memory ops per cycle, issue width,
cache misses, memory disambiguation, etc. Despite the small
number of limitations, the measured ILP speedup on integer code
is low for all but the most unrealistic machines.
If machines with very few limitations are getting small
amounts of ILP speedup, it is not difficult to imagine
why machines with greater limitations also get small
ILP speedups (In fact I'm often impressed by just how well
they do).
There are a few possibly conceivable machine models that
crude simulations show to have the _potential_ to get
a fair amount of ILP out of integer code. But I don't believe
they will come to market soon, and the ILP speedup may
disappear in a real implementation. (Cue for millions of
people to write in about their favorite machine model and
how it will come to market and sweep all competitors away
in the next six weeks).
If you really want to know why ILP speedups for integer
programs are limited and you aren't satisfied with the
answers in the available literature, you could do a lot
worse than do some of your own experiments. An execution
driven simulator for _simple_ ILP machine models (of the
type found in these papers) is easy enough to write. I have
done some of these types of studies using Atom, and it wasn't
too much work.
Alternatively, you could try one of the existing execution
driven simulators for simple ILP machines, or for more
realistic results (but fewer facilities for trying out
all sorts of bizarre machine models) a machine simulator
such as SimpleScalar.
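To give a flavour of how small such a _simple_ limit-study model can be, here is a minimal Python sketch. It is not the Atom-based tool mentioned above and not any published simulator; the trace format (a list of destination/source register tuples) and the function name are hypothetical, and the machine is idealized: unit latency, perfect branch prediction, perfect renaming, unlimited issue width, and an optional instruction window with in-order retirement.

```python
def dataflow_ilp(trace, window=None):
    """Estimate ILP for a trace on an idealized dataflow machine.

    trace:  list of (dest, sources) tuples in program order; dest may be None
    window: number of instructions in flight at once (None = unlimited)
    Only true read-after-write dependences constrain the schedule.
    """
    ready = {}     # register -> cycle at which its latest value is available
    retire = []    # cycle by which instruction i and all older ones are done
    for i, (dest, sources) in enumerate(trace):
        start = max((ready.get(s, 0) for s in sources), default=0)
        if window is not None and i >= window:
            # A slot frees up only when instruction i - window has retired.
            start = max(start, retire[i - window])
        done = start + 1
        retire.append(max(done, retire[-1]) if retire else done)
        if dest is not None:
            ready[dest] = done
    cycles = retire[-1] if retire else 0
    return len(trace) / cycles if cycles else 0.0


trace = [
    ("r1", []),            # r1 = constant
    ("r2", []),            # r2 = constant
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", ["r3"]),        # r4 = r3 * 2
    ("r5", ["r1"]),        # r5 = r1 - 1
    ("r6", ["r4", "r5"]),  # r6 = r4 + r5
]
print(dataflow_ilp(trace))             # unlimited window
print(dataflow_ilp(trace, window=1))   # one instruction at a time
```

A real study would feed in traces captured from an instrumented binary and add the limits of interest (finite issue width, branch mispredictions, memory disambiguation) one at a time to see which one eats the parallelism.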
Note, as Josh Fisher has pointed out, you can't believe
these studies as the absolute truth and last word on ILP.
Future compilers may transform the program in some currently
unimaginable way to increase the ILP. (Cue for everyone to
write in about their favorite compiler technique and how
it will hugely increase ILP tomorrow (or yesterday :-) )).
Finally, I'm fully aware of the shortcomings of these studies,
and I'm not defending them in any way. They provide some guidance,
but they are not an exact science. Lots of the assumptions are
unrealistic.
And really finally, apologies if you have already read these
studies and done lots of your own. It is kind of irritating
when everyone points out the obvious, but it is not always
clear what is obvious to whom (although some things are clearly
obvious to almost everyone :-) ).
Best regards,
David.
>Maynard Handley wrote:
>>
>> Neither Mayan nor anyone else ever replied to my set
>> of questions about just what was causing the limited ILP
>> their studies revealed,
>
>Papers on many of these studies are publicly available, to
>anyone with access to a good library and/or the web. The papers
>probably don't go into the detail you want (which papers do?),
>but they are probably a good place to start.
You missed my point. I was not asking for references.
My point was that while Mayan claims there is very limited ILP available
(and that EPIC is thus overkill), other studies have found substantial
amounts of ILP available. Mayan stated his belief in limited ILP without
giving details of the simulations that had led to that belief. Without
info about those simulations, we can't state why, perhaps, his numbers are
not relevant to EPIC.
Maynard
> You missed my point. I was not asking for references.
> My point was that while Mayan claims there is very limited ILP available
> (and that EPIC is thus overkill), other studies have found substantial
> amounts of ILP available. Mayan stated his belief in limited ILP without
> giving details of the simulations that had led to that belief. Without
> info about those simulations, we can't state why, perhaps, his numbers are
> not relevant to EPIC.
I don't really think the question is whether there is ILP. The issue
is what is needed to extract that ILP. As I have said before, it will
not be hard to create an IA64 machine that has high ILP. This doesn't
mean it will be any faster architecturally than a single issue
in-order processor.
The problem lies in the fact that to extract more and more ILP from a
program you invariably have to execute more instructions than are
needed. As the search for higher ILP goes on, this ratio will only get
bigger. Whether you predicate and prefetch by software, or
you speculate and reorder by dynamic means, you will still end up
executing more instructions than needed. There is an argument to be
made that you actually are on a lower ratio curve with SSOOO than with
in-order (IO) IA64-style architectures.
So there are studies that find that for some programs you can execute
infinite instructions in parallel. The question to ask about all
limit studies is what is needed to reach the limit. In most
cases, even the conservative limit studies are going to be absolutely
undoable for at least 15-20 years.
So in summary, there can both be limited and unlimited ILP within a
program. The limited ILP is the max you can actually achieve and the
unlimited ILP is the extra work required to reach that limit.
Aaron Spink
not speaking for Compaq
> Hmmm.... This seems even more aggressive than the mythical 40
> SPECint95 at 600 MHz. Too bad the real thing only delivers
> 26 at 500. That scales up to about 35 at 667 MHz -- still an
> extraordinarily fast processor, but 35 is not 50.....
>
Just a point of correction. It's 27.7 at 500. What are you trying to
do, cheat us out of an extra 6%? :)
That extra 6% could be the difference between EV6 and USIII. :)
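For anyone checking the numbers, the scaling in the quoted post and the size of the correction work out as below. This is a small Python sketch assuming plain linear scaling with clock, which is optimistic since memory does not speed up with the core; the helper name is made up.

```python
def scale_by_clock(score, mhz_from, mhz_to):
    """Linear clock scaling of a benchmark score (an optimistic assumption)."""
    return score * mhz_to / mhz_from


quoted = 26.0      # figure in the quoted post, at 500 MHz
corrected = 27.7   # figure given in the correction, at 500 MHz

print(f"{scale_by_clock(quoted, 500, 667):.1f}")      # 34.7
print(f"{scale_by_clock(corrected, 500, 667):.1f}")   # 37.0
print(f"{(corrected / quoted - 1) * 100:.1f}%")       # 6.5%
```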
Aaron Spink
not speaking for Compaq
Have you any ideas on hybrid value predictors? I have read a paper
recently on hybrid value predictors combining 2-level adaptive value
predictors with stride predictors..... Implementation seems feasible
in the 300~500 million transistor arena....
The paper is: Techniques for performing highly accurate data value
prediction.
Rahul Sathe, Kai Wang, Manoj Franklin
Microprocessors and Microsystems Journal
This seems promising for helping alleviate the Data Flow limit, while
Trace Caches and Predication help with Control Flow limits......
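For readers who haven't met the idea, here is a toy Python sketch of a hybrid value predictor. It is emphatically not the design from the paper cited above (which pairs a 2-level predictor with a stride predictor); as a simplification it pairs a last-value component with a stride component and picks between them with a 2-bit saturating counter per instruction address. The class, its tables, and the chooser policy are all assumptions for illustration.

```python
class HybridValuePredictor:
    """Toy per-instruction hybrid of last-value and stride prediction."""

    def __init__(self):
        self.last = {}     # pc -> most recent value produced
        self.stride = {}   # pc -> difference between the last two values
        self.choice = {}   # pc -> 0..3; values >= 2 favour the stride guess

    def predict(self, pc):
        if pc not in self.last:
            return None    # no history yet, so no prediction
        if self.choice.get(pc, 2) >= 2:
            return self.last[pc] + self.stride.get(pc, 0)
        return self.last[pc]

    def update(self, pc, actual):
        if pc in self.last:
            last_guess = self.last[pc]
            stride_guess = self.last[pc] + self.stride.get(pc, 0)
            counter = self.choice.get(pc, 2)
            # Train the chooser only when exactly one component was right.
            if stride_guess == actual and last_guess != actual:
                counter = min(3, counter + 1)
            elif last_guess == actual and stride_guess != actual:
                counter = max(0, counter - 1)
            self.choice[pc] = counter
            self.stride[pc] = actual - self.last[pc]
        self.last[pc] = actual


predictor = HybridValuePredictor()
hits = 0
for value in range(0, 40, 4):          # a loop induction variable: 0, 4, 8, ...
    if predictor.predict(0x40) == value:
        hits += 1
    predictor.update(0x40, value)
print(hits, "of 10 values predicted")
```

The first two values are needed to learn the stride, after which a regular induction variable is predicted every time; irregular data is where the 2-level component in the real proposals is supposed to earn its keep.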
A penny for your thoughts
Chris Gomez
This seems unlikely. SPECint95 on the R10000 only varies 5% when the
L2 speed is changed from 1:1 down to 2:1. Similar results seem to hold
for the Intel Pentium II Xeon systems.
Without going into too much detail, SPECint95 is not L2 bandwidth-
limited, and the OOO cores of the R10000 and Pentium II can tolerate
latencies in the 10 cycle range. I would expect the 21264 to also
show similar insensitivity of SPECint95 with respect to small changes
in the L2 cache latency.
--
--
John D. McCalpin, Ph.D. Principal Scientist
System Architecture Group http://reality.sgi.com/mccalpin/
Silicon Graphics, Inc. mcca...@sgi.com 650-933-7407
> The one interesting thing about their statements was that they were
> saying that since they weren't allowed to talk about Merced performance,
> they could only reference what Intel has already publicly stated. So if
> the "Merced surprise" theory was correct, they could just be following
> the agreed upon strategy.
I don't know that you can read a lot into the silence.
Intel has a policy that they will not comment on performance numbers
unless someone else has done the work. This sounds a bit bizarre and
lazy, but there are some sound (for Intel) reasons for it: if they
claim some performance, and someone else buys their chips but
can't duplicate the performance, then deep-pockets Intel gets sued.
If they just say that someone else has reported these numbers, it's
not a problem.
>
> The fact that HP has the PA-8600 through PA-8800 in various stages of
> development says to me that either:
>
> 1) Merced won't be on its (already slipped) schedule (it's still due for
> samples this summer, so I doubt this one is too likely)
> 2) Merced won't be up to snuff performance-wise (current belief of most)
> 3) Merced will be so expensive that HP wants some chips to maintain an
> affordable and performance-competitive workstation line (i.e., the
> statements that "Merced is for servers")
Or it says that, even if Merced is successful, timely, and price and performance
competitive, many of their customers will continue to buy PA-8xxx because it will be
cheaper than migration.
Again, don't attribute to incompetence what you can attribute to common sense.
Don't forget the DS20 and XP1000 (and probably the 8400 board too)
are using commodity burst SRAMs in the L2 cache. The 21264 also
supports late write SRAMs and more importantly, DDR SRAMs. We
all know that SPECint95 doesn't have a huge working set footprint
but it sure as heck doesn't fit in 64K L1's (especially with loops
and data structures padded out to octaword alignment).
I would expect some of the SPECint95 fuzziness surrounding the
21264 is due to early numbers assuming the best L2 hardware
possible. In contrast SPECfp95 sniffs disdainfully at 4MB L2s
and looks to main memory performance for redemption. ;-)
>To me it seems like a CMOV will always introduce a bubble,
Why? IMHO a P6 CMOV may be implemented as an operation with 2 input
registers (the previous value of the target register and the source),
with the selection done by a 3rd input operand, the condition. Data
paths and scheduling are comparable to ADC. The result is always
available 1 cycle later, independent of the condition.
Of course you have to wait for the condition, but this is the right
time to wait for it, no matter if with or without OOO. So where is the
bubble?
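The three-input view of CMOV can be sketched in a few lines. This is a Python model of the dataflow only, with made-up function names; Python itself still branches internally, so it shows the shape of the dependences and nothing about timing on a real P6.

```python
def cmov(condition, source, old_dest):
    """Conditional move modelled as a three-input select.

    A result is always produced: either the new source value or the
    previous value of the destination register, so no branch is needed.
    """
    return source if condition else old_dest


def clamp_with_branch(x, limit):
    if x > limit:          # control dependence: a branch to predict
        x = limit
    return x


def clamp_with_cmov(x, limit):
    flag = x > limit               # the compare produces the condition operand
    return cmov(flag, limit, x)    # data dependences only


for value in (3, 10, 25):
    print(value, clamp_with_branch(value, 10), clamp_with_cmov(value, 10))
```

Both versions compute the same thing; the second just has to wait for `flag` the way any instruction waits for an input operand.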
On the IA64, a "conditional move" makes more sense when viewed as a
"select" operation. Especially if there is no renaming.
>As I stated above, unless Merced has a way to implicitly merge the two
>results of a predicated branch, you will have to issue a predicated MOV
>as the last instruction of one of the paths.
But only if there actually are two data paths which join later.
Occasionally the conditionally executed forks have their own data
paths (within the limited scope where predication makes sense).
Anyway, as far as I understand the design target, IA64 tries to reduce
the necessary amount of cleverness in hardware and delegates it to
software, thus may occasionally need some more instructions to do the
same job but may have more resources available to execute them. Just
like the CISC to RISC step tried some time ago.
Gruss, Andreas
Did you receive my first email? I mailed it from my
school account, ch...@mil.ufl.edu.
> I think you had sent me earlier email, which I accidentally
> deleted in my sleep deprived state (1 week old baby).
Well, it was just an introduction of myself, although rather lengthy.
I wrote it and contacted you because I found your articles on comp.arch
and then on Wisconsin's Architecture Site.... So I visited your web page
and agreed with many of your viewpoints, especially on Higher Education.
I will forward it to you again....
> Anyway,
>
> If data value prediction works, things like 2 level will be used.
> I'm not a big fan of data value prediction, however. If it can
> fit in a small table, then why use prediction - execute it - and if it can't
> fit in a small table, then execution is the only way.
Please explain what you mean by a small table. I was thinking that
dependence chains limit the ILP and that many instructions which couldn't
even be issued, much less dispatched, could now be issued and dispatched
OOO with data prediction....
> I also diss trace caches. I'm allowed - I think that I invented the term
> and probably preceded Peleg and Weiser in inventing the concept.
> I originated them while (a) trying to do MCD for my MS - the basic idea
> was to do long cache lines where I could detect hammocks, and
> (b) trying to do a decoded i-cache for the M68K.
First what is MCD, I am curious? And what is a hammock?
You were working on predecoding 68K instructions before entry
into the ICACHE? That would allow a lot of scheduling up front
before the fetch stage, wouldn't it? I refer to Miss Path Scheduling...
> Annoyingly, predication looks like a good idea. I say annoyingly,
> because I can do things with predication that I cannot do without
> predication. Usually this means that I just need to think harder.
> Certainly, I can dynamically apply predicates in the microarchitecture
> to non-predicated code, but that loses much of the benefit.
Predication adds code bloat and some garbage ILP but if
it allows the exposure of some real ILP then it might
be more than useful, I refer here to S. Mahlke and Patt et al...
If-conversion could expose some more ILP beyond control flow
dependence graphs using just prediction, even if they are hybrid
predictors...
Chris Gomez
All opinions are my own.
--Chris
My own views.
: Needless to say, I think that the dataflow limit can be surpassed.
Hey Andy, Could you explain that?
* Are you saying that 40 IPC can be surpassed? How?
* Are you also assuming an infinite instruction window?
* Do you have a cite for the 40 IPC number for gcc?
Thanks,
Mark "full of questions" Brehob
--
~~~~~~~~~~~~~~~~~~~~~~~bre...@cps.msu.edu~~~~~~~~~~~~~~~~~~~~~~~~~
| | The reports of SIMD's death have been greatly exaggerated | |
| -=-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- |
~~~~~~Mark Brehob: Ultimate Player, Gamer, Computer Geek~~~~~~~~~~
In the ideal world, when we wanted to compare the value of an
architectural feature, we would get two teams. One would go off
and build a compiler for the architecture without the feature,
and one would build the compiler for the architecture with the
feature. Finally, we would compare performances using the
code produced by the compilers optimized for architectures with and
without that feature.
This is particularly true when considering a "deep" feature like
predication. The software structure [IR, optimizations, analysis
phases] of a compiler targeted for a predicated architecture is
quite different from that targeted to a non-predicated architecture.
Consider the case of software pipelining using modulo-scheduling.
If the architecture supported full predication, it would be reasonable
to assume that all inner loops to be software pipelined would
contain only straight-line code. If the compiler did that, then
turning off predication would degrade performance of inner-loops
containing branches - even though a more general software pipelining
algorithm would still work.
Let's take another, more concrete example - the cmppt inner loop
from eqntott, which is where eqntott spends most of its time.
It looks like:
    short a[];
    short b[];
    for( i = 0; i < N; i++ ) {
        x = a[i];
        y = b[i];
        if( x == 2 ) {
            x = 0;
        }
        if( y == 2 ) {
            y = 0;
        }
        if( x != y ) {
            break;
        }
    }
Obviously, with full predication, this becomes
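    i = 0;
    s = i < N;
    while( s ) {
        x = a[i];
        y = b[i];
        p = x == 2;
        q = y == 2;
        x = p ? 0 : x;        /* a 2 becomes 0, anything else is kept */
        y = q ? 0 : y;
        s = x == y;
        i = s ? i + 1 : i;    /* advance only past matching elements */
        s = s && i < N;
    }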
and then this can be software pipelined.
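A minimal C sketch of the two forms above, for readers who want to run them side by side. It assumes the intent is that a 2 is replaced by 0 and that the index advances only past matching elements; the function names `cmppt_branchy` and `cmppt_ifconv` are made up for illustration and are not taken from eqntott. Both return the index of the first mismatch, or n if there is none.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy reference: index of the first position whose normalized
 * values (2 treated as 0) differ, or n if there is none. */
size_t cmppt_branchy(const short *a, const short *b, size_t n)
{
    size_t i;
    assert(n == 0 || (a != NULL && b != NULL));
    for (i = 0; i < n; i++) {
        short x = a[i];
        short y = b[i];
        if (x == 2) x = 0;
        if (y == 2) y = 0;
        if (x != y) break;
    }
    return i;
}

/* If-converted form: the only control transfer left is the loop-back
 * test on s; everything else is a select. */
size_t cmppt_ifconv(const short *a, const short *b, size_t n)
{
    size_t i = 0;
    int s;
    assert(n == 0 || (a != NULL && b != NULL));
    s = i < n;
    while (s) {
        short x = a[i];
        short y = b[i];
        int p = (x == 2);
        int q = (y == 2);
        int eq;
        x = p ? 0 : x;          /* select, not branch */
        y = q ? 0 : y;
        eq = (x == y);
        i += (size_t)eq;        /* advance only past matching elements */
        s = eq & (i < n);       /* bitwise & avoids a short-circuit branch */
    }
    return i;
}
```

Whether a given compiler really emits selects for the ternaries is up to the compiler; the sketch only shows the shape of the transformed loop.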
Now, for a machine without predication, it turns out that a good way
of compiling that loop is to simply convert the inner loop into
    i = 0;
    loop:
    for( ; i < N/sizeof(long); i ++ ){
        x = ((long*)a)[i];
        y = ((long*)b)[i];
        if( x != y ) {
            goto fixup;
            /* fixup: some complex code which converts 2's to 0's
             * and then reenters the loop if it turns out that
             * (x == 2 && y == 0) || (x == 0 && y == 2)
             */
        }
    }
Given the data patterns used in eqntott, it turns out that the loop is
very rarely re-entered. So, this optimization can really speed
up the loop.
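The word-at-a-time idea can be sketched in portable C as follows. This is an illustration under assumptions, not what xlc emits: the name `cmppt_wide` is hypothetical, the sketch counts elements rather than using the N/sizeof(long) bound shown above, and the "fixup" is reduced to checking one element and re-entering the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the index of the first element whose normalized values
 * (2 treated as 0) differ, or n if there is none. */
size_t cmppt_wide(const short *a, const short *b, size_t n)
{
    const size_t lanes = sizeof(uint64_t) / sizeof(short);
    size_t i = 0;
    assert(n == 0 || (a != NULL && b != NULL));
    while (i < n) {
        short u, v;
        if (n - i >= lanes) {
            uint64_t x, y;
            /* memcpy sidesteps alignment and aliasing problems */
            memcpy(&x, a + i, sizeof x);
            memcpy(&y, b + i, sizeof y);
            if (x == y) {       /* fast path: bitwise-equal words */
                i += lanes;
                continue;
            }
        }
        /* fixup (and tail): normalize one element, then re-enter */
        u = (a[i] == 2) ? 0 : a[i];
        v = (b[i] == 2) ? 0 : b[i];
        if (u != v)
            break;
        i++;
    }
    return i;
}
```

The fast path is only a win when the fixup is rarely taken, which is the data-dependent property the paragraph above relies on.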
So, the difference in performance in eqntott using best-of-breed compilers
for both with and without predication should definitely not be 2.75x.
In our VLIW work, for our base case, we used xlc, the IBM C product compiler.
It is a very good compiler tuned for the PowerPC architecture.
We computed ILP using the following rule:
#PowerPC instructions in official SPECpeak runs
ILP = ------------------------------------------
#Cycles taken by VLIW code.
We felt that this was a fairer statement of ILP than using our compiler
as the base case. Further, it allows us to do an apples-to-apples comparison
of potential ILP based performance improvements over Power/PowerPC
implementations.
This explains in part why our ILP numbers for gcc and many of the
other SPECint95 programs are in the 2.5-3 range.
Much of the work is available in the VLIW home page at
http://www.research.ibm.com/vliw
or through the CyberDigest at
http://domino.watson.ibm.com/library/CyberDig.nsf/home
Mayan Moudgill
--
------------------------------------------------------------------------------
| Mayan Moudgill | These are _MY_ opinions. Any resemblance |
| ma...@watson.ibm.com | to IBM's opinions is purely coincidental |
------------------------------------------------------------------------------
The cycle for the CMOV itself. Consider:
if (a < 0)
b += a;
else
b -= a;
c += b;
d += b;
With branches:
bgez ra, skip0
add rb, rb, ra
b skip1
skip0:
sub rb, rb, ra
skip1:
add rc, rc, rb
add rd, rd, rb
With CMOV:
cmp cond, ra, 0
add rtemp0, rb, ra
sub rtemp1, rb, ra # Or calculate the add or the sub into rb
cmov.lt rb, rtemp0, rtemp1 # and use a two operand CMOV to move
# the other.
add rc, rc, rb
add rd, rd, rb
With correct branch prediction, the branchy code runs in 2 cycles. First
add or sub in cycle 1, two dependent adds in the next. Assuming no back to
back ALU in a single cycle. (If you don't like compare-and-branch, just
throw a compare instruction in the first cycle. We're talking infinite
issue bandwidth here right?)
The CMOV code probably takes 3. The cmp, add, and sub in the first cycle,
followed by the dependent cmov in cycle two, followed by two dependent adds
in cycle 3. Of course one can try to "fold out" the CMOV earlier in the
pipeline by playing games with register routing. I can imagine
implementations doing so, though it kind of amounts to putting a
conditional unit on the register file read path. (But is that any worse
than implementing predication efficiently?)
In theory, with IA-64 the code executes in two cycles (compare and
dependent predicated add and sub in one cycle, and the two dependent adds
in the next).
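The cycle counts for the branchy and CMOV sequences can be reproduced with a tiny dependence-graph model. This is a sketch under the same idealized assumptions as the text (unlimited issue width, one cycle per operation, no same-cycle forwarding, branch predicted correctly); the struct and function names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS 16
#define MAX_DEPS  3
#define NO_DEP    (-1)

struct insn {
    const char *text;       /* assembly text, for readability only */
    int deps[MAX_DEPS];     /* indices of earlier producers, or NO_DEP */
};

/* Length in cycles of the earliest-possible schedule. */
int schedule_length(const struct insn *code, size_t n)
{
    int cycle[MAX_INSNS];
    int longest = 0;
    size_t i;
    assert(n <= MAX_INSNS);
    for (i = 0; i < n; i++) {
        int start = 0;
        int d;
        for (d = 0; d < MAX_DEPS; d++) {
            int p = code[i].deps[d];
            if (p != NO_DEP) {
                assert(p >= 0 && (size_t)p < i);  /* producers come first */
                if (cycle[p] > start)
                    start = cycle[p];
            }
        }
        cycle[i] = start + 1;   /* issue one cycle after the last producer */
        if (cycle[i] > longest)
            longest = cycle[i];
    }
    return longest;
}

/* Correctly predicted path of the branchy code (a < 0 case). */
const struct insn branch_path[] = {
    { "add rb, rb, ra", { NO_DEP, NO_DEP, NO_DEP } },
    { "add rc, rc, rb", { 0, NO_DEP, NO_DEP } },
    { "add rd, rd, rb", { 0, NO_DEP, NO_DEP } },
};

/* CMOV version: the cmov waits for the compare and both candidates. */
const struct insn cmov_path[] = {
    { "cmp cond, ra, 0",            { NO_DEP, NO_DEP, NO_DEP } },
    { "add rtemp0, rb, ra",         { NO_DEP, NO_DEP, NO_DEP } },
    { "sub rtemp1, rb, ra",         { NO_DEP, NO_DEP, NO_DEP } },
    { "cmov.lt rb, rtemp0, rtemp1", { 0, 1, 2 } },
    { "add rc, rc, rb",             { 3, NO_DEP, NO_DEP } },
    { "add rd, rd, rb",             { 3, NO_DEP, NO_DEP } },
};
```

Under this model `schedule_length(branch_path, 3)` is 2 and `schedule_length(cmov_path, 6)` is 3; the extra cycle is the CMOV itself sitting on the critical path.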
-Z-
>This seems unlikely. SPECint95 on the R10000 only varies 5% when the
>L2 speed is changed from 1:1 down to 2:1. Similar results seem to hold
>for the Intel Pentium II Xeon systems.
>Without going into too much detail, SPECint95 is not L2 bandwidth-
>limited, and the OOO cores of the R10000 and Pentium II can tolerate
>latencies in the 10 cycle range. I would expect the 21264 to also
>show similar insensitivity of SPECint95 with respect to small changes
>in the L2 cache latency.
How about L2 cache size? Alphas have had up to 8MB cache with them
before, maybe the results were on some high end server class machine
that has a very big cache? Of course, such a thing wouldn't do much
more than win some extra bragging until SPEC99 finally makes an
appearance to keep the benchmark world safe from oversize caches for
a couple more iterations of Moore's Law.
It will be interesting to see "who wins" and "who loses" with SPEC99.
Does anyone have any gut feelings as to how that'll turn out? Or
at least know when we'll start seeing SPEC99 results? :)
Cache size does not change SPECint95 at all, once the cache
size has reached 1 MB. I run on 1, 2, 4, and 8 MB caches,
and the variation is at most a couple of percent....
Hate to follow up on my own posting, but it occurred to me that an
implementation can also fold the CMOV at the backend by arranging to have
the rename stage assign the same destination register for both instructions
which are inputs to the CMOV and then quash the one that isn't going to
happen. Or put another way, just have the front part of the pipeline
synthesize predication for the two instructions that produce the results
going into the CMOV. Though these techniques will only work across a "narrow"
window because the front part of the pipe must see both instructions and
the CMOV "at the same time" in order to do this. (Narrow in quotes 'cause
we're talking about idealized implementations and real ones are starting to
have pretty large lookahead windows.)
-Z-
In your ideal world, those two compiler teams would of course both be
perfect, right? :-)
[snip]
> Lets take another, more concrete example - the cmppt inner loop
> from eqntott, which is where eqntott spends most of its time.
> It looks like:
> short a[];
> short b[];
> for( i = 0; i < N; i++ ) {
> x = a[i];
> y = b[i];
> if( x == 2 ) {
> x = 0;
> }
> if( y == 2 ) {
> y = 0;
> }
> if( x != y ) {
> break;
> }
> }
This is definitely an ugly piece of code. :-(
> Obviously, with full predication, this becomes
> i = 0;
> s = i < N;
> while( s ) {
> x = a[i];
> y = b[i];
> p = x == 2;
> q = y == 2;
> x = p ? 0 : x;
> y = q ? 0 : y;
> s = x == y;
> i = s ? i + 1 : i;
> s = s && i < N;
> }
Actually, since we can assume that IA64 will have packed byte/short/long
operations, it would make a lot of sense to auto-parallelize the code, see
below.
> and then this can be software pipelined.
>
> Now, for a machine without predication, it turns out that a good way
> of compiling that loop is to simply convert the inner loop into
> i = 0;
> loop:
> for( ; i < N/sizeof(long); i ++ ){
> x = ((long*)a)[i];
> y = ((long*)b)[i];
> if( x != y ) {
> goto fixup;
> /* fixup: some complex code which converts 2's to 0's
> * and then reenters the loop if it turns out that
> * (x == 2 && y == 0) || (x == 0 && y == 2)
> */
> }
> }
> Given the data patterns used in eqntott, it turns out that the loop is
> very rarely re-entered. So, this optimization can really speed
> up the loop.
The last paragraph is the crux, the only way I can consider this
optimization to be valid would be if the XLC compiler uses trace
feedback to determine those patterns, and then decides that the rewrite
makes sense. (I would also like to see at least one example of
real-world code where the same optimization produces a gain!)
An architecture with packed operations can do similar tricks without
suffering any penalty in case of getting more ill-behaved input data
(This is how I would do it on an Alpha):
    for(i = 0 ; i < N/sizeof(uint64); i ++ ){
        x = ((uint64*)a)[i];
        y = ((uint64*)b)[i];
        x_mask = packed_byte_compare_not_equal(x, 0x0202020202020202);
        y_mask = packed_byte_compare_not_equal(y, 0x0202020202020202);
        x &= x_mask;
        y &= y_mask;
        if( x != y ) break;
    }
This works because the replacement data value is 0, a more general case
would require another mask and a merge, unless the architecture supports
masked MOV operations.
The downside to my version is of course that the loop latency has been
increased by two cycles, but the fixup code would effectively have to do
the same tests anyway, after first mispredicting the branch into the
fixup code, so as long as my loop can be software unrolled far enough to
saturate the data load units, it should run equally fast.
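For machines without a packed compare, the mask-and-compare step can be sketched with plain 64-bit arithmetic. Note the assumptions: this sketch uses four 16-bit lanes per word, since a[] and b[] are declared short, whereas the pseudocode above names a packed byte compare; the function names are hypothetical, and the arrays are taken as already viewed as 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clear every 16-bit lane of x that holds exactly the value 2. */
uint64_t zero_lanes_equal_to_2(uint64_t x)
{
    const uint64_t twos  = 0x0002000200020002ULL;
    const uint64_t low15 = 0x7FFF7FFF7FFF7FFFULL;
    const uint64_t high  = 0x8000800080008000ULL;
    uint64_t v   = x ^ twos;                  /* lane is 0 where x lane == 2 */
    uint64_t t   = ((v & low15) + low15) | v; /* top bit set iff lane != 0 */
    uint64_t is2 = (~t & high) >> 15;         /* 1 in each lane that held 2 */
    return x & ~(is2 * 0xFFFFULL);            /* widen the flag to a lane mask */
}

/* Index of the first word whose normalized lanes differ, or nwords. */
size_t first_mismatch_word(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    size_t i;
    assert(nwords == 0 || (a != NULL && b != NULL));
    for (i = 0; i < nwords; i++) {
        if (zero_lanes_equal_to_2(a[i]) != zero_lanes_equal_to_2(b[i]))
            break;
    }
    return i;
}
```

The per-lane add never carries across a lane boundary because only the low 15 bits of each lane take part in it, which is what makes the zero test exact.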
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
(x == 2 ? 0 : x) != (y == 2 ? 0 : y)
We are testing whether x = y, or x = 2 and y = 0, or x = 0 and y = 2.
We can rewrite this as
(x != y) && (x | y) != 2
So (after fetching x and y, perhaps with zero or sign extension)
t1 = x | y;
t2 = x - y;
if (t2 != 0) t2 = t1 - 2;
if (t2 != 0) ...
One can modify this to test four shorts at once
in 64-bit registers. If x and y have four such values, then
t1 = x | y;
t2 = x ^ y;
t3 = t1 ^ 0x0002000200020002;
We want to check whether some 16-bit field is nonzero
in both t2 and t3. If the architecture has a SIMD
minimum for 16-bit values, called MIN16, then check whether
MIN16(t2, t3) != 0
Without this, suppose there is a CMPBGE (compare byte) instruction,
as on the DEC Alpha. Fold the shorts into bytes with
t4 = (t2 >> 8) | t2;
t5 = (t3 >> 8) | t3;
and test ((CMPBGE(0, t4) | CMPBGE(0, t5)) & 0x55) != 0x55
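The rewrite can be checked in plain C. The sketch below is illustrative only: `min16` is a portable loop standing in for the hypothetical MIN16 instruction, not a real intrinsic, and all function names are made up. Values are assumed to fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* The test as originally written. */
int differ_reference(unsigned x, unsigned y)
{
    assert(x <= 0xFFFF && y <= 0xFFFF);
    return (x == 2 ? 0 : x) != (y == 2 ? 0 : y);
}

/* The rewritten test: unequal, and not the pair {0, 2}. */
int differ_rewritten(unsigned x, unsigned y)
{
    assert(x <= 0xFFFF && y <= 0xFFFF);
    return (x != y) && ((x | y) != 2);
}

/* Portable stand-in for a 4 x 16-bit unsigned minimum. */
uint64_t min16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    int k;
    for (k = 0; k < 64; k += 16) {
        uint64_t fa = (a >> k) & 0xFFFF;
        uint64_t fb = (b >> k) & 0xFFFF;
        r |= (fa < fb ? fa : fb) << k;
    }
    return r;
}

/* Four shorts at once: true if some lane differs after normalization. */
int any_lane_differs(uint64_t x, uint64_t y)
{
    uint64_t t1 = x | y;
    uint64_t t2 = x ^ y;                        /* nonzero lane: x != y */
    uint64_t t3 = t1 ^ 0x0002000200020002ULL;   /* nonzero lane: (x|y) != 2 */
    return min16(t2, t3) != 0;                  /* some lane nonzero in both */
}
```

An unsigned minimum of two fields is nonzero exactly when both fields are nonzero, which is why MIN16 does the job of a per-lane logical AND here.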
--
Peter-Lawren...@cwi.nl Home: San Rafael, California
Microsoft Research and CWI
This is exactly the same realization I made after posting, but in regard
to IA64 and predicated instructions:
By setting up both paths of the branch to generate the result in the
same register, there's no need to explicitly move it from one path
target to the other.
This however cannot seem to work without hardware renaming of registers,
since there would be no way to physically address the same register in
both paths simultaneously.
The best an IA-64 compiler would seem to be capable of would be to guess
(measure?) which path was the least likely, and then insert the
predicated mov at the end of this path.
I don't understand. Presumably on IA-64 the compiler will arrange for the
last instruction of each path to produce its result in the same register.
Due to predication, only one will execute. The slides seem to indicate that
IA-64 allows compares in a given bundle to influence the output of other
instructions in the same bundle. (No biggie there as by the time you get to
the writeback stage, the compare output is already known.)
-Z-
Compare the ratio of SPECint95 to STREAM copy bandwidth. The higher
this ratio the bigger the negative impact of SPEC99 relative to
competitor CPUs with smaller ratios will probably be (cache busting
and all that).
CPU            SPECint95   STREAM copy   Ratio
                             (GB/s)
21264/500         27.7        1.077        26
21164/600         19.3        0.228        85
PII/400           15.8        0.304        52
PA8000/180        11.8        0.262        45
R10000/250        14.7        0.332        44
P2SC/160           8.6        0.779        11
PPC G3/266        12.4        0.137        91
PPC 604/200        8.0        0.083        96
USII/200           7.9        0.229        35
The POWER and EV6X will probably do relatively better on SPEC99 while
PowerPC and EV5 will come out losers. The rest will probably stay
roughly about the same relative position.
I haven't seen any STREAM number for the PA-8500 but my understanding
is it uses the old HP runway bus (probably ~0.35 GB/s or less on copy)
so it would likely come out a loser.
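The Ratio column is simply SPECint95 divided by STREAM copy bandwidth in GB/s, rounded to the nearest integer. A small sketch of that arithmetic, using two rows from the table; the struct and function names are invented for the example.

```c
#include <assert.h>

struct cpu_entry {
    const char *name;
    double specint95;        /* SPECint95 score */
    double stream_copy_gbs;  /* STREAM copy bandwidth, GB/s */
};

/* Integer performance per unit of sustained memory bandwidth.
 * A high value suggests the score leans more on cache than on memory. */
double specint_per_stream(const struct cpu_entry *e)
{
    assert(e != 0 && e->stream_copy_gbs > 0.0);
    return e->specint95 / e->stream_copy_gbs;
}

const struct cpu_entry ev6 = { "21264/500", 27.7, 1.077 };
const struct cpu_entry ev5 = { "21164/600", 19.3, 0.228 };
```

For these two rows the function gives roughly 25.7 and 84.6, matching the 26 and 85 in the table.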
> > There isn't a mythical 40 at 600. You must have gotten ahold
> > of an overseas source or a marketing number.
>
> I seem to recall the 40 and 60 numbers from an ISSCC conference in SFO
> two years ago. The estimate was 40 SPECint95 and 60SPECfp95 at 600MHz
> for a 21264.
I remember indistinctly an explanation that the 40/60 numbers were thrown out
by some marketeer, never really officially retracted, but that a head rolled
in consequence. Thus, the only "official" prediction was and is 30/50, but
I can't remember whether that was for 500 or 600 MHz.
Jan
I agree, I have been flip-flopping between thinking that this should be
easy and that it wouldn't work at all.
It all depends upon the validity of specifying the same target register
for two oppositely predicated instructions, possibly (probably) inside
the same bundle.
After the point where the paths merge, processing cannot continue until
the predicate is known, so if the predicate can have very long latency,
and nothing else useful can be done with the available execution units,
the compiler will have to delay the path merging, effectively doing all
(post-merge) operations twice, and then discarding the half
corresponding to the non-predicated path.
Interesting optimization problem. :-)
Rob
> That ratio should get better with DDSRAM.. at 5.2 GByte/Sec
> bandwidth , should yield 2.6 GByte/Sec STREAM.
???
Did you perhaps mean DDR SDRAM?
Achim Gratz.
--+<[ It's the small pleasures that make life so miserable. ]>+--
WWW: http://www.inf.tu-dresden.de/~ag7/{english/}
E-Mail: gr...@ite.inf.tu-dresden.de
Phone: +49 351 463 - 8325
From MPR Report Volume 10, Issue 14
Digital 21264 Sets New Standard
"Along with a consortium of leading SRAM vendors that includes Motorola and
Samsung, Digital has specified a new "dual-data" SRAM. [small snip] Using both
edges of a 167-MHz clock, these parts can produce data at 333 MHz, increasing
the 21264's cache bandwidth to its maximum 5.3 Gbytes/s."
Rob
> In article <cvhfyd...@ite127.inf.tu-dresden.de>, Achim Gratz <gr...@ite.inf.tu-dresden.de> writes:
>> you...@eisner.decus.org (Rob Young) writes:
>>> That ratio should get better with DDSRAM.. at 5.2 GByte/Sec
>> Did you perhaps mean DDR SDRAM?
Dual Data Rate Synchronous Dynamic RAM, right? Using both edges of a
clock to transfer data, effectively doubling the bandwidth, right?
> "dual-data" SRAM, DDSRAM .. maybe not a good choice on my part.
[explanation snipped]
So, same thing, only with Static RAM instead of Sync Dynamic RAM,
right?
Wait a second - don't the SRAM L2 caches in e.g. PIIs already
deliver data at rates from 167MHz (PII 333) to 450MHz (PII Xeon 450)?
So, is the "new standard" that 21264 sets inferior to old fashioned
"SDR" technology from Intel? Or are they planning to use SRAM for
main memory? What gives?
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
I looked at the Trimaran website (which might have something to do with
Merced, but nobody really knows). The machine described there lets you
create two opposite predicates in one instruction, for example the
predicate a<b and a>=b.
This seems to indicate that there are no separate "do this if predicate is
true" and "do this if predicate is false" instructions, so for opposite
predication you would have to specify two different predicates.
The processor cannot know that these two predicates are related, so this
would give you quite a problem.
You have just described what I call the "Predicate Dataflow Latency Problem".
I am sure that you can think of more solutions than just this.
> I haven't seen any STREAM number for the PA-8500 but my understanding
> is it uses the old HP runway bus (probably ~0.35 GB/s or less on copy)
> so it would likely to come out a loser.
The PA-8500 in the C360 uses the old runway bus. The new workstations
being introduced April 1st use a new double clocked runway bus, which I
believe is also supposed to have lower latency. This is where
the "preliminary" results of 32 SPECint and 52 SPECfp came from. I
would assume that, given the various improvements, it should pretty
much double the STREAM on the new HP stuff. Still not up to where IBM
has been for a while now, and where Digital has recently got to. Hopefully HP
is working on something new or made other improvements to the new
Runway to let it come closer to its theoretical max (1920 MB/sec, or
1536 MB/sec including overhead) than the old one did (768 MB/sec
including overhead, but never realizing much more than 320 MB/sec)
I will be ordering a few of the new workstations once they are announced,
so if no one else has reported STREAM results by the time I get them I'll
be sure and do so.
>
> Wait a second - doesn't the SRAM L2 caches in e.g. PIIs already
> deliver data at rates from 167MHz (PII 333) to 450MHz (PII Xeon 450).
>
> So, is the "new standard" that 21264 sets inferior to old fashioned
> "SDR" technology form Intel? Or are they planning to use SRAM for
> main memory? What gives?
>
Well, I can't comment on unannounced products, but the technology that
Intel uses for the Xeons is a custom-designed, deeply pipelined SRAM.
They make it on the gen-1 fab lines so it doesn't steal wafer space
with the actual processors. According to MPR the Xeons actually have
worse L2 latency than either the standard P-1/2s or the Celerons (which
have the best of any P6 based product).
So the big difference is that DDR-SRAM is a whole lot cheaper than
doing a custom designed off chip SRAM.
Aaron Spink
not speaking for compaq
As has been publicly disclosed, in general in IA-64, an instruction reading
any resource (such as a predicate) must be in a later instruction group
(cycle) than the instruction writing that resource. As a special case,
branches are allowed to use a predicate written by another instruction in
the same instruction group (as shown in the IDF slides).
So, the straightforward (but slow) IA-64 schedule for the earlier example:
if (a < 0)
b += a;
else
b -= a;
c += b;
d += b;
would be:
cmp.lt pLT, pNLT = a, 0 // pLT & pNLT are 2 complementary preds
;;
(pLT) add b = b, a // add to b [then]
(pNLT) sub b = b, a // or sub from b [else]
;;
add c = c, b // uses of b
add d = d, b
;;
which takes 5 instructions in 3 cycles. (Note: In IA-64 assembly, ";;"
indicates the end of an instruction group, "=" separates the target
operand(s) from the source(s), "//" begins a comment, and (pred)
specifies the controlling predicate.)
An alternate (faster) schedule in IA-64 is as follows:
sub bTmp = b, a // speculatively sub from b (into temp)
add b = b, a // and add to b
cmp.lt pLT, pNLT = a, 0
;;
(pLT) add c = c, b // uses of b [then]
(pLT) add d = d, b
(pNLT) add c = c, bTmp // uses of b (temp) [else]
(pNLT) add d = d, bTmp
(pNLT) mov b = bTmp // move bTmp to b [else]
;;
This takes 8 instructions in 2 cycles and one extra register. The final
move of bTmp to b can be eliminated if b isn't live out at that point.
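In C terms, the two schedules correspond roughly to the following pair of functions. This is a sketch of the data flow only, with hypothetical names; it says nothing about how an IA-64 compiler actually allocates registers or bundles.

```c
#include <assert.h>
#include <stddef.h>

struct state {
    long b, c, d;
};

/* Straightforward form: b is resolved before its uses. */
void update_branchy(struct state *s, long a)
{
    assert(s != NULL);
    if (a < 0)
        s->b += a;
    else
        s->b -= a;
    s->c += s->b;
    s->d += s->b;
}

/* Mirrors the faster schedule: both candidate values of b are computed
 * before the compare resolves, and the uses pick one by predicate.
 * Caveat: in C the untaken add or sub must not overflow either, since
 * signed overflow is undefined even on the discarded path. */
void update_speculative(struct state *s, long a)
{
    long b_tmp, b_add, b_new;
    int p_lt;
    assert(s != NULL);
    b_tmp = s->b - a;               /* sub bTmp = b, a */
    b_add = s->b + a;               /* add b = b, a */
    p_lt  = a < 0;                  /* cmp.lt pLT, pNLT = a, 0 */
    b_new = p_lt ? b_add : b_tmp;   /* stands in for the predicated uses */
    s->c += b_new;
    s->d += b_new;
    s->b  = b_new;                  /* the final (pNLT) mov */
}
```

The trade is visible in the source: the speculative form does one extra arithmetic operation and needs one extra temporary, in exchange for taking the compare off the critical path.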
-- Jim
My fault. I am messing this up badly. The system bus
(point-to-point path to main memory) is 64-bit and runs at
333 MHz.. 2.0 GByte/sec. Unless this gets better in future
21264 chipsets/motherboards I believe that STREAM won't get better.
I shouldn't even have been talking about L2 (SRAM). We do know STREAM
gets better in 21364 for sure with an on-chip memory controller.
Regarding Pentiums.. what is the system bus width and speed,
the better the system bus, the better the STREAM number (with
equal memory type).
21264 details are available in several places, click on
http://www.mdronline.com/ and search for 21264 if interested.
Rob
: The processor cannot know that these two predicates are related, so this
: would give you quite a problem.
By "oppositely predicated" I meant "guaranteed by program logic that one or
the other executes." This perhaps will not be known until the writeback
stage where one of the outputs will be quashed. If neither executes, the
register is of course unchanged. The question is what happens when both
execute. Probably undefined.
Jim Hull cleared up the scheduling issue quite handily. (I was confused by
the example where compares can affect a branch in the same cycle. I went
back and looked at it last night and sure enough, there were no examples
with a compare and its dependent computational instruction in the same
cycle.) However Christian's issue comes up anyway, just in a later cycle.
And Jim's example makes it clear that Merced does allow the case of two
instructions writing to the same register in a single bundle.
-Z-
: sub bTmp = b, a // speculatively sub from b (into temp)
: add b = b, a // and add to b
: cmp.lt pLT, pNLT = a, 0
: ;;
: (pLT) add c = c, b // uses of b [then]
: (pLT) add d = d, b
: (pNLT) add c = c, bTmp // uses of b (temp) [else]
: (pNLT) add d = d, bTmp
: (pNLT) mov b = bTmp // move bTmp to b [else]
: ;;
: This takes 8 instructions in 2 cycles and one extra register.
Implying Merced will have at least five integer function units? :-)
G.
Chris Ruemmler wrote:
>
> In article <36EF1E...@world.std.com>,
> Terry C. Shannon <sha...@world.std.com> wrote:
> >It had better deliver that kind of performance since EV67 at 667MHz is
> >putting out about 50SPECint95 / 70SPECfp95.
> >
> Given a 500Mhz 21264 gets 23.6 SpecInt95_BASE and 27.7 SpecInt95_PEAK
> in the "good" CEC design, I find it hard to believe that a 33% increase
> in clock is going to deliver between a 80-100% speedup in SpecInt95
> performance! Either the Compaq compiler guys just were terrible
> (which I doubt) or something is quite wrong here. Even with faster and
> larger caches this does not compute for SpecInt95 :).
Chris is right, this speedup jump is not going to happen, IMHO. I don't
think we'll see either the 50 at 667 or the 100int/150fp once claimed for the
GHz piece. But it'll still be top dog.
Cheers,
HK
--
Henrik Klagges - IT Analyst
hen...@strategypartners.com
PGPKey available on request
Based on the numbers (e.g. clock frequency increase), this is correct.
The 50 SPECint95 number I obtained from a CPQ person indeed is specious.
Using an 8MB external cache, the 667MHz EV67 part is delivering ~70
SPECfp95 right now. As for SPECint95, 35-38 seems more reasonable than
50.
CPQ's goal is to ship EV67 at 700-750MHz which could nudge the SPECint
past 40, but time will tell.
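The sanity check behind these numbers is simple clock scaling from the 27.7 SPECint95 peak quoted earlier in the thread for the 500MHz 21264. A sketch, assuming performance scales at best linearly with clock (an optimistic assumption, since memory latency does not shrink with the core clock); the function names are invented.

```c
#include <assert.h>

/* Best-case score if performance tracked clock frequency exactly. */
double scale_with_clock(double score, double old_mhz, double new_mhz)
{
    assert(old_mhz > 0.0);
    return score * new_mhz / old_mhz;
}

/* Speedup factor needed to go from old_score to target. */
double required_speedup(double old_score, double target)
{
    assert(old_score > 0.0);
    return target / old_score;
}
```

`scale_with_clock(27.7, 500.0, 667.0)` comes to about 36.95, in line with the 35-38 range, while `required_speedup(27.7, 50.0)` is about 1.8 against a clock ratio of only about 1.33.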
Terry Shannon
http://www.acersoft.com
A more general advantage of SMT is that non-speculative execution can be
given priority over speculative execution dynamically. The fetch unit can
give preference to those threads that are not attempting to speculatively
fetch past a branch instruction. A thread which is not at risk of being
deleted by a pending branch instruction can be given preference over threads
that are executing in the speculative shadow of a branch instruction.
Speculation can occur only when there are sufficiently idle resources,
increasing actual throughput.
The liability of predication is that it statically consumes resources that
might have a better use.
Hugh
----------
In article <7co2ku$9no$1...@news.btv.ibm.com>, ma...@watson.ibm.com (mayan)
wrote:
>
[snip]
> IMHO, predication mainly helps by removing some of the 10% or so hard to
> predict branches. Concretely, assume that we have a if-then statement,
> which can be converted into a predicated instruction.
> If that instruction had not been predicated, and we had mispredicted,
> then the pipeline would have been flushed, and the instruction would have
> been delayed by the branch-mispredict penalty.
>
> Actually the main benefit of predication in an OOO implementation has
> less to do with the predicated instructions, and more to do with
> the fact that we can execute independent instructions from below
> the predicated instructions, possibly before the predicate is known,
> and these instructions will not get flushed because of misprediction.
>
> This advantage of predication is counter-balanced by the fact that
> predication introduces more instructions on the taken path, and
> that it delays the execution of instructions. The first point is obvious.
[snip]
As will be done in Alpha 21464 "Arana."
Technically, that's not SMT. The standard terms are "eager execution",
"multipath execution" or "polypath" execution. There were two papers
in ISCA'98 concerning multipath execution, another paper in the
Intl. Supercomputing Conference in Australia '98 by a group at
Princeton and another group from Barcelona on similar designs.
The implementation in a traditional scalar processor isn't that
hard. Determining when to initiate multipath execution is the more
difficult issue.
SMT, on the other hand, mixes instructions from different threads or
different programs. Those threads use normal speculative execution.
You can also combine SMT and multipath execution.
> As will be done in Alpha 21464 "Arana."
--
While Digital has publicly stated that a future processor will
support SMT, I don't believe they've said anything about multipath
execution.
--
Dirk Grunwald Assoc. Prof, Univ. of Colorado at Boulder, Currently
living at Compaq-WRL and having a darn good time of it. No one
listens to me at the University, I don't see why they would do it at
Compaq.
In my mind, simultaneous multithreading (SMT) is an implementation technique
relevant to multithreaded architectures. As with any implementation
technique, it will be more well suited to some multithreaded architectures
than to others. (E.g. SMT will show up sooner for short instruction word
multithreaded architectures than for long instruction word ones.)
: A more general advantage of SMT is that non-speculative execution can be
: given priority over speculative execution dynamically.
And here we get to the meat of things. You are talking about speculative
execution. The speculative threads created above must have a mechanism to
control their side effects until it is known that the thread is not
speculative. One can certainly layer this on SMT hardware, but it has
little to do with a multithreaded programming model. (And such hardware
will need all the predication support under the covers anyway.)
There is lots of research in how to manage speculation entirely within
hardware, perhaps using hints from the programming model. Look under the
heading "super speculative." Some techniques get very elaborate, pruning
speculative paths as branches are either resolved or their prediction
parameters are modified. Such architectures are yet another way of using
huge amounts of hardware, but they are orthogonal to an exposed
multithreaded programming model.
Of course such speculation raises all sorts of problems with side effects
such as loads and stores. Both performance and semantics wise. (E.g.
speculative loads and stores can slow the program down, or break the
program if they are to I/O devices, etc.)
: The liability of predication is that it statically consumes resources that
: might have a better use.
You can always look upon the predicates as semantic information and execute
the program in a different fashion than one otherwise would expect. It's
solely a question of the amount of hardware you're going to dedicate to
"recompiling" the instruction stream at runtime. SMT dynamically assigns
multiple streams to functional units. Predication can be thought of as
doing the same thing except the compiler has to interlace the streams
statically. The proposed speculation mechanism generates the streams
dynamically from a single instruction stream. Think of the differences in
the programming model before the differences in the hardware.
-Z-
Terminology check: what you describe is an implementation
of "Eager Execution" - executing both sides of some branches.
Perhaps even what Augustus Uht calls DEE - Disjoint Eager Execution
- in light of the expected benefit heuristic. (One of those silly things
I consider obvious to WANT to do.)
SMT (Simultaneous Multi-Threading) is usually reserved for executing
completely independent programs, or at least threads. Though, of course,
a multi-threaded microarchitecture might be a good basis for other ILP
optimizations.
>Doug Siebert wrote:
>[snip]
>> It will be interesting to see "who wins" and "who loses" with SPEC99.
>> Does anyone have any gut feelings as to how that'll turn out? Or
>> at least know when we'll start seeing SPEC99 results? :)
>
> Compare the ratio of SPECint95 to STREAM copy bandwidth. The higher
> this ratio the bigger the negative impact of SPEC99 relative to
> competitor CPUs with smaller ratios will probably be (cache busting
> and all that).
>
> CPU SPECint95 STREAM copy Ratio
> (GB/s)
>
> 21264/500 27.7 1.077 26
> 21164/600 19.3 0.228 85
> PII/400 15.8 0.304 52
> PA8000/180 11.8 0.262 45
> R10000/250 14.7 0.332 44
> P2SC/160 8.6 0.779 11
> PPC G3/266 12.4 0.137 91
> PPC 604/200 8.0 0.083 96
> USII/200 7.9 0.229 35
>
>
> The POWER and EV6X will probably do relatively better on SPEC99 while
> PowerPC and EV5 will come out losers. The rest will probably stay
> roughly about the same relative position.
>
> I haven't seen any STREAM number for the PA-8500 but my understanding
> is it uses the old HP runway bus (probably ~0.35 GB/s or less on copy)
> so it would likely to come out a loser.
This table, based on CPUs and not listing the associated chip sets, seems
somewhat misleading.
Maynard
>Perhaps the real competition to predication will be Simultaneous
>Multi-Threading. If the fetch unit creates separate threads for a branch
>taken and not taken, then speculative execution can continue for both the
>branch taken and not taken. When the branch instruction finally executes,
>one of the two threads will be deleted. The challenge would be to develop
>heuristics that would allow each thread to make progress in proportion to
>its predicted probability of surviving the execution of the branch
>instruction. Branch prediction would evolve into intra-thread resource
>scheduling.
Oh get real.
Predication is used for branches of the form
if(a==0){
b=c+d;
}else{
b=c-d;
}
It's not used for branches of the form
if(a==0){
perform 1000 instructions
}else{
perform 1000 different instructions
}
If your CPU is capable of spawning a new thread and the associated
bookkeeping in less time than is lost to a missed branch, it is grossly
misusing resources.
Maynard
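The short-block versus long-block distinction can be put as a back-of-envelope cost model. The Python sketch below is only that: the issue width, misprediction rate and penalty are hypothetical numbers chosen for illustration, and the model ignores dependencies, cache effects and everything else a real pipeline cares about.

```python
def predicated_cycles(n_then, n_else, width):
    """Both arms are issued; the predicate nullifies the losing arm."""
    return (n_then + n_else) / width


def branch_cycles(n_then, n_else, width, p_then, miss_rate, miss_penalty):
    """One arm is executed, plus the expected cost of mispredicting the branch."""
    expected_arm = p_then * n_then + (1.0 - p_then) * n_else
    return expected_arm / width + miss_rate * miss_penalty


def prefer_predication(n, width=4, p_then=0.5, miss_rate=0.1, miss_penalty=10):
    """True if predicating two n-instruction arms beats branching (toy model)."""
    return predicated_cycles(n, n, width) < branch_cycles(
        n, n, width, p_then, miss_rate, miss_penalty
    )


if __name__ == "__main__":
    for n in (1, 4, 16, 1000):
        print(n, prefer_predication(n))
```

With these assumed parameters, a one-instruction arm favours predication, while 1000-instruction arms favour the branch by a wide margin, because the work thrown away grows with block size and the misprediction penalty does not.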
>A more general advantage of SMT is that non-speculative execution can be
>given priority over speculative execution dynamically. The fetch unit can
>give preference to those threads that are not attempting to speculatively
>fetch past a branch instruction. A thread which is not at risk of being
>deleted by a pending branch instruction can be given preference over threads
>that are executing in the speculative shadow of a branch instruction.
>Speculation can occur only when there are sufficiently idle resources,
>increasing actual throughput.
>
>The liability of predication is that it statically consumes resources that
>might have a better use.
>
Thanks for the clarification, I had assumed that this would be the case.
[snip]
> An alternate (faster) schedule in IA-64 is as follows:
>
> sub bTmp = b, a // speculatively sub from b (into temp)
> add b = b, a // and add to b
> cmp.lt pLT, pNLT = a, 0
> ;;
> (pLT) add c = c, b // uses of b [then]
> (pLT) add d = d, b
> (pNLT) add c = c, bTmp // uses of b (temp) [else]
> (pNLT) add d = d, bTmp
> (pNLT) mov b = bTmp // move bTmp to b [else]
> ;;
>
> This takes 8 instructions in 2 cycles and one extra register. The final
> move of bTmp to b can be eliminated if b isn't live out at that point.
Thanks again!
You've both verified my conjecture about the predicate latency problem,
and the solution I guessed at, that you might have to end one of the
predicated blocks with a MOV to merge the results. :-)
It is obvious that to be able to hide the predicate latency, you must
fill at least the next cycle with useful work from both paths of
execution, sometimes doing the same work twice so you can keep the
dependency chains independent.
In a simple if () block, this will be much harder, because if the block
is selected then there will be no else {} block which can be executed in
parallel while waiting for the predicate result.
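To make the merge step concrete, here is a small Python model of the quoted IA-64 schedule next to a plain if/else. The branchy version is an assumption: it is the source form inferred from the schedule itself (add a to b when a < 0, subtract otherwise, then use b twice), since the original source is not quoted here. Function names are illustrative.

```python
def branchy(a, b, c, d):
    """Assumed source semantics, inferred from the quoted schedule."""
    if a < 0:
        b = b + a
    else:
        b = b - a
    c = c + b
    d = d + b
    return b, c, d


def predicated(a, b, c, d):
    """Model of the two-cycle schedule: compute both values of b, then select."""
    # Cycle 1: both candidate results and the predicate pair.
    b_tmp = b - a            # sub bTmp = b, a
    b = b + a                # add b = b, a
    p_lt = a < 0             # cmp.lt pLT, pNLT = a, 0
    p_nlt = not p_lt
    # Cycle 2: each instruction takes effect only when its predicate is true.
    if p_lt:
        c = c + b            # (pLT)  add c = c, b
        d = d + b            # (pLT)  add d = d, b
    if p_nlt:
        c = c + b_tmp        # (pNLT) add c = c, bTmp
        d = d + b_tmp        # (pNLT) add d = d, bTmp
        b = b_tmp            # (pNLT) mov b = bTmp, the merge
    return b, c, d


if __name__ == "__main__":
    print(branchy(-2, 10, 1, 1), predicated(-2, 10, 1, 1))
    print(branchy(3, 10, 1, 1), predicated(3, 10, 1, 1))
```

The final conditional move is the merge discussed above: without it, b would hold the "then" value on the "else" path.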
> My fault. I am messing this up badly. The system bus
> (point-to-point path to main memory) is 64-bit and runs at
> 333 MHz.. 2.0 GByte/sec.
All right, we're talking main memory. 333MHz is pretty respectable,
and I assume it's really 167MHz DDR. So, did you in fact mean
DDR-SDRAM (Synchronous Dynamic), as the poster a few messages back
suggested?
> Unless this gets better in future 21264 chipsets/motherboards
> I believe that STREAM won't get better.
One obvious thing to do might be to increase width? Previous alphae
have had 128bit memory busses, no? Perhaps the point-to-point thingie
makes this prohibitive?
> Regarding Pentiums.. what is the system bus width and speed,
If this is a non-rhetorical question, all Pentiums have 64bit busses,
new PIIs at 100MHz using SDRAM, older PIIs and all Pentiums and
Celerons at 66MHz using SDRAM or EDO. (Really, really ancient
Pentiums running the bus at 50, 60 or 66MHz.)
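As a rough illustration of what those bus figures mean, peak theoretical bandwidth is bus width in bytes times transfers per second. The Python sketch below assumes one transfer per clock and that the PII/400 in the earlier table is one of the 100MHz-bus parts. The STREAM copy figure is the one from that table. The function name is made up for illustration.

```python
def peak_bandwidth_gbs(bus_width_bits, clock_mhz):
    """Theoretical peak in GB/s (10**9 bytes/s), assuming one transfer per clock."""
    return (bus_width_bits / 8) * clock_mhz * 1e6 / 1e9


# 64-bit bus at 100MHz, as described for the newer PIIs.
pii_peak = peak_bandwidth_gbs(64, 100)

# STREAM copy for the PII/400, from the table quoted earlier in the thread.
pii_stream_copy = 0.304

# Fraction of the theoretical peak that the measured copy bandwidth achieves.
efficiency = pii_stream_copy / pii_peak

if __name__ == "__main__":
    print(f"peak {pii_peak:.3f} GB/s, STREAM copy {pii_stream_copy:.3f} GB/s, "
          f"ratio {efficiency:.2f}")
```

The gap between peak and sustained numbers is one reason a raw bus width and clock say little on their own about what STREAM will report.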
> Given a 500MHz 21264 gets 23.6 SpecInt95_BASE and 27.7 SpecInt95_PEAK
> in the "good" CEC design, I find it hard to believe that a 33% increase
> in clock is going to deliver an 80-100% speedup in SpecInt95
> performance! Either the Compaq compiler guys just were terrible
> (which I doubt) or something is quite wrong here. Even with faster and
> larger caches this does not compute for SpecInt95 :).
>
I'm not going to comment on specific numbers, but going off of base
and comparing to peak is a little fishy. :) Personally, I believe there
is a lot left on the floor. When Profile-Me gets tuned in and the
tools catch up, I think the compiler guys will be able to squeeze more
performance out of the thing.
Aaron Spink
not speaking for Compaq
--Chris
My own views.
Paul DeMone wrote:
>
> Compare the ratio of SPECint95 to STREAM copy bandwidth.
Your ratio is probably as useless as a SPECint95/MHz ratio!
+---------------------------------------------------------------------+
| Paul A. Jacobi Phone: (603) 884-1948 |
| Compaq Computer Corporation FAX : (603) 884-0189 |
| OpenVMS Systems Group, ZKO3-4/U14 Email: jac...@star.enet.dec.com |
| 110 Spitbrook Road |
| Nashua, NH 03062-2698 |
| |
| To reply, remove "nospam-" from the return address. |
+---------------------------------------------------------------------+
What did the original poster request? A gut feel for who
would win and who would lose on SPEC99. Why don't you:
explain why SPECint95 to copy bandwidth ratio doesn't
at least partially reflect the degree to which a CPU's
performance is cache bound. Provide an alternative
back of an envelope type of metric and describe why it
is more suitable.
Because otherwise your posted reply was 10% opinion, 90%
sig, and 0% contribution to topic being discussed.
--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.
: What did the original poster request? A gut feel for who
: would win and who would lose on SPEC99. Why don't you:
: explain why SPECint95 to copy bandwidth ratio doesn't
: at least partially reflect the degree to which a CPU's
: performance is cache bound. Provide an alternative
: back of an envelope type of metric and describe why it
: is more suitable.
: Because otherwise your posted reply was 10% opinion, 90%
: sig, and 0% contribution to topic being discussed.
Is SPECint99 only going to include cache unfriendly tests which involve
accessing memory in predictable sequential patterns as in STREAM, or are
there going to be tests with more unpredictable, irregular patterns
involved?
Anil
> Is SPECint99 only going to include cache unfriendly tests which involve
> accessing memory in predictable sequential patterns as in STREAM, or are
> there going to be tests with more unpredictable, irregular patterns
> involved?
If spec99 depends on easily predictable mass bandwidth, vendors will
just add a "read xKB real quick now into cache" instruction for read
performance and a big write-gathering device to improve write
performance.
But I think we can be relatively sure that the spec guys are much
smarter than that.
> Anil
--
Sander
There is no love, no good, no happiness and no future -
all these are just illusions.
Similar to "Threaded Multiple Path Execution", Wallace, Calder, Tullsen,
ISCA98, which combines SMT, eager execution, and a policy for when to
perform eager execution.
> A more general advantage of SMT is that non-speculative execution can be
> given priority over speculative execution dynamically. The fetch unit can
> give preference to those threads that are not attempting to speculatively
> fetch past a branch instruction. A thread which is not at risk of being
> deleted by a pending branch instruction can be given preference over threads
> that are executing in the speculative shadow of a branch instruction.
> Speculation can occur only when there are sufficiently idle resources,
> increasing actual throughput.
>
This fetch policy was discussed in "Exploiting Choice: Instruction Fetch
and Issue on an Implementable Simultaneous Multithreading Processor",
Tullsen, Eggers, Emer, Levy, Lo, Stamm, ISCA96.
But results showed that other policies worked better (such as
fetching from the thread with the least number of active instructions).
The reduced penalty of branch misprediction in SMT, plus the fairly
good levels of branch prediction, makes the policy of prioritizing
non-speculative threads less important.
I'd say the general advantages of SMT are:
memory latency tolerance, exposing ILP, internal latency tolerance,
reduced wasted resources.
> The liability of predication is that it statically consumes resources that
> might have a better use.
>
Predication can increase ILP and utilize resources that might
have been idle. SMT can also increase ILP and utilize resources, but
with potentially less wasted execution bandwidth. But then SMT has its
own set of disadvantages as well.
Jim Burns
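The two fetch policies being compared can be sketched in a few lines of Python. This is a toy model only: `ThreadState` and its fields are hypothetical, and a single in-flight count stands in for whatever the paper actually counts as "active instructions".

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    in_flight: int       # instructions fetched but not yet completed (assumed metric)
    speculative: bool    # True if fetching past an unresolved branch


def pick_fewest_in_flight(threads):
    """Fetch from the thread with the fewest active instructions."""
    return min(threads, key=lambda t: (t.in_flight, t.tid)).tid


def pick_non_speculative_first(threads):
    """Prefer threads not in the shadow of a pending branch; ties go to fewer in flight."""
    return min(threads, key=lambda t: (t.speculative, t.in_flight, t.tid)).tid


if __name__ == "__main__":
    threads = [
        ThreadState(0, 12, False),
        ThreadState(1, 3, True),
        ThreadState(2, 7, False),
    ]
    print(pick_fewest_in_flight(threads), pick_non_speculative_first(threads))
```

In the example, the first policy picks the speculative thread because it has the least work queued, while the second skips it in favour of a non-speculative thread. Which choice gives better throughput is an empirical question that this sketch does not answer.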
> Is SPECint99 only going to include cache unfriendly tests which involve
> accessing memory in predictable sequential patterns as in STREAM, or are
> there going to be tests with more unpredictable, irregular patterns
> involved?
>
I'm not commenting on any specifics in Spec99 because there currently
isn't any Spec99, but historically each iteration of Spec has had a
larger effective working set and been more data intensive. Certainly
the floating point codes will have a bigger footprint, and I would
imagine that the integer code would have "bloated" up as well.
In that regard, looking at STREAM as a measuring stick for how an
individual machine will be affected by code that may not be as cache
friendly (bigger working set, weird access patterns, etc.) is a usable
idea.
aaron
: In that regard, looking at stream as a measuring stick as to how an
: individual machine will be affected by code that may not be as cache
: friendly (bigger working set, weird access patterns, etc) is a usable
: idea.
But STREAM doesn't involve weird access patterns. If you do a load
that isn't satisfied from the cache, you retrieve the data to fill
a cache block from memory. But in the process you are loading
adjacent data into the cache that will satisfy a load in the immediate
future. I would imagine that this is a common access pattern for applications
dealing with large sets of floating point data. However, for applications
dealing with integer data, there is probably a significant group which
will access random data that wasn't in cache, at the same time pulling
data into cache that won't be used. Furthermore, they may use that data
to influence what other data they will pull into memory, introducing
significant latency effects.
: aaron
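The difference between the two access patterns can be shown with a toy cache model. The Python sketch below counts misses in a small direct-mapped cache for a linear sweep and for scattered accesses over the same array. The cache geometry (64 lines of 8 words) and the array size are arbitrary assumptions, and the model says nothing about latency or the dependent loads mentioned above.

```python
import random


def miss_rate(addresses, num_lines=64, words_per_line=8):
    """Miss rate of a direct-mapped cache for a sequence of word addresses."""
    tags = [None] * num_lines      # which memory line each cache slot holds
    misses = 0
    count = 0
    for addr in addresses:
        line = addr // words_per_line
        slot = line % num_lines
        if tags[slot] != line:     # miss: fetch the whole line into the slot
            tags[slot] = line
            misses += 1
        count += 1
    return misses / count


N = 1 << 16                        # array size in words, far larger than the cache
rng = random.Random(0)             # fixed seed so the run is repeatable
scattered = [rng.randrange(N) for _ in range(N)]

if __name__ == "__main__":
    print("linear sweep miss rate:", miss_rate(range(N)))
    print("scattered miss rate:   ", miss_rate(scattered))
```

In the linear sweep every fetched line is used in full, so one access in eight misses. In the scattered case almost every access misses and most of each fetched line goes unused, which is the wasted bandwidth described above.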
> But STREAM doesn't involve weird access patterns. If you do a load
> that isn't satisfied from the cache, you retrieve the data to fill
> a cache block from memory. But in the process you are loading
> adjacent data into the cache that will satisfy a load in the immediate
> future. I would imagine that this is a common access pattern for applications
> dealing with large sets of floating point data. However, for applications
> dealing with integer data, there is probably a significant group which
> will access random data that wasn't in cache, at the same time pulling
> data into cache that won't be used. Furthermore, they may use that data
> to influence what other data they will pull into memory, introducing
> significant latency effects.
Your characterization of floating point programs isn't very
good. There are some programs with nice linear access patterns and
some without.
-- g
: Your characterization of floating point programs isn't very
: good. There are some programs with nice linear access patterns and
: some without.
: -- g
I said it was a common pattern for floating point programs, not that
it was the only pattern. The problem is that STREAM uses a nice linear
access pattern. Getting back to the issue of using a SPECint95 to STREAM
COPY ratio as an indicator of who will win and who will lose and by
how much on SPECint99, that would only be valid if the new tests had
plenty of linear access. But integer applications are more varied and
many have significantly less linear access but are quite cache
unfriendly, so a lot of the potential bandwidth as measured by STREAM
actually won't be achieved, and some processors could end up wasting
more bandwidth than others.
The other issue is if SPECint99 is to be a representative suite, should
it include more cache unfriendly, latency sensitive operations?
Anil
> I said it was a common pattern for floating point programs, not that
> it was the only pattern.
OK. And it's a common access pattern for integer programs, too. So
integer and floating point programs are the same in that respect. You
said they were different.
> The problem is that STREAM uses a nice linear
> access pattern.
That's not a problem. That's simply what it does. If you want to
measure random accesses, lmbench does that kind of thing.
> The other issue is if SPECint99 is to be a representative suite, should
> it include more cache unfriendly, latency sensitive operations?
You seem to have mistaken SPECint99 for a microbenchmark. SPECint99 is
composed of applications.
-- g
I think the problem is terminology. He said "integer," and meant
"commercial data processing." (I think.)
> > The problem is that STREAM uses a nice linear
> > access pattern.
>
> That's not a problem. That's simply what it does. If you want to
> measure random accesses, lmbench does that kind of thing.
A little lighter here, I think... There's a natural tendency for people
to refer to things as "problems" when what they mean is "it doesn't do
what I want it to do." For them it is a problem, but there's a bit of
misplaced universality.
So, we could get back to actual discussion if "The problem with STREAM
is that ..." is replaced with "For the applications I'm interested in,
STREAM isn't representative because ..."
Greg Pfister
<not my employers' opinion>
> So, we could get back to actual discussion if "The problem with STREAM
> is that ..." is replaced with "For the applications I'm interested in,
> STREAM isn't representative because ..."
That would be fine with me. Even though I mostly do floating point
applications, I have many such applications which are either (1) cache
blocked too well, so that STREAM is pessimistic, or (2) have a random
access pattern, so that STREAM is optimistic. STREAM only does one
thing. It does it well, but everyone who thinks about it ought to
understand what it's measuring.
I suspect that lmbench already measures the kind of micro-benchmark
things that the poster was after.
-- g