
Geometric Mean or Median


Alfred A. Aburto

Aug 9, 1992, 4:10:16 PM
-------
I was thinking that the Median would be a more reliable measure of
performance since it is not as sensitive to what is happening at the
tails of a cumulative distribution of samples (whatever it may be).

For example suppose we have the following MFLOPS results for 5
programs:

Case   MFLOPS                 Arithmetic   Geometric   Median
       1, 2, 3, 4, 5          Mean         Mean
       ------------------------------------------------------
A      4, 10, 3, 6, 7             6.0          5.5        5.5
B      4, 10, 3, 6, 100          24.6          9.4        5.5

In this made up example, Case A is with the QCC Compiler V1.4, say,
and Case B is with QCC Compiler V8.2 where an optimization was applied
that effectively reduced the 5'th program to a trivial result and
thus made it an invalid test of MFLOPS (which means in Case B the 5'th
data sample should be thrown out).

Well, as you can see, the Median was least sensitive to that type
of error --- it gave the same result in both cases. I took the
median as the first crossing of the cumulative distribution through
the 50% point, with a linear interpolation between the two neighboring
points to get the 50% (Median) MFLOPS rating.
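
The numbers in the table can be reproduced with a few lines of Python;
the interpolated median below is one reading of the procedure just
described, and the exact value it yields depends on which
plotting-position convention is assumed for the cumulative
distribution (the insensitivity to the outlier holds either way):

import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def interpolated_median(xs):
    # Treat the sorted samples as the (i + 0.5)/n points of an empirical
    # cumulative distribution and interpolate it linearly at the 50% point.
    # (Other plotting-position conventions give slightly different values.)
    xs = sorted(xs)
    n = len(xs)
    p = [(i + 0.5) / n for i in range(n)]
    for i in range(1, n):
        if p[i] >= 0.5:
            return xs[i-1] + (0.5 - p[i-1]) * (xs[i] - xs[i-1]) / (p[i] - p[i-1])
    return xs[-1]

for case, data in (("A", [4, 10, 3, 6, 7]), ("B", [4, 10, 3, 6, 100])):
    print(case, arithmetic_mean(data), geometric_mean(data),
          interpolated_median(data))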

While the Geometric Mean is less sensitive to extreme data values
(I want to say 'outliars' but I don't think it is a word) than the
Arithmetic Mean, it is the Median which yields the more reliable
results.

I would recommend using the Median as a summary of data of the same
type (MFLOPS, or MIPS, SPECint, SPECflt, ..., whatever) instead of
the Arithmetic Mean, Geometric Mean, Harmonic Mean, .... etc..

Of course, I would also like to see all the numbers archived so that
I could also calculate those other measures of performance or Kurtosis
or correlation or many other things ...

Al Aburto
abu...@marlin.nosc.mil

-------


Dan Prener

Aug 9, 1992, 11:06:48 PM
In article <1992Aug9.2...@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:

>I was thinking that the Median would be a more reliable measure of
>performance since it is not as sensitive to what is happening at the
>tails of a cumulative distribution of samples (whatever it may be).

[... example omitted ...]

>While the Geometric Mean is less sensitive to extreme data values
>(I want to say 'outliars' but I don't think it is a word) than the
>Arithmetic Mean, it is the Median which yields the more reliable
>results.

The median is far too insensitive to extreme data values. Suppose
we have a reference sample, A, and on sample B the execution time
on the fastest 40% of the programs has been cut in half, compared
to sample A. Then the median would remain unchanged.


--
Dan Prener (pre...@watson.ibm.com)

Jeffrey Reilly

Aug 10, 1992, 9:22:03 PM
In article <1992Aug9.2...@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>-------
>I was thinking that the Median would be a more reliable measure of
>performance since it is not as sensitive to what is happening at the
>tails of a cumulative distribution of samples (whatever it may be).
>
>For example suppose we have the following MFLOPS results for 5
>programs:
>
>Case   MFLOPS                 Arithmetic   Geometric   Median
>       1, 2, 3, 4, 5          Mean         Mean
>       ------------------------------------------------------
>A      4, 10, 3, 6, 7             6.0          5.5        5.5
>B      4, 10, 3, 6, 100          24.6          9.4        5.5
>
>In this made up example, Case A is with the QCC Compiler V1.4, say,
>and Case B is with QCC Compiler V8.2 where an optimization was applied
>that effectively reduced the 5'th program to a trivial result and
>thus made it an invalid test of MFLOPS (which means in Case B the 5'th
>data sample should be thrown out).
>
>Well, as you can see the Median was least sensitive to that type
>of error --- it gave the same result in both cases. I took the
>median as the first crossing through the 50% point, and then a
>linear interpolation between two points to get the 50% (Median)
>MFLOPS rating.
>

Suppose the optimization to program 5 in case B was a legitimate
optimization that applied across a class of similar applications?
As a vendor, end-user, etc, I would think you would like to see this
reflected in whatever composite measure is used. I would argue that this
method appears too insensitive.

What definition of median are you using?

>While the Geometric Mean is less sensitive to extreme data values
>(I want to say 'outliars' but I don't think it is a word) than the
>Arithmetic Mean, it is the Median which yields the more reliable
>results.

What do you mean by "reliable"?

>
>I would recommend using the Median as a summary of data of the same
>type (MFLOPS, or MIPS, SPECint, SPECflt, ..., whatever) instead of
>the Arithmetic Mean, Geometric Mean, Harmonic Mean, .... etc..

Regarding current standards, the SPEC metrics (SPECint92, SPECfp92, etc.) are
DEFINED as the geometric mean of the appropriate SPEC suite's
SPECratios.

What is it they say, "with statistics anything can be proved"? :-)

Each method has its virtues (for computer performance summaries, I often reference
chapter 2 (pages 48-53 especially) of Hennessy and Patterson's Computer
Architecture text for the pros and cons of different methods). For individual
purposes, use whatever you feel is technically appropriate. For standards
groups (such as SPEC), much work and analysis is done to select the measure
that best fits (and unfortunately, you can't please everyone... :-).

John Mashey of Silicon Graphics often recommends a statistics/marketing book
when threads of this nature pop up. Hopefully, when he returns from Hot
Chips, he would be kind enough to repost the title...(or anyone else if they
recall the title).

Jeff

Jeff Reilly | "There is something fascinating about
Intel Corporation | science. One gets such wholesale returns
jwre...@mipos2.intel.com | of conjecture out of such a trifling
(408) 765 - 5909 | investment of fact" - M. Twain
Disclaimer: All opinions are my own...

Patrick F. McGehearty

Aug 11, 1992, 11:06:15 AM
In article <12...@inews.intel.com> jwre...@mipos2.UUCP (Jeffrey Reilly) writes:
>In article <1992Aug9.2...@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>>-------
>>I was thinking that the Median would be a more reliable measure of
>>performance since it is not as sensitive to what is happening at the
>>tails of a cumulative distribution of samples (whatever it may be).
... discussion omitted

>
>John Mashey of Silicon Graphics often recommends a statistics/marketing book
>when threads of this nature pop up. Hopefully, when he returns from Hot
>Chips, he would be kind enough to repost the title...(or anyone else if they
>recall the title).
>
I don't know what title John recommends, but I find "The Art of Computer
Systems Performance Analysis" worth having on my shelves.
A few particularly interesting sections are:
2.1 "Common Mistakes in Performance Evaluation"
9.3 "Common Mistakes in Benchmarking"
and
10 "The Art of Data Presentation"

The information in these sections has much practical value that was
not covered by my formal coursework. I review it from time to time
just to remind myself of pitfalls to avoid.

- Patrick McGehearty

Sam Fuller

Aug 11, 1992, 2:46:45 PM
In article <12...@inews.intel.com>, jwre...@mipos2.intel.com (Jeffrey Reilly) writes:
|> In article <1992Aug9.2...@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
|> >-------
|> >I was thinking that the Median would be a more reliable measure of
|> >performance since it is not as sensitive to what is happening at the
|> >tails of a cumulative distribution of samples (whatever it may be).
|> >
|>
|> >While the Geometric Mean is less sensitive to extreme data values
|> >(I want to say 'outliars' but I don't think it is a word) than the
|> >Arithmetic Mean, it is the Median which yields the more reliable
|> >results.
|>

Both the arithmetic mean and the median (sort the results and take the
middle value) assume that all of your units are the same. The geometric mean
does not make this assumption. I picture it as representing the performance
space as a volume and then taking the nth root of that volume as a 'geometrically'
normalized representation of the volume.

For example, the old SPECint value could be thought of as a four-dimensional box
('4-rectangle') with the extent of each dimension given by one of the four SPECratios.
If we let W = gcc1.37 ratio
          X = espresso ratio
          Y = li ratio
          Z = eqntott ratio

then W*X*Y*Z = V = volume of the SPECint performance 4-rectangle,
and the 4th root of V would represent the length of an edge of a 4-cube with the
same volume as the original 4-rectangle. The value of SPECint is in the same
range as the individual ratios, which is a nice side effect. But the important
thing to me is the notion of a performance space rather than a performance line. The
volume allows you to combine quantities with different units and still produce meaningful
results. Remember Physics 101 (ft*lb/m2). The problem with arithmetic means and medians
is that they make the assumption that all quantities are made up of the same units. I
don't think that is true for computer performance in general. Certainly, integer and
floating point performance are two different dimensions of computer performance!
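
The arithmetic behind this volume picture is short enough to write down;
the four ratios below are made up purely for illustration:

import math

# hypothetical SPECratios for the four old SPECint components
ratios = {"gcc1.37": 18.0, "espresso": 22.0, "li": 20.0, "eqntott": 25.0}

volume = math.prod(ratios.values())      # W * X * Y * Z, the 4-rectangle volume (Python 3.8+)
specint = volume ** (1.0 / len(ratios))  # edge length of the equal-volume 4-cube
print(round(specint, 1))                 # lands in the same range as the inputs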

Sam
--
=================================================================
Sam Fuller sa...@perform.dell.com
Advanced Systems Development (512) 338-8436
Dell Computer Corporation 9505 Arboretum Blvd, Austin, TX 78759
=================================================================

Alfred A. Aburto

Aug 11, 1992, 9:26:20 PM

-------
Ah, that was a good point! I was not thinking of the Median as being
TOO insensitive to extreme data values. I was thinking more in terms of
the reliability of the measure of performance, from the consumer
angle --- protecting the measure of performance from extreme outlying
data points (as in the example I cited, which I thought was important,
but instead I find people telling me the Median is TOO insensitive, of
all things). I was somewhat surprised by this, as it was 180 degrees
out of phase from my thinking and from the response I was expecting
(boy, did I fall out of the turnip truck yesterday or what <grin>).

Despite the correctness of what you say, I still stick with the Median.

I cited one example where one of 5 MFLOPS estimates was an extreme
value (actually erroneous) and showed the Median was insensitive to
that fact (did not change) compared to the Arithmetic Mean (which
increased greatly) and Geometric Mean (which increased by almost a
factor of 2). I thought this was a good example illustrating why one
should use the Median. There are real-world examples too, as Reinhold
Weicker indicated in msg #2160 of comp.benchmarks (relative to the
old 030.matrix300 program in the SPEC 1989 suite --- use of the Median
would have helped avoid problems --- of course attempting to be a
careful analyst one would have caught that error anyway and removed
the program, as in fact was actually done).

You cited another example where the two highest MFLOPS numbers in the
set of 5 increased by a factor of two. In this case also, the Median
did not change. You claim therefore that the Median is TOO insensitive.
I disagree. The reason is that, in your example, 60% of the MFLOPS
numbers did not change at all. In your example the Geometric Mean would
be biased upward even though 60% of the data did not change. So with an
eye toward what the consumer might really get (3 out of 5 samples show
no improvement) I would still stick with the Median value even in this
case --- I would not want to see the number 'biased' upward by the
Geometric mean in this case (by 2 out of 5 samples). In my view, being
cautious and attempting to be fair, the 3 out of 5 counts more than the
2 out of 5 and I would go with the Median. I would definitely stick with
the Median in the example I gave in the previous paragraph.

There are two views really. On the one hand there is the Vendor and on
the other there is the consumer. If we force the Vendor to use the Median
then it will be much more difficult for the Vendor to demonstrate
performance improvement by tweaking one (or a few) programs. This is
exactly what happened in the two examples given above --- the Median did
not change. However, I hope I have illustrated why, from the viewpoint of
the consumer, it is really not a bad idea to use the Median.

I'm not strongly against the use of the Geometric Mean (I have advocated
its use too for a long time). I'm just trying to figure out which of
the two measures of performance, Geometric Mean or Median, best
characterizes all the data samples. I think, via the examples put forth,
that it is the Median.

Al Aburto
abu...@marlin.nosc.mil

-------


Patrick F. McGehearty

Aug 12, 1992, 12:40:39 PM
In article <1992Aug11.1...@news.eng.convex.com> pat...@convex.COM (Patrick F. McGehearty) writes:
>I don't know what title John recommends, but I find "The Art of Computer
>Systems Performance Analysis" worth having on my shelves.

I neglected to mention that Raj Jain is the author of "The Art of Computer
Systems Performance Analysis", and it is available from Wiley Press.

I'm told John Mashey likes to recommend "The Visual Display of Quantitative
Information" by Edmund Tufte, and "How to Lie with Statistics" by Huff.
I am not familiar with those books, but the titles sure sound relevant
to performance work.

Eugene N. Miya

Aug 12, 1992, 1:22:09 PM
In article <1992Aug12....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>Ah, that was a good point! I was not thinking of the Median as being
>TOO insensitive to extreme data values. I was thinking more from the
>sense of reliability of the measure of performance, from the consumer

Well this is why a person should NOT
1) rely on one number like a mean or a median
and should
2) consider other statistics like variance, or the mean (arithmetic,
harmonic, or geometric) of a completely different measure.

Thanks for the defense while I was on vacation. Now if I can get some time
to work on the FAQ.....

--eugene miya, NASA Ames Research Center, eug...@orville.nas.nasa.gov
Resident Cynic, Rock of Ages Home for Retired Hackers
{uunet,mailrus,other gateways}!ames!eugene
Second Favorite email message: Returned mail: Cannot send message for 3 days
A Ref: Mathematics and Plausible Reasoning, vol. 1, G. Polya

Alfred A. Aburto

Aug 13, 1992, 10:00:25 PM
In article <12...@inews.intel.com> jwre...@mipos2.UUCP (Jeffrey Reilly) writes:

>Suppose the optimization to program 5 in case B was a legitimate
>optimization that applied across a class of similar applications?
>As a vendor, end-user, etc, I would think you would like to see this
>reflected in whatever composite measure is used. I would argue that this
>method appears too insensitive.

This is a good question too. The thing is, we want the overall measure
of performance to be more responsive or representative of the bulk of
the data. We don't want the overall measure of performance to be biased
by one or a few extreme outlying data points because they are not
representative of the bulk of the data; they are low-probability events
(correct or incorrect). You know all this, but that's
the reason. I believe the Median does this, but it looks like I have
some explaining to do about the definition of the Median.

The next thing we want to do is define some parameters that describe
(indicate, or give us an idea) what is happening at the tails of the
distribution (cumulative distribution) because, as you indicate, this
is where the really interesting things are happening. I would propose
using, in addition to the Median, the Minimum and Maximum values, or
the 10% and 90% points of the cumulative distribution, for example.
Three numbers is about 'right' I think. If we give a lot of different
measures of performance, then people will pick from that set the one or
two that best fit their case, not everyone will pick the same measures,
and things can get confused. We don't want to prevent people from
learning more though, so in addition we make all the raw data available
so people can analyze it every which way, and this is 'good' as some
interesting things are likely to be learned.
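
A rough sketch of the kind of three-number (or five-number) summary
proposed here, on a made-up set of per-program MFLOPS results (the
percentile method is numpy's default and is only one of several
reasonable choices):

import numpy as np

mflops = np.array([4.0, 10.0, 3.0, 6.0, 7.0, 5.0, 8.0, 12.0, 6.5, 9.0])

print("min    ", mflops.min())
print("10%    ", np.percentile(mflops, 10))   # lower-tail indicator
print("median ", np.percentile(mflops, 50))
print("90%    ", np.percentile(mflops, 90))   # upper-tail indicator
print("max    ", mflops.max())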

To arrive at the above in a meaningful way we'll need 'good' validated
data sets consisting of a sufficient number of samples (program results,
routine results, kernel results, module results, ..., whatever
constitutes a data sample). We want to pick enough samples so at least
the cumulative distribution is relatively smooth (no big jumps in the
distribution). I don't know how many samples are required, but I think
that 10 isn't enough. Maybe 25 or 50 might be enough, but we'd need to
figure that out, and it can be done. If the fluctuation is too big we
might consider smoothing (filtering) the data. There are many
possibilities for handling these problems (a simple binomial filter or
Savitzky-Golay filter, or ...), but we need a sufficient number of
validated samples first.
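
A minimal example of the sort of binomial (1-2-1) smoothing mentioned,
applied to the sorted samples of an empirical distribution; the
weights, number of passes, and end handling are illustrative choices,
not anything prescribed here:

import numpy as np

def binomial_smooth(y, passes=1):
    # Repeated 1-2-1 (quarter, half, quarter) smoothing; endpoints left alone.
    y = np.asarray(y, dtype=float)
    for _ in range(passes):
        s = y.copy()
        s[1:-1] = 0.25 * y[:-2] + 0.5 * y[1:-1] + 0.25 * y[2:]
        y = s
    return y

# example: smooth the sorted sample values (the empirical quantile curve)
samples = np.sort(np.random.lognormal(mean=2.0, sigma=0.5, size=50))
print(binomial_smooth(samples, passes=3)[:5])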

>What definition of median are you using?

Here is where we've had a problem (as it finally dawned on me). The
median is defined as the 50% point of the cumulative distribution of
samples. This is the definition I have been using, and it is the
preferred way of deriving the Median from a set of samples. With
small data samples it is subject to error as are all the other
parameters. If the distribution of samples is not highly skewed you
will find that the Median and Arithmetic Mean are very close or
identical. See Papoulis for example: "Probability, Random Variables,
and Stochastic Processes".

>What do you mean by "reliable"?

It is a measure of performance that is more representative of the
main bulk of the data and is not easily biased by one or a few
low-probability-of-occurrence events (data samples). The Mode might
be even more useful and it can be inferred from the Arithmetic Mean
and Median. Perhaps all these problems might vanish if the sample
size were increased so that we'd get a better picture of the true
distribution.

>Regarding current standards, the SPEC metrics (SPECint92, SPECfp92, etc)
>are DEFINED as being the geometric mean of the appropriate SPEC suites
>SPECratios.

Yes, and I think the Median might turn out to be more 'reliable', but
also we need more data samples to better estimate the true sample
distribution. Also we might want to consider the confidence of our
estimates, particularly since the goal of deriving measures of
performance is to compare two or more of them.

>What is it they say, "with statistics anything can be proved"? :-)

Yes, numbers can tell lies and we must be careful.
Statistics itself is really OK. The problems happen when we try to make
claims with ill-formed data sets --- with samples that do not adequately
describe the true distribution, one can certainly derive all sorts of
wild and meaningless results (all lies).

[Comments about references left out]

>Jeff Reilly | "There is something fascinating about
>Intel Corporation | science. One gets such wholesale returns
>jwre...@mipos2.intel.com | of conjecture out of such a trifling
>(408) 765 - 5909 | investment of fact" - M. Twain
>Disclaimer: All opinions are my own...


Al Aburto
abu...@marlin.nosc.mil
-------

spencer shafer

Aug 14, 1992, 10:21:26 AM

Having joined this in midstream, I risk repeating some information. If
so, my apologies in advance. A discussion of this, and an offered proof
of the geometric mean as preferred method is in the March 1986 issue of
Communications of the ACM, "How Not to Lie With Statistics: The Correct
Way to Summarize Benchmark Results," by Fleming and Wallace.

DrLove
--
/\__________________________________________________________________/\
|| SPENCER SHAFER ||
|| email:sha...@handel.cs.colostate.edu Work Phone: 303-498-0901 ||
\/------------------------------------------------------------------\/

Hugh LaMaster

Aug 14, 1992, 11:58:57 AM
In article <Aug14.142...@yuma.ACNS.ColoState.EDU>, sha...@CS.ColoState.EDU (spencer shafer) writes:
|>
|>
|> Having joined this in midstream, I risk repeating some information. If
|> so, my apologies in advance. A discussion of this, and an offered proof
|> of the geometric mean as preferred method is in the March 1986 issue of
|> Communications of the ACM, "How Not to Lie With Statistics: The Correct
|> Way to Summarize Benchmark Results," by Fleming and Wallace.

Yes, and there was a rebuttal to this "proof" in CACM by, I believe,
J.E. Smith, in October of 1988. {If I have the reference correct,}
it is proved that the harmonic mean is the correct measure of rates,
if you want to examine a fixed workload and characterize the performance
on that workload. Fleming and Wallace missed the point, IMHO, of a lot
of things, including Amdahl's law. Always use the harmonic mean of rates
unless you have a specific reason not to.

I was glad to see that Gordon Bell provided the "LFK(hm)" column in the
performance summary of his recent article in CACM, since this is the
harmonic mean of an untuned workload of Livermore Loops. This is the one
of the many possible measures, which I find most meaningful when comparing
raw floating point performance. It tends to be pessimistic with respect
to tuned rates on vector machines by perhaps a factor of 3 or 4, but it
is rather realistic when looking at untuned rates.

("raw" -- that is, a first cut, before you do some "real" benchmarking :-)

--
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035-1000 Or: lama...@george.arc.nasa.gov
Phone: 415/604-1056 #include <usenet/std_disclaimer.h>

David Hinds

Aug 14, 1992, 12:52:47 PM
In article <1992Aug14....@riacs.edu> lama...@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>In article <Aug14.142...@yuma.ACNS.ColoState.EDU>, sha...@CS.ColoState.EDU (spencer shafer) writes:
>|>
>|>
>|> Having joined this in midstream, I risk repeating some information. If
>|> so, my apologies in advance. A discussion of this, and an offered proof
>|> of the geometric mean as preferred method is in the March 1986 issue of
>|> Communications of the ACM, "How Not to Lie With Statistics: The Correct
>|> Way to Summarize Benchmark Results," by Fleming and Wallace.

Has anyone tried to take all the available SPEC numbers, and do a factor
analysis, to see if there is a statistically meaningful small set of
numbers that can be used to predict the performance on all the tests?
One would hope that the factors that fell out would naturally fit
different architectural parameters -- scalar integer speed, scalar
floating point speed, vector performance, etc. You would also get the
weights of each factor for each SPEC test, and users could estimate
performance of their own codes on new machines, by running on a few old
machines, and using the published SPEC factors for those machines to
calculate the weights for their codes.
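
No such analysis is attempted in this thread, but the flavor of the
idea can be sketched with a principal-component decomposition of a
machines-by-benchmarks matrix of SPECratios (the matrix below is random
placeholder data, and a real factor analysis would add rotation and
model selection):

import numpy as np

# rows = machines, columns = SPEC benchmarks; placeholder data only
rng = np.random.default_rng(0)
ratios = rng.lognormal(mean=3.0, sigma=0.3, size=(20, 10))

X = np.log(ratios)                      # work in log space for ratio data
X -= X.mean(axis=0)                     # center each benchmark column
U, s, Vt = np.linalg.svd(X, full_matrices=False)

explained = s**2 / np.sum(s**2)
print("variance explained by each factor:", np.round(explained, 3))
# Vt[:k] gives the per-benchmark weights of the first k factors;
# U[:, :k] * s[:k] gives each machine's score on those factors.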

- David Hinds
dhi...@allegro.stanford.edu

Charles Grassl

Aug 14, 1992, 4:12:45 PM
In article <1992Aug14....@riacs.edu>, lama...@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>In article <Aug14.142...@yuma.ACNS.ColoState.EDU>, sha...@CS.ColoState.EDU (spencer shafer) writes:
>|>
>|> A discussion of this, and an offered proof
>|> of the geometric mean as preferred method is in the March 1986 issue of
>|> Communications of the ACM, "How Not to Lie With Statistics: The Correct
>|> Way to Summarize Benchmark Results," by Fleming and Wallace.
>
>Yes, and there was a rebuttal to this "proof" in CACM by, I believe,
>J.E. Smith, in October of 1988. {If I have the reference correct,}
>it is proved that the harmonic mean is the correct measure of rates,
>if you want to examine a fixed workload and characterize the performance
>on that workload.

The references are below:

[FL,WA]  Fleming, P.J., and Wallace, J.J., "How Not to Lie With
         Statistics: The Correct Way to Summarize Benchmark Results",
         Communications of the ACM, Vol. 29, No. 3, pp. 218-221,
         March 1986.

[SM]     Smith, J.E., "Characterizing Computer Performance With a
         Single Number", Communications of the ACM, Vol. 31, No. 10,
         pp. 1202-1206, October 1988.

[GU]     Gustafson, J., et al., "SLALOM", Supercomputing Review,
         pp. 52-59, July 1991.

In [FL,WA], Fleming and Wallace advocate the use of a geometric mean
for characterizing computer performance based on benchmarks. In [SM],
Smith advocates the use of a harmonic mean, though he states that "the
most obvious single number performance measure is the total time". The
total (elapsed) time is not only accurate, but has considerable
intuitive appeal.

Neither of the articles, [FL, WAL] or [SM], offer "proofs" in the
mathematical sense. (If Smith's "proof" is correct, then is Fleming's
and Wallace's "proof" incorrect?) Why do two articles advocate
different metrics? The answer lies in the underlying assumptions in
each article.

Fleming and Wallace stress that the geometric mean only applies to
normalized performance results. The assumption that individual results
are normalized leads to the use of the geometric mean. Smith, in his
article, assumes that "work" is measured by floating point operations
and that these operations are all equivalent pieces of the workload.
This assumption leads to the use of a harmonic mean.

The two articles have distinct and different assumptions:
1. Results normalized to a specific machine [FL,WA]
2. Work is measured by floating point operations [SM]

Some benchmarks fit assumption (1) above. Some benchmarks fit
assumption (2) above. Some benchmarks do not fit either assumption.

Not all benchmark tests, especially those with a broad range of
performance characteristics, have realistic machines to normalize
against. For example, a VAX 11/780, which is used for normalization of
the original SPEC benchmarks, is not appropriate for normalizing
performance of large floating point simulations. We might ask, is the
VAX 11/780 reasonable for calibrating RISC workstations?

Not all computer "work" is measured by the number of floating point or
integer operations. For example, the SLALOM benchmark [GU] does not
have an accurate operation count. The authors of this benchmark do not
count the number of floating point operations performed, rather, speed
is measured by the number of "patches" covered in one minute of
computation. Different algorithms have different numbers of
operations, but as long as the same number of patches are computed in
one minute, the speed is judged to be the same.

It is the situation, or the constraints, of a particular benchmark which
dictates the proper summarizing statistic. The table below lists the
interpretation of various means. (Note that the harmonic mean usually
referred to is a -uniform- harmonic mean. Smith, in his
article [SM], emphasizes the use of weighted harmonic means.)

Geometric mean:             A measure of the distance in "performance
                            space" from the reference machine to the
                            tested machine.

(Uniform) harmonic mean:    The average performance if all benchmarks
                            were adjusted so that each performed the
                            same number of floating point operations.

(Uniform) arithmetic mean:  The average performance if all benchmarks
                            were adjusted so that each ran for the
                            same amount of time.
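
For concreteness, the three summaries in the table can be computed from
a set of per-benchmark speed ratios roughly as follows (the ratios and
the work weights below are invented):

import numpy as np

# made-up per-benchmark speed ratios relative to a reference machine
ratios = np.array([12.0, 35.0, 18.0, 22.0])

geometric  = np.exp(np.mean(np.log(ratios)))
arithmetic = np.mean(ratios)
harmonic   = len(ratios) / np.sum(1.0 / ratios)

# weighted harmonic mean, weights = fraction of total work in each benchmark
w = np.array([0.4, 0.1, 0.3, 0.2])
weighted_harmonic = 1.0 / np.sum(w / ratios)

print(geometric, arithmetic, harmonic, weighted_harmonic)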

Charles Grassl
Cray Research, Inc.
Eagan, Minnesota USA

Eugene N. Miya

Aug 20, 1992, 12:03:52 PM
>A discussion of this, and an offered proof
>of the geometric mean as preferred method is in the March 1986 issue of
>Communications of the ACM, "How Not to Lie With Statistics: The Correct
>Way to Summarize Benchmark Results," by Fleming and Wallace.

Personally, I think the "proof" is weak. Worlton has an earlier, less
frequently cited paper saying the harmonic mean is the way to go. The above
authors didn't do enough literature checking. If the arithmetic, the geometric,
and the harmonic mean are all suspect, then don't trust any of them.

--eugene miya, NASA Ames Research Center, eug...@orville.nas.nasa.gov
Resident Cynic, Rock of Ages Home for Retired Hackers
{uunet,mailrus,other gateways}!ames!eugene
Second Favorite email message: Returned mail: Cannot send message for 3 days

Ref: J. Worlton, Benchmarkology, email for additional ref.

Alfred A. Aburto

Aug 23, 1992, 7:43:09 AM

-------
The thing we need to realize is that 'benchmark' results are random
numbers :-)

If we keep in mind that they are really random numbers, then this might
help avoid some of the typical problems that occur when comparing
'benchmark' results. It may help us realize that we can not take two
isolated results and compare them since the measurements, and the
measures of performance (means, medians, ..., etc.) are noisy and
subject to a generally unknown error. This is true of the SPEC suite
just as much as it is true for Dhrystone, Whetstone, Linpack and all
the rest. The SPEC results, it would appear, are more reliable because
they consist of the geometric mean of a number of different results.
However, even the SPEC results will vary, as will any measure of
performance, when the underlying test parameters change. See SunWorld
magazine, Mar 1992, pg 48, "SPEC: Odyssey of a benchmark", where
geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
measured on the same machine using different compilers.

Well, I agree, since all our measures of performance are error prone,
we must be skeptical when comparing any isolated raw results. And we
must be careful when comparing mean results without some indication of
the magnitude of the error involved.

I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
understand just how 'bad' Dhrystone was several years ago (it seems)
by correlating Dhrystone 1.1 results with SPECint89 results. I had
20 or so data points to work with. I thought perhaps there would be
little correlation and that Dhrystone really needed to be cast out
as a measure of performance, because of the greater confidence placed
in the SPEC results. Instead, they were highly correlated (0.92
correlation or so). Well, I had to revise my thinking. When I
plotted the results it was clear they were well correlated. There
were only 2 rather large and obvious differences between the
Dhrystone 1.1 and SPECint89 results. Dhrystone 1.1 showed a couple of
big 'spikes' in performance that were not present in the SPECint89
results (for an HP 68040 and i860 I believe). This result indicated
Dhrystone 1.1 was probably ok, but one needed a variety of results
(different systems, compilers, etc) to help filter out those 'spikes'
which appeared unrepresentative of general integer performance as compared
to SPECint89. Also it showed that it was necessary to base performance
on a number of different test programs (as in SPEC) instead of just one
program.

Even using a number of different test programs might not help make
things more definite.

This was really 'brought home to me' recently by the posting and email
from Frank McMahon regarding the Livermore Loops MFLOPS results. The
Livermore Loops consists of 72 calculation loops typically found in
'big' programs. However, the MFLOPS results can show a huge variation
in performance with the very fast machines. Frank indicated that the
NEC SX-3 supercomputer showed a variation in performance from 3 MFLOPS
all the way to 3400 MFLOPS. The results were Poisson distributed. The
standard deviation was 500 MFLOPS. One wonders about the meaning of
any one or more measures of performance in this case. It is almost as
if it is necessary to look at individual cases very carefully instead
of some mean or median result. That is, I would not want to buy a
NEC SX-3 supercomputer if it turned out most of the work I'd be doing
corresponded to the low end performance of that system where a fast
68040 or 80486 might do just as well. Whew, what a nasty situation ---
the spread in performance is just too big.

Al Aburto
abu...@marlin.nosc.mil
-------


Clark L. Coleman

Aug 26, 1992, 12:02:40 PM
In article <1992Aug23....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>However, even the SPEC results will vary, as will any measure of
>performance, when the underlying test parameters change. See SunWorld
>magazine, Mar 1992, pg 48, "SPEC: Odyssey of a benchmark", where
>geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
>measured on the same machine using different compilers.

This is given as an example of "noisiness" or "error proneness" in SPEC,
but SPEC was intended to measure the SYSTEM, including the compilers. If
one vendor has better compilers than another, it matters to the end users.
Conversely, trying to benchmark the raw hardware in some way that filters
out compiler differences would not be interesting to people who have to
purchase the systems and use the compilers, not just the hardware.

The only real question in my mind about SPEC's approach is allowing vendors
to use different compiler switch settings for each individual benchmark,
if it produces better numbers. I don't think many users can compile every
program that many different times and run timing tests on each one.

>I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
>understand just how 'bad' Dhrystone was several years ago (it seems)
>by correlating Dhrystone 1.1 results with SPECint89 results. I had
>20 or so data points to work with. I thought perhaps there would be
>little correlation and that Dhrystone really needed to be cast out
>as a measure of performance, because of the greater confidence placed
>in the SPEC results. Instead, they were highly correlated (0.92
>correlation or so). Well, I had to revise my thinking.

This kind of bogus correlation was debunked long ago by SPEC. As soon as
more data points are added, the correlation gets worse. Try adding the
SparcStation 10 numbers to your test, for example. For that matter, just
look at the HP 9000/710 versus the HP 9000/720. The only differences in
the hardware are the larger caches on the 720. Since the 710 has large
enough caches for the Dhrystone code, but not for some SPECint codes, it
produces the same Dhrystones as the 720 but significantly lower SPECint.

Similar poor correlations will be obtained for two different systems
with very different cache sizes. Compare the HP9000/720 to a smaller
cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
Spring, 1991, numbers:

                  SPECint89   Dhrystone 1.1 MIPS   MIPS/SPECint89
                  ---------   ------------------   --------------
HP 9000/720          39.0              57                1.46
DEC 5000/200         19.0              24.2              1.27
IBM RS6000/550       34.5              56                1.62

If I didn't have SPECint89 numbers, but wanted to derive them from available
Dhrystone MIPS numbers, the third column above would indicate that I have
a tough job ahead of me.

I would expect there would be some reasonable relationship between any two
integer benchmarks. That doesn't make one a good substitute for the other,
however. I think there are specific reasons why Dhrystone is not as good
at predicting integer performance on my real applications as SPECint is.
They have been listed in this group before. And, as I pointed out a few
months ago, superscalar machines change the results radically. The Dhrystone
code has less instruction to instruction data dependence than normal code,
as it was designed in a scalar processor era to execute a certain group
of instructions according to a measured frequency. So, for example, you
might see statements like:

a = b + c; /* do an addition so we can have so many additions in the loop */
x = y - z; /* similarly, do a subtraction */

There is no data dependence, so a superscalar machine can just do them
simultaneously. This happens less frequently in real code than in Dhrystone,
so SPECint92 and Dhrystone 1.1 are destined to diverge drastically in the
near future. For example:


                    SPECint92   Dhrystone MIPS   MIPS/SPECint92
                    ---------   --------------   --------------
Sun IPX                21.8          28.5             1.31
Sun SS10 (36MHz)       44.2          86.1             1.95

So, does moving from an IPX to a 36MHz SS10 improve your integer performance
by 103%, or by 202% ? Before I buy, I need to know --- there's a pretty big
difference. When I move from an HP 9000/710 to an HP 9000/730, does my
integer performance improve by 35% (Dhrystone 1.1) or by 51% (SPECint92) ?
When I move from the 710 to the 720, does my integer performance stay the
same (Dhrystone 1.1 --- so why spend any extra money on the 720?) or does it
improve by about 13% (SPECint92) ?

Let's put poor old Dhrystone to rest. It served its purpose in a
scalar CPU world of similar CISC architectures with similar sized
caches. In a world of widely differing approaches to cache design,
with different degrees of superscalarity in the CPU, it is more
misleading than useful. In order to make a synthetic benchmark mimic
the data dependencies, instruction frequencies, basic block sizes,
branch prediction accuracy, etc. --- all of which are important
parameters in modern CPU design, and which do differ across the
various vendors' approaches --- AND make the synthetic code have the
same cache hit/miss ratio as real code --- would take herculean
effort. And if you succeeded, how would it be better than the real
codes in SPECint92?

Synthetic benchmarks are dead outside of marketing departments and
ill-informed buyers of equipment.


--
-----------------------------------------------------------------------------
"It is seldom that any liberty is lost all at once." David Hume
||| cl...@virginia.edu (Clark L. Coleman)

Alfred A. Aburto

Aug 29, 1992, 10:24:40 PM
In article <1992Aug26.1...@murdoch.acc.Virginia.EDU> cl...@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
>In article <1992Aug23....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:


>>However, even the SPEC results will vary, as will any measure of
>>performance, when the underlying test parameters change. See SunWorld
>>magazine, Mar 1992, pg 48, "SPEC: Odyssey of a benchmark", where
>>geometric mean SPECmark ratings of 17.8, 20.0, 20.8, and 25.0 were
>>measured on the same machine using different compilers.

>This is given as an example of "noisiness" or "error proneness" in SPEC,
>but SPEC was intended to measure the SYSTEM, including the compilers. If
>one vendor has better compilers than another, it matters to the end users.
>Conversely, trying to benchmark the raw hardware in some way that filters
>out compiler differences would not be interesting to people who have to
>purchase the systems and use the compilers, not just the hardware.

It was given as an example of the need to show (indicate) the 'spread' in
the results. There is not just ONE result; there are numerous results,
depending upon many parameters that are difficult to control. The vendor's
system may get one result (25.0), but the user's system may get an entirely
different result. In the SunWorld article, even after a lot of trouble,
they were unable to duplicate the vendor's result (25.0). They finally
settled on 20.8 as the best they could do and left it at that. I'm saying
that unnecessary troubles may have been avoided if the vendor had said
instead something like: "this system has a rating of 21.0 +/- 4.0, and
you'll achieve peak performance of approximately 25.0 by use of this
compiler with these options, and 17.0 with this other compiler with
these options." Or some simple statement such as that, using perhaps
more appropriately the Maximum and Minimum results instead of the
standard deviation or RMS error. It would have avoided unnecessary
problems and been more informative overall. I don't want to hide any
information at all. I'm trying to say that we need to bring out more
information in the hope that it will avoid the type of problems
discussed in the SunWorld article. I'm not claiming to know exactly how
to rectify this situation. I'm just saying there appears to be a need to
do it.

Of course the 'spread' (or 'error' if you will) in performance due to
different compilers and compiler options is only one aspect of the
problem. Different types of programs produce different results and there
is a spread in performance there too. Program size and memory usage, main
memory speed, cache type, cache size, ..., etc. all produce a spread in
performance. The overall spread is considerable, and this is why system
testing is so difficult.

With regard to 'filtering', I was thinking of the need to 'filter' the
extreme data points. One learns about these extreme values by having
other similar _program_ results for comparison. This 'filtering' aspect
was not intended so much for a particular compiler result, but for a
particular program result. If program 'A' produces an order of magnitude
'better' result than 9 other _similar_ programs for a particular system
(compiler included) then I feel a need to do something about program 'A'.
At least I'd become very interested in program 'A' and try to figure
out why it produces results so different from the other programs.
Filtering is an option when a few of the system results for program
'A' show extreme outliers compared to other program results on the
same system (compiler included as part of the system). It's just an
option, as one might not want to throw all the program 'A' results out
due to a few 'abnormal' results (due to extreme optimization, for
example, on 1 out of M programs and on a few out of N systems). It's just
an option, and there are certainly many cases where one would not want
to filter at all. It depends on the data.

>The only real question in my mind about SPEC's approach is allowing
>vendors to use different compiler switch settings for each individual
>benchmark, if it produces better numbers. I don't think many users can
>compile every program that many different times and run timing tests on
>each one.

This is a good point.
On the other hand, one cannot fault vendors for trying to achieve the
optimum performance in each individual case, but unfortunately, as you
say, this makes it tough on the users trying to figure out what are the
best options to use for their own particular programs.

[more to follow]

Al Aburto
abu...@marlin.nosc.mil

-------

Alfred A. Aburto

Aug 30, 1992, 8:23:56 PM

In article <1992Aug26.1...@murdoch.acc.Virginia.EDU> cl...@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:

In article <1992Aug23....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:

>>I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
>>understand just how 'bad' Dhrystone was several years ago (it seems)
>>by correlating Dhrystone 1.1 results with SPECint89 results. I had
>>20 or so data points to work with. I thought perhaps there would be
>>little correlation and that Dhrystone really needed to be cast out
>>as a measure of performance, because of the greater confidence placed
>>in the SPEC results. Instead, they were highly correlated (0.92
>>correlation or so). Well, I had to revise my thinking.

>This kind of bogus correlation was debunked long ago by SPEC. As soon as
>more data points are added, the correlation gets worse. Try adding the
>SparcStation 10 numbers to your test, for example.

I didn't know SPEC had done that. Wish I had been info'd on the results.
I'm not surprised though, but I'm curious now to see what they did.
Actually the results (20 different systems I think) were fairly
representative of various systems available, so I'm curious to see in
what manner the correlation broke down.

One of the problems with 'benchmarking' is the lack of good,
well-documented databases to work with.

>For that matter, just look at the HP 9000/710 versus the HP 9000/720.
>The only differences in the hardware are the larger caches on the 720.
>Since the 710 has large enough caches for the Dhrystone code, but not
>for some SPECint codes, it produces the same Dhrystones as the 720 but
>significantly lower SPECint.

The issue here is cache size. We know that cache size is an important
factor in performance relative to a program's size (or cache utilization
size). Dhrystone is a small program and hence produces similar results,
as you say, in small caches as in big caches. Dhrystone is not adequate
to gain an understanding of performance trade-offs relative to cache
size. Other programs of varying size are needed to understand the
'spread' in performance due to cache size relative to program size. We
need to understand the limitations of our test programs and use them
appropriately. It is far (far) from the mark to think that Dhrystone
is the only test program one should use.

SPECint has problems here as well, because there are plenty of 'small'
programs available that will fit in the HP 9000/710 cache which will
perform just as well as on the HP 9000/720. Yet, as you indicated, the
SPECint results do not reflect this fact. There are reasons HP built the
HP 710 and 720. Lower cost might be one of them (I don't know really).
Perhaps also HP felt that there was a segment of users who would be just
as happy with the smaller cache in the 710. They really didn't need a
larger cache. They would take a hit on performance sometimes with their
larger programs (SPECint type result), but in general the smaller cache
machine was adequate for their purposes (Dhrystone type result).

>Similar poor correlations will be obtained for two different systems
>with very different cache sizes. Compare the HP9000/720 to a smaller
>cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
>Spring, 1991, numbers:
>
> SPECint89 Dhrystone 1.1 MIPS MIPS/SPECint89
> --------- ------------------ --------------
>HP 9000/720 39.0 57 1.46
>DEC 5000/200 19.0 24.2 1.27
>IBM RS6000/550 34.5 56 1.62
>
>If I didn't have SPECint89 numbers, but wanted to derive them from
>available Dhrystone MIPS numbers, the third column above would indicate
>that I have a tough job ahead of me.


But they ARE correlated! You can see it just by looking at the
SPECint89 and Dhrystone 1.1 numbers. It is incorrect to use the third
column (above) to make any predictions or draw conclusions, as it
consists of ratios of the raw data (program, 'benchmark', results).
I'll explain below.

I sorted the numbers in decreasing order and I added in the nsieve MIPS
results (see the table below). Forget about the individual magnitudes
because the scaling in each program is different. But look at the
numbers. They track one another. The step size from one result to the
next is different but overall the results are tracking fairly well. The
HP 720 ranks highest for all three program results. The DEC 200 ranks
lowest for all three program results. The IBM 550 ranks second in all
three program results. They are all telling the same story and they are
correlated. To check this qualitative correlation I also calculated the
mathematical linear correlation coefficient and the result shows that
they are all highly correlated. Correlation coefficients: SPECint89 to
Dhyrstone1.1 = 0.982, SPECint89 to nsieve = 0.999, and Dhrystone1.1 to
nsieve = 0.988.

                  SPECint89   Dhrystone 1.1 MIPS   nsieve MIPS
                  ---------   ------------------   -----------
HP 9000/720          39.0              57               50.2
IBM RS6000/550       34.5              56               43.8
DEC 5000/200         19.0              24.2             17.0

The details though are different. They are different because there is
error in all those measurements. The compilers are not the same. The
compiler options are not the same. The programs and what they do are
all different. Cache size is a factor too. The SPECint89 results are
a geometric mean of 4 programs, while the Dhrystone and nsieve figures
each come from a single program (and are thus more susceptible to error). In view of these
errors, it is amazing to me that the results are correlated at all!
But they are most definitely well correlated.

The fact that they are highly correlated doesn't mean you can pick numbers
out of the raw program results above and start making comparisons or
predictions. It just won't work, because there are unaccounted-for errors
in each number and between the different programs. Even worse is to
take ratios like MIPS/SPECint89. If there is error in the MIPS
result and error in the SPECint89 results then the fractional error
after the division is even worse than the fractional error in the
original numbers. For example (40 +/- 6) / (20 +/- 3) = 2 +/- 0.6
(approximately). The fractional error in the original numbers is
0.15 (15%) but it has doubled to 0.30 (30%) after the division. So
you see that the ratio is an even less reliable number to use for
comparison or prediction purposes, particularly because you were
using raw data (program or 'benchmark' results) whose error bounds you
don't even know. If you did have the error bounds for the
ratios, then you might have realized that you really could draw no
conclusion at all, and this is another reason why I think we need to
start understanding the errors in our measurements. It will help us
avoid drawing incorrect conclusions.
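
The 2 +/- 0.6 figure follows from the usual first-order propagation
rule for a quotient, in which the fractional errors add (this linear
addition is the conservative, worst-case form; independent random
errors are often combined in quadrature instead):

a, da = 40.0, 6.0           # numerator and its error
b, db = 20.0, 3.0           # denominator and its error

r = a / b                   # 2.0
frac_err = da / a + db / b  # 0.15 + 0.15 = 0.30
dr = r * frac_err           # 2.0 * 0.30 = 0.6
print(r, "+/-", dr)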

Taking the ratio, MIPS/SPECint89, destroyed the correlation and led you
to draw an erroneous conclusion about your 6 data samples. I noticed
others using this type of ratio, but it is simply not correct to
do so.

The correct procedure is to take the data samples (benchmark results
which have random errors) and do a correlation. A linear correlation
worked well here, and the linear correlation between nsieve MIPS and
SPECint89 was quite strong at 0.999, so we'll go with that. Now we can
do a linear least-squares fit to derive a linear relationship between
the nsieve MIPS and SPECint89 samples we had to work with. We find the
following:

SPECint89 = 8.806 + 0.595 * nsieveMIPS.


                  SPECint89 Predicted   SPECint89   Error
                  from nsieve           Measured
HP 9000/720             38.7               39        -0.3
IBM RS6000/550          34.9               34.5      +0.4
DEC 5000/200            18.9               19        -0.1

Pretty interesting. Also note that I used the best (peak) values for
the nsieve numbers. This seemed OK, since people tend to report peak
values for benchmark results anyway.

We can do the same thing for the Dhrystone and SPECint89 numbers:

SPECint89 = 5.571 + 0.5524 * Dhrystone1.1MIPS.


                  SPECint89 Predicted    SPECint89   Error
                  from Dhrystone 1.1     Measured
HP 9000/720             37.1                39        -1.9
IBM RS6000/550          36.5                34.5      +2.0
DEC 5000/200            18.9                19        -0.1

Not as good as nsieve, but still not bad as the error is less than 6%.
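
The correlations and fitted lines quoted above can be reproduced from
the three-machine table with a few lines of numpy; small differences
from the quoted coefficients are just rounding:

import numpy as np

specint89 = np.array([39.0, 34.5, 19.0])   # HP 720, IBM 550, DEC 200
dhrystone = np.array([57.0, 56.0, 24.2])
nsieve    = np.array([50.2, 43.8, 17.0])

for name, x in (("Dhrystone 1.1", dhrystone), ("nsieve", nsieve)):
    r = np.corrcoef(x, specint89)[0, 1]              # linear correlation coefficient
    slope, intercept = np.polyfit(x, specint89, 1)   # least-squares line
    print(f"{name}: r = {r:.3f}, SPECint89 = {intercept:.3f} + {slope:.4f} * x")
    print("  predictions:", np.round(intercept + slope * x, 1))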

Please note that the correlations and relationships established above
are really _only_ valid for the 9 data samples we had to work with. It
would be erroneous to take any other results and throw them into the
equations and think those results were correct. They probably won't be.
We have not done enough work for that. Besides it was already indicated
the correlation breaks down as the sample size increases.

My main concern is that we do things correctly. I think that we really
need to start understanding the errors in our measurements (benchmark
results). Until we do, I think we are just going to keep making lots of
mistakes, and blatant errors, with those measurements. We are really
on shaky ground when we compare benchmark results and have no idea
of the magnitude of the error in those measurements.

Al Aburto
abu...@marlin.nosc.mil

-------


M. Edward Borasky

Sep 1, 1992, 4:56:48 PM
In article <Btwo8...@nntp-sc.Intel.COM> jwre...@mipos2.UUCP (Jeffrey Reilly) writes:
>...
>used a regression analysis to derive the following equation:
> SPECmark = (0.77 x mips) + (0.70 x mflops) - 3.27.
Question: Where do the MIPS and MFLOPS numbers come from?

>SPEC's initial thought was to take the marketing philosophy that
>there is no such thing as bad publicity, and ignore the article.
Is this "marketing philosophy" meant to imply that those in
marketing have this philosophy?

>A SINGLE NUMBER FOR THE PRESS
>SPEC has always acknowledged that system performance couldn't be
>accurately represented by a single number. ...
> ... Only in consideration of the
>press, did SPEC generate a composite of the 10 numbers, the
>SPECmark.
Seems pretty inconsiderate to me to hold that performance can't
be accurately represented by a single number, then create one
"in consideration of the press." Single numbers are created
because people ask for them. Averages are created because
people don't trust peak speeds. It would be nice if SPEC could
estimate the appropriate-dimensional Amdahl's Law surfaces for
their benchmarks; if they ask me (REAL NICELY) I'll show them how
to do it. In the absence of this kind of analysis, an average
(with appropriate confidence limits, which I can also show them
how to do) seems to be a pretty reasonable representation of relative
computer performance, especially when dealing with commercially-
successful systems which are, since they ARE commercially successful,
by nature balanced for reasonably common workloads.
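
As an illustration only of what confidence limits on a composite might
look like, here is one way to attach a 95% interval to a geometric
mean, computed on the log scale with a t critical value (the SPECratios
below are placeholders):

import numpy as np

ratios = np.array([17.5, 22.0, 19.3, 30.1, 24.4, 16.8, 21.0, 26.2, 18.9, 23.5])

logs = np.log(ratios)
n = len(logs)
gm = np.exp(logs.mean())
t = 2.262                                  # t critical value for 95%, n - 1 = 9 degrees of freedom
half = t * logs.std(ddof=1) / np.sqrt(n)
low, high = np.exp(logs.mean() - half), np.exp(logs.mean() + half)
print(f"geometric mean {gm:.1f}, 95% CI ({low:.1f}, {high:.1f})")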

I went through all this stuff several years ago; I went around and
around with all of these issues and I found things that worked.
Admittedly, I was dealing with the simpler case of supercomputers:
CPU bound applications, performance measured in megaflops. And I
was working in a marketing organization at the time.

>REGRESSION ANALYSIS
>A regression analysis in some fields seems to be a reasonable way
>to estimate future values. Horse racing, weather predicting, and
>computer performance predictions seem to fall outside the
>predictive capabilities of this tool.

1. For sprint races, where raw speed is the most important
factor, you can ALMOST make a profit by betting on the horse
(or DOG) with the lowest AVERAGE time for the given race
length. It has a considerably better expectation than the ball-
type lotteries, Las Vegas slot machines or buying call options
on Intel in a bear market.

2. Least squares techniques are one of many methods used in
computational fluid dynamics, including weather forecasting.

3. I use non-linear regression to predict supercomputer performance
by fitting Amdahl's Law curves to benchmark data. It works. It's
a little more complicated than the above-noted formula, but it works.
The more data points I have the better it works. I can show other
people how to do it.
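
The exact model being fitted is not given here, but a two-parameter
Amdahl-style rate curve fitted by non-linear least squares might look
roughly like this (the data points are invented for illustration):

import numpy as np
from scipy.optimize import curve_fit

def amdahl_rate(f, r_scalar, r_vector):
    # Delivered rate when a fraction f of the work runs at the vector
    # rate and (1 - f) at the scalar rate (harmonic combination).
    return 1.0 / ((1.0 - f) / r_scalar + f / r_vector)

# illustrative data: (vectorizable fraction, measured MFLOPS) per kernel
f_obs = np.array([0.10, 0.30, 0.50, 0.70, 0.90, 0.99])
r_obs = np.array([11.1, 14.1, 19.5, 31.5, 81.6, 288.0])

(r_s, r_v), _ = curve_fit(amdahl_rate, f_obs, r_obs, p0=[10.0, 500.0])
print(f"fitted scalar rate {r_s:.1f} MFLOPS, vector rate {r_v:.1f} MFLOPS")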

William L Larson

Sep 2, 1992, 10:15:48 AM
In article <42...@ogicse.ogi.edu> bor...@ogicse.ogi.edu (M. Edward Borasky) writes:
>It would be nice if SPEC could
>estimate the appropriate-dimensional Amdahl's Law surfaces for
>their benchmarks; if they ask me (REAL NICELY) I'll show them how
>to do it.
>
>3. I use non-linear regression to predict supercomputer performance
>by fitting Amdahl's Law curves to benchmark data.

I may be dense, but what is Amdahl's Law? (Brief descriptions only, there
is no need for large volumes).

Eugene N. Miya

Sep 2, 1992, 1:25:47 PM
In article <1992Sep2.1...@cs.sandia.gov> wll...@sandia.gov (William L Larson) writes:
>I may be dense, but what is Amdahl's Law? (Brief descriptions only, there
>is no need for large volumes).

I suggest reading the paper. In fact one of the leading men in
parallel processing seems to think it should be re-read weekly (ak).
It is not particularly difficult, and very short.

%A Gene M. Amdahl
%T Validity of the single processor approach to achieving large scale computing
capabilities
%J AFIPS Proc. of the SJCC
%V 31
%D 1967
%P 483-485
%K grecommended91,
%K bmiya,
%K ak,
%X should be reread every week
%X Well known (infamous ?) Amdahl's law that suggests that if x %
of an algorithm is not parallelizable then the maximum speedup is 1/x.

And since you are at Sandia, I suggest trying "man amlaw" on UNICOS
and trying the amlaw command.

Its companion paper should also be read (even less technical).

%A D. L. Slotnick
%T Unconventional Systems
%J Proceedings AFIPS Spring Joint Computer Conference
%V 31
%D 1967
%P 477-481
%K grecommended,
%K btartar
%X The `Pro' side of the classic debate with Gene Amdahl on the future
of array and multi processors. Rather inflammatory introduction.
Not a very technical article, but it should be read. Short.

Daniel unfortunately passed away a few years ago.

--eugene miya, NASA Ames Research Center, eug...@orville.nas.nasa.gov
Resident Cynic, Rock of Ages Home for Retired Hackers
{uunet,mailrus,other gateways}!ames!eugene
Second Favorite email message: Returned mail: Cannot send message for 3 days

Joel Williamson

Sep 2, 1992, 2:18:25 PM
In article <1992Sep2.1...@cs.sandia.gov> wll...@sandia.gov (William L Larson) writes:

In any system with two or more modes of operation, system performance is
dominated by the slowest mode. Thus, in a scalar/vector system, scalar
performance dominates. A parallel system is dominated by the
performance of the serial portion of a program.

Example:

A program on a single scalar processor takes 100 seconds to
complete. Imagine that an infinitely fast vector unit is added and that
half of the program can be vectorized. Now the program takes 50 seconds
to complete, so performance only doubles even though half the program
runs in zero time.

Likewise, run the program on an MPP with 1024 processors. Again
imagine that half the code can be parallelized, and you get the same
result as before. This time you increased processing power by 3 orders
of magnitude and performance only doubled. If you parallelize 99% of
the code, time only drops to about 1.1 seconds (1 second of serial code
plus 99/1024 seconds of parallel work), so speedup on 1024 processors
is only about a factor of 91 -- serial performance dominates.
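
For anyone who wants to recompute these cases, here is a minimal C
sketch of the arithmetic (not part of the original example; the
100-second program and the 0.5 and 0.99 parallel fractions are the
ones used above):

/* amlaw.c -- run time and speedup under Amdahl's Law */
#include <stdio.h>

/* total time when a fraction p of a t-second job is spread over
 * n processors and the rest stays serial */
static double amdahl_time(double t, double p, double n)
{
    return t * ((1.0 - p) + p / n);
}

int main(void)
{
    double t = 100.0;   /* single scalar processor time, as above */

    printf("p=0.50, infinite units: %6.2f s, speedup %5.1f\n",
           amdahl_time(t, 0.50, 1e12), t / amdahl_time(t, 0.50, 1e12));
    printf("p=0.50, 1024 procs:     %6.2f s, speedup %5.1f\n",
           amdahl_time(t, 0.50, 1024.0), t / amdahl_time(t, 0.50, 1024.0));
    printf("p=0.99, 1024 procs:     %6.2f s, speedup %5.1f\n",
           amdahl_time(t, 0.99, 1024.0), t / amdahl_time(t, 0.99, 1024.0));
    return 0;
}

It prints 50 seconds (speedup 2) for the half-parallel cases and about
1.1 seconds (speedup about 91) for the 99% case on 1024 processors.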

Best regards,

Joel Williamson
--

Clark L. Coleman

unread,
Sep 4, 1992, 5:02:45 PM9/4/92
to
In article <1992Aug31.0...@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>
>In Article <1992Aug26.1...@murdoch.acc.Virginia.EDU>
>cl...@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
>>Similar poor correlations will be obtained for two different systems
>>with very different cache sizes. Compare the HP9000/720 to a smaller
>>cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
>>Spring, 1991, numbers:
>>
>> SPECint89 Dhrystone 1.1 MIPS MIPS/SPECint89
>> --------- ------------------ --------------
>>HP 9000/720 39.0 57 1.46
>>DEC 5000/200 19.0 24.2 1.27
>>IBM RS6000/550 34.5 56 1.62
>>
>>If I didn't have SPECint89 numbers, but wanted to derive them from
>>available Dhrystone MIPS numbers, the third column above would indicate
>>that I have a tough job ahead of me.
>
>
>But they ARE correlated! You can see it just by looking at the
>SPECint89 and Dhrystone1.1 numbers. It is incorrect to use the third
>column (above) to make any predictions or draw conclusions as it
>consists of ratio's of the raw data (program, 'benchmark', results).
>I'll explain below.

I'll take the liberty of not including the text of your explanation, although
it was a good one, because I think we just aren't communicating here.

Here is my perspective: I am trying to determine how fast various machines
are. We are buying some workstations soon at my company, Acme Tool and Die.
My boss doesn't see what the big deal is about all this benchmarking stuff,
and doesn't want to get loaner machines from multiple vendors, port our
code to each, time the results, etc. He says it would take too much time,
as the porting of our code turns out to be nontrivial. So we are going to
stick to standard benchmarks. Unfortunately, he didn't buy my arguments
against trying to use a single benchmark number; he refuses to chart out
every SPECint result for his boss when he makes the final proposal for
what workstations to buy.

Now, I have Dhrystone 1.1 MIPS numbers available for various machines. I
have read the marketing literature, and they all assure me that only those
compiler optimizations that were specified by Reinhold Weicker as being
allowable for Dhrystone were used (no inlining, for example.) So I feel
pretty good about these numbers, as Dhrystone 1.1 numbers go.

I also have some SPECint92 numbers, and some SPECint89 numbers, but not
complete lists of both for all interesting machines, and neither one of
them for some machines.

Our applications rarely use floating point data, and are not heavy on
graphics or I/O, either. So, my boss tells me to just rank the machines
by their Dhrystone 1.1 MIPS numbers, and he will look over the results.
He is smart enough not to make a big deal out of one machine having 51
MIPS while another has 49 MIPS, but he wants this MIPS list as a rough
guide to integer performance.

The $64,000 question is: Are we on reasonably safe ground to use Dhrystone 1.1
MIPS in lieu of the SPECint92 numbers we wish we had?

You have made the statement that "There is a very high correlation between
SPECint and Dhrystone 1.1 MIPS", or something similar, several times. I see
two possibilities here:

1) The fact that the two numbers correlate highly does not necessarily imply
that one is a good substitute for the other if we are trying to get a
reasonably accurate ranking of the various candidate machines.

2) The correlation DOES indicate that Dhrystone 1.1 MIPS is pretty much as
good as SPECint92, if all you want is a single number for integer CPU
speed (not I/O or cache constrained performance.)

If you tell me #1 is the case, then your regression and correlation are of
pedantic interest only, and I see no point in continuing to debate this
matter any further.

If you say that #2 is the case, I have a very simple disproof.

Let's say that my list of machines includes the new, souped up version of
the DECstation 5000/200, with the clock sped up by a factor of 2.35, and
the memory and cache proportionately faster to keep up with it. I will
assume that SPECint92 tracks SPECint89 here, because I only have a
SPECint92 number for the 36 MHz SPARCstation 10 that I am about to use.
The new DEC
machine has 44.8 SPECint92, and 57.0 Dhrystone 1.1 MIPS. These are in
direct 2.35 to 1 ratios to the DEC 5000/200 numbers above, so the new machine
will not disturb your old regression and correlation at all.

Now, on my list, I have shown my boss his choices, and one of them is the
36 MHz SPARCstation 10, which shows up with 86 Dhrystone 1.1 MIPS. I don't
have the SPECint92 yet --- their marketing department is working on it.
My boss decides that there might be some error in the MIPS values ("spread"
as you put it), but as there is a high correlation between the SPECint92
and the MIPS (he read this on the Internet somewhere :-) ), he isn't too
worried that the SPECint92 numbers will be very different when they come
out. So, he sees a 33% MIPS increase in the SPARCstation 10 over the
new DEC machine, for only 10% more cost, and figures that SPECint92 will
probably show the same 33% increase, or close to it. After all, this
highly touted statistical correlation must have some real world value,
right?

We buy the SPARCstation 10 machines. A month later, Sun releases their
SPECint92 numbers : 44.8, the exact same as the new DEC machine. So, we
have:

Machine: DEC5000/200super Sun SS-10 (36 MHz)
-------- ---------------- ------------------
MIPS 57 86
SPECint92 44.8 44.8

NOTE: The above numbers are 2.35 to 1 ratios for the DEC5000/200, and so could
reflect a hypothetical but reasonable speed up of that architecture. The Sun
numbers are actual numbers from Sun.

In examining the Sun machines, we find that there was almost exactly a doubling
of SPECint92 performance from the SS-2 to the SS-10 at 36 MHz, but there was
a tripling of the Dhrystone 1.1 MIPS. Which is the better indicator of integer
CPU speed? I contend that Dhrystone (any version) is rapidly becoming obsolete
EVEN IN THE SINGLE NUMBER BENCHMARKING world. I gave detailed reasons in a
previous posting that relate to superscalar instruction scheduling.

The reason that MIPS/SPECint92 ratios matter, despite your previous objections,
is that widely different ratios will create a large spread between the
realistic integer CPU performance expectations for a machine and the Dhrystone
1.1 MIPS estimate of its integer CPU performance. Based on the ratios that are
found between the SS-10 and the SS-2 on better benchmarks than Dhrystone (such
as SPECint92), the SS-10 should have about 57 Dhrystone MIPS. That it has 86
MIPS instead gives us a large range (from 57 to 86) in which we can expect
to find a competing machine someday (if not already.) If that competing
machine is not superscalar in its integer functional units, it will be likely
that its 70 or so Dhrystone 1.1 MIPS are a better indicator of its integer
performance than the 86 MIPS are for the SS-10; and it will be likely that
its SPECint92 will be significantly higher than the 44.8 of the SS-10. We
will then have a pair of machines where one has an 86 to 70 edge in MIPS,
and the other has a 58 to 45 edge in SPECint92. (I would not be surprised
to find that this relationship already exists between the SGI Crimson and
the SS-10 today.) In which case, I should forget Dhrystone MIPS and stick
to SPECint92, which was my whole point in the first place. Q.E.D.

P.S. The correct possibility between the two given above is #1. The
correlation indicates that at the time you measured and did your regression,
CPU architectures were reasonably similar to each other in their behavior
on SPECint89 and Dhrystone 1.1 code. That the SS-10 demonstrates that
this is no longer true is reason enough to abandon Dhrystone timings for
the future. The other statistical point is that we benchmark in order
to compare machines (two competing new ones, or an upgraded machine and
our old machine) in order to provide objective input to purchasing decisions
and architecture evaluations. If you get 100 machines built with about the
same RISC philosophy, and 1 more machine that is the only superscalar machine
in the group, that one outlying point will not disturb your correlation
appreciably. But, if my purchasing decision comes down to one of the
conventional machines versus that one outlying point, we don't have the
other 99 data points to average in and make the fitted curve almost ignore
the outlier. What we have is a head to head comparison, and the question,
"Should I pay attention to the Dhrystone numbers when comparing these two
machines?" The answer is a resounding "No." SPECint92 will be a better
point of comparison because it is composed of real codes, and should have
more realistic characteristics with respect to superscalar scheduling than
the Dhrystone 1.1 code. You will notice also that the little table of 3
machines in my previous posting, included above, shows its highest ratio
of MIPS to SPECint89 by far on a superscalar IBM RS6000. When you come
down to an evaluation of 2 or 3 machines, the fact that one of them was
an outlier that did not disturb the regression you mention is not very
comforting to know. The correlation is, at that point, irrelevant.


--

Alfred A. Aburto

unread,
Sep 9, 1992, 9:23:58 PM9/9/92
to
-------
There are several points to be made with regards to the correlation
of Dhrystone 1.1 and SPECint89:

(1) You posted an article with an example of Dhrystone 1.1 and
SPECint89 results for 3 systems. You used the MIPS/SPECint89 ratio
to 'prove' that there was no correlation.

(2) I looked at the same results and saw a correlation. I showed
qualitatively that the results were roughly correlated (Dhrystone1.1
and SPECint89 results). I went through the mathematics too and that
result showed (for the data we were looking at) a strong correlation
between Dhrystone1.1 MIPS and SPECint89 (0.982). I also added in the
nsieve program MIPS results and there we found an even stronger
correlation between nsieve MIPS and SPECint89 (0.999). Similarly, if
we had examined the individual SPEC integer programs (GCC, ESP, LI,
and EQN) we would have found (I did this later) that GCC, ESP, LI,
and EQN are all correlated amongst themselves and also with SPECint89,
Dhrystone1.1 MIPS, and nsieve MIPS. The degree of correlation varied,
but in all cases it was relatively high (greater than 0.90).
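
For anyone who wants to check that 0.982 figure, here is a minimal C
sketch of the Pearson correlation over the three Dhrystone1.1 MIPS /
SPECint89 pairs quoted earlier in the thread (HP 9000/720,
DEC 5000/200, IBM RS6000/550):

/* pearson.c -- Pearson correlation of Dhrystone 1.1 MIPS vs SPECint89
 * for the three systems quoted earlier in the thread.
 */
#include <stdio.h>
#include <math.h>

#define N 3

int main(void)
{
    double mips[N] = { 57.0, 24.2, 56.0 };   /* Dhrystone 1.1 MIPS */
    double spec[N] = { 39.0, 19.0, 34.5 };   /* SPECint89          */
    double mx = 0.0, my = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0;
    int i;

    for (i = 0; i < N; i++) { mx += mips[i]; my += spec[i]; }
    mx /= N; my /= N;
    for (i = 0; i < N; i++) {
        double dx = mips[i] - mx, dy = spec[i] - my;
        sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
    }
    printf("r = %.3f\n", sxy / sqrt(sxx * syy));
    return 0;
}

Compiled with 'cc pearson.c -lm', it prints r = 0.982, matching the
value quoted above.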

(3) The results in (2) clearly showed that the conclusion drawn
in (1) was incorrect. The conclusion in (1) was incorrect because the
ratio "MIPS/SPECint89" destroyed the correlation and misled the
author, causing an incorrect conclusion to be drawn. This was an almost
classic example of how numbers, WITH NO REGARD FOR THE ERRORS INVOLVED
IN THE MEASUREMENT OF THOSE NUMBERS, can lead us astray and cause us to
draw inappropriate, incorrect, and erroneous conclusions. This was the
most important point I wanted to make. It really has nothing to do with
correlation, Dhrystone1.1 MIPS or nsieve MIPS versus SPECint89, or
anything like that. I want to repeat what I said before because I think
it is important to do so. We stand on very shaky ground when making
comparisons (of performance) when we have no idea of the magnitude of
the error involved in the things (measurements) we are comparing. It
is time we stopped thinking of benchmark results (system measures of
performance) as having zero error. The error could be small or it may
be quite large ...

I know there is some intrinsic error involved because I see it
when plotting various results. For example, the Dhrystone1.1 MIPS results
versus system show a lot more scatter (fluctuation, variance, peaks,
error, spread, or whatever one may want to call it) than the SPECint89,
GCC, ESP, LI, or EQN SPECratio results versus the same systems. In my
view this makes Dhrystone1.1 more dangerous to use when comparing
system A with system B --- dangerous because this 'error' (or degree of
fluctuation if you will) appears to be rather large (maybe 2 to 3 times
larger than I see in the other program results). The fact that Dhrystone
shows more error than the SPECratios or SPECint appears to be the main
difference between them. Otherwise they all pretty much paint the same
picture in terms of system performance (i.e., they are all correlated).
By the way, I know what causes the larger fluctuation in the Dhrystone
results. It is merely that some systems (including compiler) just do a
better job of optimizing the Dhrystone program than other systems. If
for example we were to somehow force all systems to optimize Dhrystone
the same way (to the same degree) then the fluctuation would be greatly
reduced.

I'm not sure right now exactly how to quantify the errors in
general. It is definitely not easy to do, but if we are going to advance
further in the benchmarking field I think it must be done. I do know
however if we time M different programs on N systems we can in fact
determine the error due to the M different programs for each of the N
systems. Certainly I think SPEC should start paying more attention to the
error involved in their estimates of system performance (SPECint and
SPECfp).
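
One hedged way to make that concrete (my illustration, not a procedure
from any of the posts): normalize each program's rating so its mean
across the N systems is 1.0, then report the spread of the M normalized
ratings on each system. The table in the sketch below is invented
purely to show the bookkeeping.

/* spread.c -- per-system spread due to using M different programs:
 * normalize each program across the N systems, then report mean and
 * standard deviation of the M normalized ratings for each system.
 * The ratings below are invented for illustration.
 */
#include <stdio.h>
#include <math.h>

#define M 3   /* programs */
#define N 3   /* systems  */

int main(void)
{
    /* rating[program][system], higher is faster */
    double rating[M][N] = {
        { 57.0, 24.0, 56.0 },
        { 39.0, 19.0, 34.0 },
        { 44.0, 21.0, 51.0 }
    };
    double norm[M][N];
    int p, s;

    /* normalize each program so its average over the systems is 1.0 */
    for (p = 0; p < M; p++) {
        double avg = 0.0;
        for (s = 0; s < N; s++) avg += rating[p][s];
        avg /= N;
        for (s = 0; s < N; s++) norm[p][s] = rating[p][s] / avg;
    }

    /* per-system mean and standard deviation over the M programs */
    for (s = 0; s < N; s++) {
        double mean = 0.0, var = 0.0;
        for (p = 0; p < M; p++) mean += norm[p][s];
        mean /= M;
        for (p = 0; p < M; p++)
            var += (norm[p][s] - mean) * (norm[p][s] - mean);
        var /= (M - 1);
        printf("system %d: relative rating %.2f +/- %.2f\n",
               s + 1, mean, sqrt(var));
    }
    return 0;
}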


(4) Now here comes the real clincher. In (1) you picked 3 systems
with Dhrystone1.1 MIPS and SPECint89 results and attempted to show via
the 'MIPS/SPECint89' ratio that there was no correlation. Well, I came
along and in (2) showed clearly that there was in fact correlation,
and that the conclusions drawn in (1) were erroneous. You did not pick
your 3 examples wisely, because I can pick 3 other different system
results and prove mathematically ('rigorously') for these samples that
there is in fact NO correlation between Dhrystone1.1 MIPS and SPECint89
(in fact for the 3 different systems I'm looking at, the correlation is
terrible at -0.191)! I will post these results if you wish, but I hope
for now you can just believe I can in fact prove NO correlation by
appropriately selecting a different set of 3 system results than you
chose in (1).

I hope now you can see why we can just about prove _anything_
by selectively, or randomly, picking isolated (or small groups of)
'benchmark' numbers out of a rather large basket of ad hoc results.

The only way out of this quagmire is to gain an understanding of the
errors associated with our measures (estimates) of system performance
and to use sufficiently large data bases so that the measures of
performance are stable and relatively independent of sample size.

Personally, I would never use Dhrystone alone or SPECint alone.
Instead I would gather as much information as possible, using as many
different programs as possible and as many different systems as
possible. If for a given program and system I had a lot of data of the
same type (a SPEC value for example from the vendor and 4 other
different SPEC values from other independent sources) I would select
the median (or maybe peak value) as the representative measure of
performance for that program and system. I would then sort the results
in order of increasing performance and plot them for each program. I
might even do a least squares curve fit to the results to gain a
clearer picture of the relative ranking of the various system results.
After this I would feel 'somewhat safe' in drawing a conclusion and
in considering other factors such as cost, etc.

The first thing, though, is that I need good data bases of results (well
documented results). We really need good data bases, and we could
certainly use more good test programs of all types.

(5) Well, what about the correlation of Dhrystone1.1 and SPECint?
I have extended my own data base of results to 32 different systems
(previously 18 different systems) and the high correlation still holds.
It is even somewhat better than before. I am happy about this result
because there seems to be at least something in benchmarking that makes
sense. However because of the dangers of unknown errors and inadequate
sample sizes I discussed previously I would never use Dhrystone1.1 or
SPECint alone in drawing any conclusions. I would use Dhrystone1.1,
SPECint, nsieve and any other information I can get ahold of ...

Al Aburto
abu...@marlin.nosc.mil

-------

Patrick F. McGehearty

unread,
Sep 10, 1992, 12:11:25 PM9/10/92
to
In article <1992Sep10....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
some solid discussion about the need to realize that measurements have
errors.

>We stand on very shaky ground when making
>comparisons (of performance) when we have no idea of the magnitude of
>the error involved in the things (measurements) we are comparing. It
>is time we stopped thinking of benchmark results (system measures of
>performance) as having zero error. The error could be small or it may
>be quite large ...
>
I would like to elaborate on this theme. A common error is to assume
different benchmarks are random samples from a data space which are
independent from each other. In truth, they are not independent.
Thus, all of our developed intuition about the normal distribution,
standard deviations, etc, etc, can be very misleading.

I would suggest we think of errors in terms of modeling errors instead of
sampling errors. The model (the benchmark) is likely to be systematically
different from the reality (a particular user's workload) in ways that
are different for each user. If a variety of benchmarks all indicate
the same system is superior, and there is some reason to believe that
those benchmarks are in the same domain as target use of the system,
then selection decisions are easy. When different benchmarks give different
predictions about the best choice, then further study is necessary to
determine which benchmarks most closely model the intended use of the
machine. To do that, we need to understand what a benchmark is really
measuring, not what it claims to be measuring.

<The following is an aside about Dhrystone, unrelated to the rest of
the discussion. Skip it if you are tired of hearing about Dhrystone.>

Dhrystone is particularly flawed relative to today's technology, which is
why any attempt to justify its continued use provokes widespread cries of
dismay among those who have studied it carefully. A single optimization
which has no importance for any real application that I have seen (at least
in the scientific market segment) makes a 20% performance difference on
Convex machines. Which Dhrystone number would you use? The one with or
without the optimization? Your real application performance will be the
same in either case.
[That optimization is doing an inline copy of a fixed length string
in C instead of calling the strcpy routine. The inline version
moves data 8 bytes at a time, while the library routine must examine
each byte of the source string for a possible zero to terminate the copy].
In addition, Dhrystone is heavily dominated by subroutine calls. The
typical Dhrystone subroutine call only executes a few operations before
returning, which is less than most C code I have looked at, and far less
than most Fortran code I have seen. Further, the few loops it does contain
are only executed once. If an optimizing compiler spends two operations
outside a loop to save one operation inside a loop, it will perform less
well on Dhrystones than not doing this optimization. It was a reasonable
attempt for its time, but we have much better benchmarks available to us
now, such as those provided by SPEC and Perfect.
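
For the curious, here is a rough C sketch of the two code shapes
contrasted in the bracketed note above; it is not the Dhrystone source
or any vendor's library, just an illustration of why an
8-bytes-at-a-time copy of a known-length, aligned buffer can beat a
byte-by-byte strcpy that must test for the terminating zero.

/* copy.c -- illustrative only, not the actual Dhrystone code */
#include <stdio.h>

#define LEN 32                  /* a fixed, known string length */

union buf { char c[LEN]; double d[LEN / 8]; };  /* keeps 8-byte alignment */

/* what a library strcpy must do: test every byte for the final '\0' */
static void byte_copy(char *dst, const char *src)
{
    while ((*dst++ = *src++) != '\0')
        ;
}

/* what an inlining compiler can emit when the length is a known,
 * aligned constant: move 8 bytes per iteration, no terminator test */
static void block_copy(union buf *dst, const union buf *src)
{
    int i;
    for (i = 0; i < LEN / 8; i++)
        dst->d[i] = src->d[i];
}

int main(void)
{
    union buf a = { "DHRYSTONE PROGRAM, SOME STRING" };
    union buf b, c;

    byte_copy(b.c, a.c);
    block_copy(&c, &a);
    printf("%s\n%s\n", b.c, c.c);
    return 0;
}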

In so far as Dhrystone correlates with SPECint for some machines, I believe
that it does so because subroutine call overhead correlates with integer
performance for these machines. Vector machines, VLIW machines, superscalar
machines all have potential for executing integer operations much faster
than subroutine calls. Using Dhrystones and its correlation with SPECint
for traditional architectures to predict performance for current advanced
architectures is as likely to be correct as throwing darts blindfolded.

<<End of Dhrystone discussion>>


> I'm not sure right now exactly how to quantify the errors in
>general. It is definitely not easy to do, but if we are going to advance
>further in the benchmarking field I think it must be done. I do know
>however if we time M different programs on N systems we can in fact
>determine the error due to the M different programs for each of the N
>systems. Certainly I think SPEC should start paying more attention to the
>error involved in their estimates of system performance (SPECint and
>SPECfp).
>

I agree that error quantification is not easy, and any particular approach
may be very misleading for some classes of benchmark users. A major area
for biasing the results is in determining a base value for each program.
In "The Art of Computer Systems Performance Analysis", Raj Jain devotes an
entire chapter to "Ratio Games", followed by several chapters on what
to do about it. Another area is in the selection of benchmarks. For
example, if there are 9 benchmarks which focus on one aspect of performance
(like subroutine calls), and 1 benchmark which focuses on another aspect
(like block copies), then machines which are better at the first will look
much better on the benchmark aggregate than those which are better at the
second. If the actual application mix uses those two aspect equally, the
average use of total benchmark set should also use each equally.
That does not mean each benchmark should use each equally, just be careful
not to bias results by too many benchmarks measuring a single aspect of
performance.
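
A toy calculation makes the bias point concrete (my numbers, not the
author's): suppose machine A is twice as fast as machine B on the
call-heavy benchmarks and half as fast on the block-copy benchmark,
and compare the aggregate under a 9:1 suite and a balanced one.

/* mix.c -- how the benchmark mix biases an aggregate rating.
 * Machine A is assumed 2x machine B on "call-heavy" benchmarks and
 * 0.5x on "block-copy" benchmarks; all numbers are invented.
 */
#include <stdio.h>
#include <math.h>

/* geometric mean of the A/B ratios for a suite with the given mix */
static double aggregate(int n_call, int n_copy)
{
    double log_sum = n_call * log(2.0) + n_copy * log(0.5);
    return exp(log_sum / (n_call + n_copy));
}

int main(void)
{
    printf("9 call benchmarks, 1 copy benchmark:  A/B = %.2f\n",
           aggregate(9, 1));    /* A looks about 1.74x faster */
    printf("5 call benchmarks, 5 copy benchmarks: A/B = %.2f\n",
           aggregate(5, 5));    /* A and B look identical     */
    return 0;
}

With the 9:1 mix, A looks 1.74 times faster overall; with the balanced
mix, the two machines tie.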

Walter Bays

unread,
Sep 10, 1992, 1:35:46 PM9/10/92
to
A year ago there was a column in a respected magazine by a respected
industry consultant purporting to show a simple formula to "accurately"
predict SPECmarks from (Dhrystone) MIPS and MFLOPS. As proof he showed
a list of machines with predicted versus actual SPECmarks. As I
remember, the error was something like from 2% to 24% -- on that very
set from which the formula was derived by least-squares fit.

He also predicted that the Sun 3/80 had a negative SPECmark.
(Does that mean it's *really* **slow**? Or that it completed the
benchmarks before he pressed RETURN?) Then the consultant computed the
formula for a couple of other DEC machines with the same architecture
(MIPS) as part of the original set, and the error was something like
12% to 40%.

But now I have discovered an even better SPECmark predictor among the
lesser-known writings of Nostradamus who, writing in the 16th century,
foretold the development of computers, the universal use of the
Dhrystone benchmark, the creation and widespread adoption of SPEC,
Reinhold Weicker, the creator of Dhrystone, becoming a major contributor
to SPEC, and joint ventures between IBM and Apple.

Nostradamus foretold that Sir Arthur Conan Doyle would write the
Sherlock Holmes stories, that they would be printed in anthology by
Baring-Gould, and that encoded in their text would be all the
secrets of computer performance evaluation. To foretell the SPECmark
of a computer, you take the first three digits of its model number and
turn to that page in the book. Scan to the first word beginning with
the first letter of the manufacturer's name and begin reading. Encode
each letter as a number, A=1, B=2, etc. Take the first seven numbers,
corresponding to the seven planets of the solar system. Multiply each
by the corresponding number in this list, and sum the products:
-0.266, 0.191, -0.367, 0.052, -0.145, 0.654, 0.881. The result is the
SPECmark.

To test the accuracy of this medieval mystic, I applied his formula to
SPEC results published in the Winter 1990 newsletter.

Data General Aviion 310 page 310, "dark over this matter"
Data General Aviion 5010 page 501, "Douglas, said the inspector"
MIPS RC 3260 page 326, "moved to rooms in"
MIPS RC 3240 page 324, "much has been written"
Sun 4/490 page 490, "subject with a wave"
Sun 4/330 page 330, "streak not only of"
Sun SPARCstation 1 page 1, "some who will read"

The letter codes are then used to compute SPECmarks. As you can see,
the maximum error of the Nostradamus method is 0.1%. This method is
far less laborious and error prone than actually running benchmarks.
And the accuracy is better than the dice roll, tea leaves, linear
regression, tarot card, or chicken entrail methods.

---Planetary Codes-------------- Predicted Actual
DG Aviion 310 4 1 18 11 15 22 5 9.71 9.7
DG Aviion 5010 4 15 21 7 12 1 19 10.1 10.1
MIPS RC 3260 13 15 22 5 4 20 15 17.3 17.3
MIPS RC 3240 13 21 3 8 8 1 19 16.1 16.1
Sun 4/490 19 21 2 10 5 3 20 17.6 17.6
Sun 4/330 19 20 18 5 1 11 14 11.8 11.8
Sun SS 1 19 15 13 5 23 8 15 8.41 8.4

Unfortunately, the text of the original manuscript is obscured in later
passages, so we do not know how to predict SPECint92 nor SPECfp92. Oh
well, I guess we'll have to keep running benchmarks after all.

---
Walter Bays walte...@eng.sun.com
Sun Microsystems, 2550 Garcia Ave., MTV15-404, Mountain View, CA 94043
(415) 336-3689 FAX (415) 968-4873

Alfred A. Aburto

unread,
Sep 14, 1992, 9:18:21 PM9/14/92
to
In article <1992Sep10....@news.eng.convex.com> pat...@convex.COM (Patrick F. McGehearty) writes:
>In article <1992Sep10....@nosc.mil> abu...@nosc.mil (Alfred A. Aburto) writes:
>some solid discussion about the need to realize that measurements have
>errors.

Thanks. From the email I've received I was beginning to feel rather
defensive about Dhrystone. Mention anything remotely good about
Dhrystone and it seems to cause some people's hair to rise, fingernails
suddenly grow longer, fangs appear where there were none before, toenails
become claws, a smile turns into a snarl, and common sense and logic
go flying out the window ... :-)

>>
>>We stand on very shaky ground when making
>>comparisons (of performance) when we have no idea of the magnitude of
>>the error involved in the things (measurements) we are comparing. It
>>is time we stopped thinking of benchmark results (system measures of
>>performance) as having zero error. The error could be small or it may
>>be quite large ...
>>

>
>I would like to elaborate on this theme. A common error is to assume
>different benchmarks are random samples from a data space which are
>independent from each other. In truth, they are not independent.
>Thus, all of our developed intuition about the normal distribution,
>standard deviations, etc, etc, can be very misleading.
>

Yes, those measures can be misleading because they need to be calculated
relative to the probability density, but generally in benchmarking our
samples are drawn from unknown distributions. But there are measures of
performance ('non-parametric') that are independent of a particular
distribution. The cumulative distribution and median are examples. The
Spearman rank correlation is another.
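
As an illustration (the sample values below are invented, loosely
MIPS-like and SPECint-like), here is a minimal C sketch of the Spearman
rank correlation, assuming no tied values:

/* spearman.c -- Spearman rank correlation, assuming no tied values.
 * rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
 * difference between the ranks of x_i and y_i.
 * The sample data are invented for illustration.
 */
#include <stdio.h>

#define N 5

/* rank of x[i] within x[0..n-1]: 1 = smallest (assumes no ties) */
static int rank_of(const double *x, int n, int i)
{
    int j, r = 1;
    for (j = 0; j < n; j++)
        if (x[j] < x[i])
            r++;
    return r;
}

int main(void)
{
    double x[N] = { 24.2, 56.0, 57.0, 86.0, 40.0 };  /* MIPS-like    */
    double y[N] = { 19.0, 39.0, 34.5, 44.8, 30.0 };  /* SPECint-like */
    double sum_d2 = 0.0;
    int i;

    for (i = 0; i < N; i++) {
        int d = rank_of(x, N, i) - rank_of(y, N, i);
        sum_d2 += (double) d * d;
    }
    printf("Spearman rho = %.3f\n",                 /* 0.900 here */
           1.0 - 6.0 * sum_d2 / (N * ((double) N * N - 1.0)));
    return 0;
}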

>
>I would suggest we think of errors in terms of modeling errors instead
>of sampling errors. The model (the benchmark) is likely to be
>systematically different from the reality (a particular user's workload)
>in ways that are different for each user. If a variety of benchmarks
>all indicate the same system is superior, and there is some reason to
>believe that those benchmarks are in the same domain as target use of
>the system, then selection decisions are easy. When different
>benchmarks give different predictions about the best choice, then
>further study is necessary to determine which benchmarks most closely
>model the intended use of the machine. To do that, we need to
>understand what a benchmark is really measuring, not what is claims to
>be measuring.
>

That is a good point about modeling errors. An important point. Linpack
for example tests specific algorithms and a specific instruction mix,
flops examines performance from a different perspective using different
algorithms and instruction mixes, and the Livermore Loops yet another.
Linpack MFLOPS ratings are widely quoted, but I'm absolutely sure that
one should not use Linpack to ascertain relative performance gains for
codes or instruction mixes substantially different than that used in
Linpack. The user needs to be careful and should not blindly accept
a given benchmark result as truth for all cases. I don't know if you
noticed, but when Frank McMahon posted a summary of Livermore Loops
results the HP 9000/730 had an LLOOPS rating of 16 MFLOPS while
Linpack gives 24 MFLOPS, and flops gives a rating of 27 MFLOPS. When
it comes to the IBM RS/6000 540 though, Linpack gives a high rating
while both LLOOPS and flops give much lower ratings (11 MFLOPS). We
must be careful and we must know what the programs are measuring, as
you said.

There are real errors too. These usually happen when a test program fails
due to some optimization. It has happened to Whetstone, Dhrystone,
Livermore Loops, and SPEC too. The failure mode is such that the
program's measure of performance has been compromised and it no longer
reflects real performance gains for a particular system. In the
Livermore Loops and SPEC these failures have been mended but not so
for Whetstone and Dhrystone.

[Some Dhrystone remarks about optimization left out]

I want to add however that all programs and benchmarks are sensitive
to compiler optimizations of one form or another and to different
degrees of optimization (some more sensitive than others). Dhrystone
is particularly sensitive to optimization unfortunately. That is, on
a particular system with no optimization one may get X Dhrys/sec
while with full optimization turned on one may get 3X Dhrys/sec. I
have a number of examples showing this. IBM and some HP compilers
are particularly good at optimizing Dhrystone.

The SPEC results are also sensitive to optimization. As a reference
point and to get a 'flavor' of how much compilers add to performance
I would like to see SPEC results reported with all optimizations
disabled and also, in addition, with full optimizations turned on.

>
>In so far as Dhrystone correlates with SPECint for some machines, I
>believe that it does so because subroutine call overhead correlates
>with integer performance for these machines. Vector machines, VLIW
>machines, superscalar machines all have potential for executing
>integer operations much faster than subroutine calls. Using
>Dhrystones and its correlation with SPECint for traditional
>architectures to predict performance for current advanced
>architectures is as likely to be correct as throwing darts
>blindfolded.
>
><<End of Dhrystone discussion>>
>

I'm not sure it is the subroutine call overhead that does it. After
all, so many compilers nowadays inline the small subroutines, removing
the call overhead altogether. It is simply that the results track
one another. The fastest systems tend to give both the highest
SPECint and Dhrystone ratings (roughly). The slowest systems also
tend to give both the lowest SPECint and Dhrystone ratings (again
roughly). The middle performers also tend to be in the middle for
both SPECint and Dhrystone (again roughly so). It is in the details
where things fall apart in a hurry. The SPECint89 results (the only
numbers I have --- about 70 different system results) appear to be
very stable. The arithmetic mean agrees very closely with the
geometric mean and the standard deviation is relatively small (less
than +/- 3). The distribution of results appears to be bi-modal but
I think that is just an artifact of an insufficient sample size. The
individual programs (GCC, ESP, LI, and EQN) are well correlated with
each other and also with SPECint89. The ESP results seem to show the
highest correlation with SPECint89, but not by much. The Dhrystone
results on the other hand, despite the apparent correlation with
SPECint89, show a huge 'spread' in performance as compared to
SPECint89. A plot of these results along with SPECint89 very clearly
shows that it is unsafe to use Dhrystone in making performance
comparisons across different systems (too much scatter). That's it
in a nutshell. And that's why I think we need to quantify the spread
in our measures of performance a bit better than we have --- to help
prevent us from drawing incorrect conclusions.

The apparent (because I still need more samples) correlation between
SPECint89 and Dhrystone is confirmed or disproved not by words, but by
actually testing it against a sufficiently large data set. I'm not advocating
anything, I am merely pointing out an unexpected and curious result
and I'd like to learn more about it, but I need more data. Also, as
I've indicated previously, it is not only Dhrystone but all the
integer programs I've looked at that seem to show this correlation.
GCC SPECratio89, LI SPECratio89, ESP SPECratio89, EQN SPECratio89,
nsieve MIPS, Dhrystone MIPS, and SPECint89 are all (apparently)
showing relatively high correlations with one another.

[I'm leaving out some good points you brought out about 'ratio games',
and how we can get misleading and biased results by not having a good
mix of different types of programs]

Al Aburto
abu...@marlin.nosc.mil
14 Sep 1992

-------

