Newsgroups: comp.benchmarks
From: abu...@nosc.mil (Alfred A. Aburto)
Date: Mon, 31 Aug 1992 00:23:56 GMT
Local: Sun, Aug 30 1992 8:23 pm
Subject: Re: Geometric Mean or Median

In Article <1992Aug26.160240.20...@murdoch.acc.Virginia.EDU>
cl...@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:

In article <1992Aug23.114309.3...@nosc.mil>
abu...@nosc.mil (Alfred A. Aburto) writes:

>>I have heard a lot of Dhrystone 1.1 bashing in the past. I tried to
>>understand just how 'bad' Dhrystone was several years ago (it seems)
>>by correlating Dhrystone 1.1 results with SPECint89 results. I had  
>>20 or so data points to work with. I thought perhaps there would be
>>little correlation and that Dhrystone really needed to be cast out
>>as a measure of performance, because of the greater confidence placed
>>in the SPEC results. Instead, they were highly correlated (0.92
>>correlation or so).  Well, I had to revise my thinking.
>This kind of bogus correlation was debunked long ago by SPEC. As soon as
>more data points are added, the correlation gets worse. Try adding the
>SparcStation 10 numbers to your test, for example.

I didn't know SPEC had done that. I wish I had been told of the results.
I'm not surprised though, but I'm curious now to see what they did.  
Actually the results (20 different systems I think) were fairly
representative of various systems available, so I'm curious to see in
what manner the correlation broke down.

One of the problems with 'benchmarking' is the lack of good,
well-documented databases to work from.

>For that matter, just look at the HP 9000/710 versus the HP 9000/720.
>The only differences in the hardware are the larger caches on the 720.  
>Since the 710 has large enough caches for the Dhrystone code, but not
>for some SPECint codes, it produces the same Dhrystones as the 720 but
>significantly lower SPECint.

The issue here is cache size. We know that cache size is an important
factor in performance relative to a program's size (or cache utilization
size). Dhrystone is a small program and hence produces similar results,
as you say, with small caches and with big caches. Dhrystone is not adequate
to gain an understanding of performance trade-offs relative to cache
size. Other programs of varying size are needed to understand the
'spread' in performance due to cache size relative to program size. We
need to understand the limitations of our test programs and use them
appropriately. It is far (far) from the mark to think that Dhrystone
is the only test program one should use.

SPECint has problems here as well, because there are plenty of 'small'
programs available that will fit in the HP 9000/710 cache and will
perform just as well there as on the HP 9000/720. Yet, as you indicated, the
SPECint results do not reflect this fact. There are reasons HP built the
HP 710 and 720. Lower cost might be one of them (I don't know really).
Perhaps also HP felt that there was a segment of users who would be just
as happy with the smaller cache in the 710. They really didn't need a
larger cache. They would take a hit on performance sometimes with their
larger programs (SPECint type result), but in general the smaller cache
machine was adequate for their purposes (Dhrystone type result).

>Similar poor correlations will be obtained for two different systems
>with very different cache sizes. Compare the HP9000/720 to a smaller
>cache machine like an IBM RS/6000 or Sun SS2. For example, here are some
>Spring, 1991, numbers:

>                 SPECint89    Dhrystone 1.1 MIPS      MIPS/SPECint89
>                 ---------    ------------------      --------------
>HP 9000/720       39.0            57                      1.46
>DEC 5000/200      19.0            24.2                    1.27
>IBM RS6000/550    34.5            56                      1.62

>If I didn't have SPECint89 numbers, but wanted to derive them from
>available Dhrystone MIPS numbers, the third column above would indicate
>that I have a tough job ahead of me.

But they ARE correlated!  You can see it just by looking at the
SPECint89 and Dhrystone1.1 numbers. It is incorrect to use the third
column (above) to make any predictions or draw conclusions as it
consists of ratios of the raw data (program, 'benchmark', results).
I'll explain below.

I sorted the numbers in decreasing order and I added in the nsieve MIPS
results (see the table below). Forget about the individual magnitudes
because the scaling in each program is different. But look at the
numbers. They track one another. The step size from one result to the
next is different but overall the results are tracking fairly well. The
HP 720 ranks highest for all three program results. The DEC 200 ranks
lowest for all three program results. The IBM 550 ranks second in all
three program results. They are all telling the same story and they are
correlated. To check this qualitative correlation I also calculated the
mathematical linear correlation coefficient and the result shows that
they are all highly correlated. Correlation coefficients: SPECint89 to
Dhrystone1.1 = 0.982, SPECint89 to nsieve = 0.999, and Dhrystone1.1 to
nsieve = 0.988.

                 SPECint89  Dhrystone 1.1 MIPS    nsieve MIPS  
                 ---------  ------------------   --------------
HP 9000/720       39.0            57                  50.2
IBM RS6000/550    34.5            56                  43.8
DEC 5000/200      19.0            24.2                17.0
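
For anyone who wants to check the coefficients quoted above, here is a
minimal Python sketch that computes the linear (Pearson) correlation
from the three rows of the table. The function and variable names are
chosen here for illustration only; the data are the table values.

```python
from math import sqrt

# Values from the table above, in the order
# HP 9000/720, IBM RS6000/550, DEC 5000/200.
specint89 = [39.0, 34.5, 19.0]
dhrystone_mips = [57.0, 56.0, 24.2]
nsieve_mips = [50.2, 43.8, 17.0]


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


print(f"SPECint89 vs Dhrystone 1.1: {pearson(specint89, dhrystone_mips):.3f}")
print(f"SPECint89 vs nsieve:        {pearson(specint89, nsieve_mips):.3f}")
print(f"Dhrystone 1.1 vs nsieve:    {pearson(dhrystone_mips, nsieve_mips):.3f}")
```

With only three points per series, a high coefficient is easy to obtain,
so this confirms the arithmetic rather than the generality of the result.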

The details though are different. They are different because there is
error in all those measurements. The compilers are not the same. The
compiler options are not the same. The programs and what they do are
all different. Cache size is a factor too. The SPECint89 results are
a geometric mean of 4 programs while the Dhrystone and nsieve results
are each from a single program, not a mean at all (and thus more
susceptible to error). In view of these errors, it is amazing to me
that the results are correlated at all!
But they are most definitely well correlated.

Because they are highly correlated doesn't mean you can pick numbers
out of the raw program results above and start making comparisons or
predictions. It just won't work because there are unaccounted errors
in each number and between the different programs. Even worse is to
take ratios like the MIPS/SPECint89. If there is error in the MIPS
result and error in the SPECint89 results then the fractional error
after the division is even worse than the fractional error in the
original numbers. For example, (40 +/- 6) / (20 +/- 3) = 2 +/- 0.6
(approximately, taking the worst case). The fractional error in each
original number is 0.15 (15%), but it doubles to 0.30 (30%) after the
division. So
you see that the ratio is an even less reliable number to use for
comparison or prediction purposes, and particularly so because you were
using the raw data (program or 'benchmark' results) of which you don't
even know the error bounds. If you did have the error bounds for the
ratios then you might have realized that you really could draw no
conclusion at all, and this is another reason why I think we need to
start understanding the errors in our measurements. It will help us
avoid drawing incorrect conclusions.
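
The arithmetic behind the (40 +/- 6) / (20 +/- 3) example can be
sketched in a few lines of Python. This is only an illustration: the
function name is hypothetical, and the quadrature figure is an added
assumption (it applies when the two errors are independent and random),
not something taken from the discussion above. The worst-case figure is
the one that corresponds to the 30% quoted in the text.

```python
from math import sqrt


def ratio_with_error(a, da, b, db):
    """Return (ratio, worst_case_error, quadrature_error) for a / b.

    da and db are the absolute errors on a and b.
    """
    ratio = a / b
    frac_a, frac_b = da / a, db / b            # fractional errors of the inputs
    worst = frac_a + frac_b                    # first-order worst case: they add
    quad = sqrt(frac_a ** 2 + frac_b ** 2)     # independent errors: quadrature
    return ratio, ratio * worst, ratio * quad


ratio, worst, quad = ratio_with_error(40.0, 6.0, 20.0, 3.0)
print(f"ratio = {ratio:.1f} +/- {worst:.1f} (worst case)")
print(f"ratio = {ratio:.1f} +/- {quad:.2f} (quadrature, independent errors)")
```

Either way, the fractional error of the ratio is larger than the 15%
fractional error of each input.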

Taking the ratio, MIPS/SPECint89, destroyed the correlation and led you
to draw an erroneous conclusion about your 6 data samples. I noticed
others using ratios of the above type, but it is simply not correct to
do so.

The correct procedure is to take the data samples (benchmark results
which have random errors) and do a correlation. A linear correlation
worked well so we can go with that. The linear correlation between
nsieve MIPS and SPECint89 was quite strong at 0.999 so we'll go with
that. Now we can do a linear least-squares fit to derive a linear
relationship between the nsieve MIPS and SPECint89 samples we had to
work with. We find the following:

                SPECint89 = 8.806 + 0.595 * nsieveMIPS.

                SPECint89 Predicted    SPECint89      Error
                   from nsieve          Measured
HP 9000/720          38.7                 39           -0.3
IBM RS6000/550       34.9                 34.5         +0.4
DEC 5000/200         18.9                 19           -0.1
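
The fit above can be reproduced with an ordinary least-squares line
through the three points. The sketch below is a plain-Python
illustration; the helper name fit_line and the variable names are
assumptions made here, and the "Error" column is taken to mean
predicted minus measured, as in the table.

```python
systems = ["HP 9000/720", "IBM RS6000/550", "DEC 5000/200"]
nsieve_mips = [50.2, 43.8, 17.0]   # peak nsieve MIPS from the earlier table
specint89 = [39.0, 34.5, 19.0]     # measured SPECint89


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(nsieve_mips, specint89)
print(f"SPECint89 = {intercept:.3f} + {slope:.3f} * nsieveMIPS")

# Predicted value, measured value, and error (predicted - measured).
for name, x, measured in zip(systems, nsieve_mips, specint89):
    predicted = intercept + slope * x
    print(f"{name:<16}{predicted:8.1f}{measured:8.1f}{predicted - measured:+8.1f}")
```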

Pretty interesting. Also note that I used the best (peak) values for
the nsieve numbers. This seemed ok since it seems to me people tend
to frequently report peak values for benchmark results anyway.

We can do the same thing for the Dhrystone and SPECint89 numbers:

                SPECint89 = 5.571 + 0.5524 * Dhrystone1.1MIPS.

                SPECint89 Predicted    SPECint89      Error
                from Dhrystone 1.1      Measured
HP 9000/720          37.1                 39           -1.9
IBM RS6000/550       36.5                 34.5         +2.0
DEC 5000/200         18.9                 19           -0.1

Not as good as nsieve, but still not bad as the error is less than 6%.
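
As a quick check of the "less than 6%" figure, the sketch below simply
applies the Dhrystone equation as stated above (it does not refit it)
and expresses each error as a percentage of the measured SPECint89
value. The names used are illustrative only.

```python
# Coefficients exactly as given in the equation above.
INTERCEPT, SLOPE = 5.571, 0.5524

# system -> (Dhrystone 1.1 MIPS, measured SPECint89), from the tables above
results = {
    "HP 9000/720": (57.0, 39.0),
    "IBM RS6000/550": (56.0, 34.5),
    "DEC 5000/200": (24.2, 19.0),
}

predictions = {}
pct_errors = {}
for name, (dhry_mips, measured) in results.items():
    predicted = INTERCEPT + SLOPE * dhry_mips
    predictions[name] = predicted
    # Error relative to the measured value, in percent.
    pct_errors[name] = 100.0 * (predicted - measured) / measured
    print(f"{name:<16}{predicted:8.1f}{measured:8.1f}{pct_errors[name]:+8.1f}%")

print(f"largest error: {max(abs(p) for p in pct_errors.values()):.1f}%")
```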

Please note that the correlations and relationships established above
are really _only_ valid for the 9 data samples we had to work with. It
would be erroneous to take any other results and throw them into the
equations and think those results were correct. They probably won't be.
We have not done enough work for that. Besides, it was already indicated
that the correlation breaks down as the sample size increases.

My main concern is that we do things correctly. I think that we really
need to start understanding the errors in our measurements (benchmark
results). Until we do I think we are just going to keep making lots of
mistakes, and blatant errors, with those measurements. We are really
on shaky ground when we compare benchmark results and have no idea
of the magnitude of the error in those measurements.

Al Aburto
abu...@marlin.nosc.mil

-------


 