LINEST with r2 = -1.18 ???

3207 views

Richard Nolan

Sep 25, 2001, 5:43:57 PM9/25/01
to
Having used LINEST for Linear regression, I think
successfully a few times, I now have a data set that
returns an r2 value of -1.18, which is not possible. I can
look at the data and tell r2 must be +, not negative.

Are there two logic problems with LINEST. (a) r2 can never
greater than +/- 1, and (b) I can see the relationship is
+, not -.

Harlan Grove

Sep 25, 2001, 4:39:28 PM9/25/01
to
Richard Nolan <rno...@mbagolf.com> wrote...

Well, r2 stand for 'R-squared', and any real number squared is perforce
nonnegative (0^2 = 0, so not necessarily positive). This isn't just a wrong
answer in context, it's a mathematically impossible result. Unfortunately,
mathematically impossible results are not infrequent results from Excel's
statistical functions.

Given your LINEST X and Y values, you could calculate r2 directly as

=(MMULT(
MMULT(
MMULT(TRANSPOSE(Y),X),
MINVERSE(MMULT(TRANSPOSE(X),X))
),
MMULT(TRANSPOSE(X),Y)
)-COUNT(Y)*AVERAGE(Y)^2))/SUMPRODUCT(Y,Y)

If this also gives a negative result, show us your data (if it's not too
large).

Jerry W. Lewis

Sep 26, 2001, 9:39:44 AM9/26/01
to
LINEST() (also SLOPE(), INTERCEPT(), VAR(), STDEV(), LOGEST(), TREND(),
FORECAST(), etc.) uses a numerically unstable algorithm. With
challenging data (such as yours), rounding error has accumulated to the
point that none of its calculations (slope, intercept, etc.) can be
believed. In your case, you were lucky enough to get an impossible R^2,
so that it was obvious that there was a problem. There may still be a
problem even with data that give more reasonable R^2 values. These
problems with Excel's algorithms have been well documented for years
(cf. Sawitzki, 1994, "Report on the reliability of data analysis
systems" Comput. Statist. Data Anal. 18:289-301) yet Microsoft continues
to ignore them.

Harlan Grove's matrix formulation simply recreates the same problem.

DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in
Excel that are numerically reliable. For simple linear regression, use
the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),
STEYX(), etc.

slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)
intercept = AVERAGE(y) - slope*AVERAGE(x)
rsq = CORREL(y,x)^2
SSreg = rsq*DEVSQ(y)
SSresid = (1-rsq)*DEVSQ(y)
df = COUNT(y)-2
F = SSreg/SSresid*df
steyx = SQRT(SSresid/df)
se1 = steyx/SQRT(DEVSQ(x))
seb = steyx/SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))

This approach has the added advantage over LINEST that it allows missing
values in the data range. However that cuts both ways, because they
will give a wrong answer if there are data pairs where only x or y (but
not both) are missing.

Similarly, for univariate statistics use the following formulas instead
of VAR(), VARP(), STDEV(), and STDEVP()

var = DEVSQ(x)/(COUNT(x)-1)
varp = DEVSQ(x)/COUNT(x)
stdev = SQRT(var)
stdevp = SQRT(varp)

Since Microsoft has already programmed routines that would be superior
to their unstable routines, it is puzzling why they continue to maintin
redundant inferior code. The unstable formulas that Excel programed are
mathematically exact (with infinite precision), so my formulas will
agree with the Excel functions for non-challenging data sets. When they
disagree, the dedicated Excel functions are wrong.

There is no DEVSQA function, there is no hel for VARA(), VARPA(),
STDEVA(), or STDEVPA() other than doing those calculations manually.

If you are wedded to using LINEST(), then test to see if

STDEV(x) = SQRT(DEVSQ(x)/COUNT(x))
STDEV(y) = SQRT(DEVSQ(y)/COUNT(y))
PEARSON(y,x) = CORREL(y,x)

If all three of these are approximately true (say to at least 12 figures
each), then LINEST() can probably be believed for simple linear
regression. Figuring out when LINEST() can be believed for more complex
models is not so simple.

Jerry

Harlan Grove

Sep 26, 2001, 5:14:41 PM9/26/01
to
Jerry W. Lewis <JWLe...@mediaone.net> wrote...
...

>Harlan Grove's matrix formulation simply recreates the same problem.
>
>DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in
>Excel that are numerically reliable. For simple linear regression, use
>the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),
>STEYX(), etc.
>
>slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)
>intercept = AVERAGE(y) - slope*AVERAGE(x)
>rsq = CORREL(y,x)^2
>SSreg = rsq*DEVSQ(y)
>SSresid = (1-rsq)*DEVSQ(y)
>df = COUNT(y)-2
>F = SSreg/SSresid*df
>steyx = SQRT(SSresid/df)
>se1 = steyx/SQRT(DEVSQ(x))
>seb = steyx/SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))
...

I won't dispute that the direct calculation may have a problem with
precision. The matrix multiplies should be no big deal, but the matrix
inversion could very well be a big problem.

The formulas above (and the others offered) are OK for univariate analysis,
but they're no good for multivariate analysis. If there's more than one
independent variable, there's no practical alternative to matrix methods.

The only realistic conclusion to be drawn is that Excel just isn't adequate
for serious statistical analysis. No doubt there are superior ways to
calculate (X' X)^(-1) X', the heart of matrix-based multivariate linear
regression, than just plodding through the matrix operations one at a time.

This also calls into question the accuracy of ANY calculations using Excel's
MINVERSE on large matrices, especially sparse ones. Maybe Longre's
MINVERSE.EXT add-in function performs better with respect to loss of
precision.

Still, I'm curious to see what the OP comes up with as a result from the
direct matrix formula for R-squared.

Jerry W. Lewis

Sep 26, 2001, 9:10:24 PM9/26/01
to
The problem is summing squared values before the final subtraction in
Sum(x^2)-Sum(x)^2/2
for VAR(), STDEV(), etc., and in
Sum(x*y)-Sum(x)*Sum(y)/2
for the numerator of SLOPE(), PEARSON(), RSQ(), etc. The matrix
formulation still saves that subtraction for last, so my criticism is of
the algorithm, not the accuracy of MINVERSE().

With large data sets, or with large numbers that only change in low
order bits, most or even all of the relevant information may be lost
beyond the 15 figure accuracy of Excel's IEEE double-precision floating
point arithmatic. By first calculating the means and then squaring the
deviations from means in DEVSQ(), COVAR(), and CORREL(), rounding
problems are nearly eliminated.

For example, VAR(x,x+1) should be 0.5 regardless of x, but if x = 10^8,
then VAR() and the matrix approach must carry at least 18 figures to
give the correct answer while DEVSQ() only needs 9 figures to do it.

Jerry

Harlan Grove

Sep 26, 2001, 11:32:26 PM9/26/01
to
"Jerry W. Lewis" <JWLe...@mediaone.net> wrote...
>The problem is summing squared values before the final subtraction in
> Sum(x^2)-Sum(x)^2/2
>for VAR(), STDEV(), etc., and in
> Sum(x*y)-Sum(x)*Sum(y)/2
>for the numerator of SLOPE(), PEARSON(), RSQ(), etc. The matrix
>formulation still saves that subtraction for last, so my criticism is of
>the algorithm, not the accuracy of MINVERSE().

Truncation error! OK, the matrix formula I gave shortens to

R^2 = (Y' X (X' X)^(-1) X' Y - n y_bar^2) / Y' Y

Let U be a column vector of length n consisting of all ones. Then

y_bar^2 = Y' U U' Y / n^2

and

n y_bar^2 = Y' U U' Y / n

and letting W = U / n^(1/2),

n y_bar^2 = Y' W W' Y

so

R^2 = (Y' X (X' X)^(-1) X' Y - Y' W W' Y) / Y' Y

Last time I checked, matrix multiplication was distributive over matrix

R^2 = (Y' [X (X' X)^(-1) X' - W W'] Y') / Y' Y

putting the subtraction first rather than last. So create a defined name W
referring to the formula =SIGN(ROW(INDIRECT("1:"&COUNT(Y))))/SQRT(COUNT(Y)).
Then use the formula

=MMULT(
MMULT(
TRANSPOSE(Y),
MMULT(
MMULT(
X,
MINVERSE(MMULT(TRANSPOSE(X),X))
},
TRANSPOSE(X)
)-W*TRANSPOSE(W)
),
Y
)/SUMPRODUCT(Y,Y)

Is everyone happy now?

I just love people who point out flaws without offering to fix 'em.

Harlan Grove

Sep 27, 2001, 2:40:41 AM9/27/01
to
"Harlan Grove" <hrl...@aol.com> wrote...
...

> Then use the formula
>
>=MMULT(
> MMULT(
> TRANSPOSE(Y),
> MMULT(
> MMULT(
> X,
> MINVERSE(MMULT(TRANSPOSE(X),X))
> },
> TRANSPOSE(X)
> )-W*TRANSPOSE(W)
> ),
> Y
>)/SUMPRODUCT(Y,Y)
...

Oops! Not just typos, I also needed to remove the mean from the denominator
term. Might as well simplify things. Eliminate the defined name W -
unnecessary - and use the array formula

=MMULT(
TRANSPOSE(y),
MMULT(
MMULT(
x,
MMULT(
MINVERSE(MMULT(TRANSPOSE(x),x)),
TRANSPOSE(x)
)
)-1/n,
y
)
)/DEVSQ(y)

This _doesn't_ handle missing values, but it _should_ handle truncation
error.

Sep 27, 2001, 2:45:49 AM9/27/01
to
Man, what a refreshing thread! Nice to know I'm not alone in
frustration over the basic calcs.

It gets worse, sad to say. The binomial distribution is in bad shape;
take a simple case where we want the density (with respect to counting
measure) for a single flip of a coin coming up "heads":
=BINOMDIST(0,1,0.5,FALSE)
returns 0.5, as it should. Replace the first parameter with 1, and
again we get 0.5, as we should. Replace it with 0.3, say, and we
*should* get 0, not 0.5; replace it with any number less than 0, or
greater than 1, and we *should* get 0, not #NUM!

Now consider the distribution
=BINOMDIST(x,1,0.5,TRUE)

All is OK for 0 <= x <= 1, but the above should return 0 for x < 0, and
1 for x >= 1, not #NUM!.

Now try this:
=BINOMDIST(600,1200,0.5,TRUE)

We should get 0.5, not #NUM; seems MS's algorithm is trying to do a
combinatoric straight out, instead of using the incomplete beta
(already implemented) to do the calcs.

All of the other discrete distributions suffer similar problems; the
continuous ones have their own problems as well, if you look at them a
bit.

I can't help but wonder what software these folks use for financial
analysis. Surely it can't be Excel as it comes out of the box.

of Excel (Mac is in beta still, but XP is out). So these folks may
well respond to concerns, albeit slowly.

regards,

In article <3BB27CC7...@mediaone.net>, Jerry W. Lewis
<JWLe...@mediaone.net> wrote:

--
E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Harlan Grove

Sep 27, 2001, 4:54:52 AM9/27/01
to
...

>I can't help but wonder what software these folks use for financial
>analysis. Surely it can't be Excel as it comes out of the box.

<cynical rant!>
Once they own EVERYTHING, they won't need finanical software.

Why do you think they wanted to get Quicken?
</cynical rant!>

>of Excel (Mac is in beta still, but XP is out). So these folks may
>well respond to concerns, albeit slowly.

Added to colored worksheet tabs and full paths in headers/footers - a
compelling reason to spend a few hundred bucks on the upgrade. [Am I being
too sarcastic?]

Jerry W. Lewis

Sep 27, 2001, 8:06:24 AM9/27/01
to
For linear regression, the X'X matrix is

n Sum(x)
Sum(x) Sum(x^2)

and the X'y vector is

Sum(y)
Sum(xy)

with challenging data sets the truncation occurs before you attempt to
invert anything.

Jerry

Harlan Grove

Sep 27, 2001, 11:15:05 AM9/27/01
to

>For linear regression, the X'X matrix is
>
> n Sum(x)
> Sum(x) Sum(x^2)

It's only 2 by 2 - invert it.

| sum(x^2) -Sum(x) |
| -Sum(x) n |
--------------------------------
n Sum(x^2) - Sum(x)^2

Yes, that nasty denominator term has the truncation error. OK, now you've
got me going. I'm still irritated that you don't seem to think of offering
any sort of fix for this. I'd suspect the answer is analogous to replacing
the univariate variance formula

Sum(x^2)/n-Average(x)^2

with

Sum((x-Average(x))^2)/n

If X is an n by k matrix of the independent variables plus a column of 1s
for the constant term, then let Z be the n by k matrix in which each column
on all rows is the average of all rows in the corresponding column of X. In
other words, for U an n by 1 column vector of all 1s, Z = (U U' X) / n, or
in Excel terms, Z could be given by

=SIGN(ROW(x))*MMULT(TRANSPOSE(SIGN(ROW(x))),x)/ROWS(x)

Then use [(X - Z) + Z] in place of X.

(X' X)^(-1) X' y
= ([(X - Z) + Z]' [(X - Z) + Z])^(-1) [(X - Z) + Z]' Y
= ((X - Z)' (X - Z) + Z' (X - Z) + (X - Z)' Z + Z' Z)^(-1) ((X - Z)' Y + Z'
Y)
= ((X - Z)' (X - Z) + 2 X' Z - Z' Z)^(-1) ((X - Z)' Y + Z' Y)

At this point the Excel formula would get a bit long, but it'd make more
sense to wrap it in a udf anyway.

If this doesn't mitigate truncation error, what does?

Sep 27, 2001, 4:20:12 PM9/27/01
to
Oof, a bit too cynical, even for you. <g>
Nope, I have a glimmer of hope that things might be changing. I have
been writing them for many years now about the same persistent errors;
started in 1988 with STDEV and RANDBETWEEN (latter eventually fixed).
Seems a lot depends on who happens to be awake when we raise the
issues. And I don't know the best folks to target concerns to, though
I am sure as heck trying now through a current relationship with
Microsoft.

I'd suggest we just keep things alive in a polite way; Jerry mentioned
an article I didn't know of, but will have read by this weekend. There
must be a lot more out there like that; seems there was something
similar a few years back in "American Statistician". Why there are so
many disitinct algorithms within Excel to accomplish the same thing *in
principle* is puzzling (I have the routines for linear regression in
mind, among others); you'd think they would just do a single module
correctly.

Anyway, keep the faith. No question they have very talented coders, MS
just needs a willingness to acknowledge they made, and make, mistakes,
and are ready to act on feedback to correct them. *That* would be so
cool, and for me, show MS to be more attuned to customers than they
have been of yet. They are still aloof, when it comes to Excel, but
may be changing, at least if the Mac side of the business is any
indication.

regards,

In article <OVBFnHzRBHA.1528@tkmsftngp05>, Harlan Grove
<hrl...@aol.com> wrote:

--
E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Sep 27, 2001, 4:26:14 PM9/27/01
to
Jerry,
Lots of hot air so far; what have you got? Spell it out for us, please;
regards to this stuff.

Jerry W. Lewis

Sep 28, 2001, 9:04:17 AM9/28/01
to
Here are some references (including for completeness the one I sited
before):

- Sawitzki (1994) "Report on the reliability of data analysis

systems" Comput. Statist. Data Anal. 18:289-301

- Knüsel (1998) "On the accuracy of statistical distributions in
Microsoft Excel 97" Comput. Statist. Data Anal. 26:375-377

- McCullough & Wilson (1999) "On the accuracy of statistical
procedures in Microsoft Excel 97" Comput. Statist. Data Anal. 31:27-37.

These are the only things that Current Index to Statistics turns up for
the keyword "Excel" other than books and their reviews on doing
statistics with Excel (the books generally ignore the computational
problems, I can't say that I have read many of the reviews)

I don't recall the American Statistician article that you mention, but
when I get to the office I will start by checking out

- McCullough (1998) "Assessing the reliability of statistical
software: Part I" Amer. Statist. 52:358-366.

- McCullough (1999) "Assessing the reliability of statistical
software: Part II" Amer. Statist. 53:149-159.

John McKenzie of Babson College gave the following talk last month at
the 53rd ISI in Seoul Korea

- McKenzie (2001) "A critical review of how well reference manuals and
for statistical analysis"

Proceedings will be published, but details are sketchy on the meeting
web site

http://www.nso.go.kr/isi2001/

These references deal specifically with Excel. The algorithms and
criticisms of their numerical properties are much older than Microsoft.

I would be very interested in receiving other references.

Jerry

> ...

Sep 28, 2001, 11:18:30 AM9/28/01
to
Jerry,
bra...@rochester.rr.com?

TIA

in article 3BB4759E.3080707@no_e-mail.com, apparently Jerry W. Lewis at
post_a_reply@no_e-mail.com posted on 9/28/01 9:04 AM:

Harlan Grove

Sep 28, 2001, 12:31:40 PM9/28/01
to
...

>I don't recall the American Statistician article that you mention, but
>when I get to the office I will start by checking out
>
> - McCullough (1998) "Assessing the reliability of statistical
>software: Part I" Amer. Statist. 52:358-366.
>
> - McCullough (1999) "Assessing the reliability of statistical
>software: Part II" Amer. Statist. 53:149-159.
...

>These references deal specifically with Excel. The algorithms and
>criticisms of their numerical properties are much older than Microsoft.
...

Being blunt and directly to the point, what is the preferred method for
calculating multivariate linear regression coefficients? There must be some
best method to minimize truncation error while still using matrix methods.
What is it?

Criticism is fine. Saying 'Excel's built-in functions do that poorly' is
fine. But at some point reciting what Excel can't do ceases to be valuable.
At that point it's time to address work-arounds. You haven't. Either you
don't want to help the OP _solve_ the problem, or you don't know how to
solve it. Which is it?

Sep 28, 2001, 6:16:16 PM9/28/01
to
Harlan,
You crack me up. For *you*, of all people, to preface a remark as being
"blunt" is hilarious <vbg>. This thread has three oversized egos at work;

I'm going to start a new thread on distributions; I'll take them one-by-one,
down to the algorithm most appropriate for the distribution. First pass will

This is going to be, in a sense, "open-source" coding. There is a strong
engineering community here, as well as the math riff-raff like you <vbg>.
Let's just take the bull by the horns, OK?

Regards,

See new thread shortly entitled: "Discrete Distributions in Excel"

in article OQmEOSFSBHA.1972@tkmsftngp05, apparently Harlan Grove at
Hrl...@aol.com posted on 9/28/01 12:31 PM:

Harlan Grove

Sep 28, 2001, 4:53:25 PM9/28/01
to

>You crack me up. For *you*, of all people, to preface a remark as being
>"blunt" is hilarious <vbg>. . . .

However, I'm not too concerned with discrete distributions since they're not
relevant to the residuals from linear regression.

So . . . do you know the maximum precision algorithm for calculating
regression coefficients?

Sep 28, 2001, 8:05:27 PM9/28/01
to
Nope, I sure don't. But it's certainly worth a search for; not sure actually
that I would seek "maximum precision" as much as a solution that does better
than what we currently have; could also be that one would have to carry
around axtra doubles for synthetic division. Clueless on this one. But I
certainly *do* wish there was internal consistency among all the regression
routines in Excel.

Regards,

in article uA9oUkHSBHA.1368@tkmsftngp04, apparently Harlan Grove at
Hrl...@aol.com posted on 9/28/01 4:53 PM:

Jerry W. Lewis

Sep 29, 2001, 2:43:55 PM9/29/01
to

Harlan Grove wrote:

> Being blunt and directly to the point, what is the preferred method for
> calculating multivariate linear regression coefficients? There must be some
> best method to minimize truncation error while still using matrix methods.
> What is it?

I would have to research the topic to give an informed answer. For the
algorithms published in Applied Statistics, most of which can be found
on Statlib.

http://lib.stat.cmu.edu/

I have looked at the basic statistical calculations in Excel because
there is value added in teaching my clients to do some of that for
themselves. For more challenging work I use dedicated statistical
developers. Since David seems to think that MS might actually listen
this time, I might be willing to spend some time on it.

> Criticism is fine. Saying 'Excel's built-in functions do that poorly' is
> fine. But at some point reciting what Excel can't do ceases to be valuable.
> At that point it's time to address work-arounds. You haven't.

Actually this thread began on simple linear regression, and suggesting
workarounds by choosing Excel functions that do work well is exactly how

Jerry

Harlan Grove

Sep 29, 2001, 3:15:48 PM9/29/01
to
...
>Actually this thread began on simple linear regression, and suggesting
>workarounds by choosing Excel functions that do work well is exactly how

Yes, and I responded that they were only appropriate for problems with a
single independent variable, and from that point the discussion focussed on
matrix methods, and you haven't constributed anything positive since then.

If the OP has a single independent variable, this is a moot discussion -
your alternatie formulas suffice. On the other hand, if the OP has multiple
independent variables, then something more is needed. If truncation error is
the primary problem, I'd expect the refinement to the standard matrix
formula I offered a day or two ago would suffice for all but the most
pathological data.

Harlan Grove

Oct 1, 2001, 12:30:29 PM10/1/01
to
...
>Now try this:
>=BINOMDIST(600,1200,0.5,TRUE)
>
>We should get 0.5, not #NUM; seems MS's algorithm is trying to do a
>combinatoric straight out, instead of using the incomplete beta
>(already implemented) to do the calcs.
...

OK, I give up on the rest, but you're wrong here. It shouldn't be #NUM!, but
it's not 0.5. You need an odd n parameter and x = INT(n / 2) to get a
cumulative value of 0.5, e.g., BINOMDIST(5,11,0.5,True) = 0.5,
BINOMDIST(6,12,0.5,True) > 0.5. x = 0 has positive probability, and that
means for n even the rv x can take on n+1 (and odd number of) states.

You get to be right in the large, and I get to poke holes in your examples.
Happy?

BTW, R has no problem with your example: pbinom(600,1200,0.5) = 0.5115141
and pbinom(600,1201,0.5) = 0.5. Which leads me to wonder whether a set of
wrapper functions around the R binary wouldn't be expedient. For that
matter, R's code is GPL'ed, so freely available.

Oct 1, 2001, 7:33:30 PM10/1/01
to
Harlan.
You are right; it shouldn't be 0.5. I meant to write "about 0.5". Sorry.

Now: what the heck is "R"?!? Sort of a S-Plus like language? Compilable to
something we can tie into Excel?

Keep poking the holes, please. I want to get this *right*.

Did you try what I posted? It's simple, seems to work over all longs, but I
haven't pushed it that much; it always dominates Excel's approach, but may
need help in pathologic cases.

Regards, and many, many thanks.

in article uGkjc\$qSBHA.460@tkmsftngp04, apparently Harlan Grove at
Hrl...@aol.com posted on 10/1/01 12:30 PM:

Jerry W. Lewis

Oct 2, 2001, 2:13:26 AM10/2/01
to

> Now: what the heck is "R"?!? Sort of a S-Plus like language?

S is a language developed at Bell Labs. S-Plus is a commercial
(Insightful Corp.) implementation of (currently) v4 of the S language.
R is a free implementation of v3 of the S language. Each has its own
extensions to the S language.

> Compilable to something we can tie into Excel?

Harlan?

Jerry

Harlan Grove

Oct 2, 2001, 2:30:21 AM10/2/01
to
...
>>Compilable to something we can tie into Excel?
>
>Harlan?

I thought I stated that it was GPL'ed. The source code it freely available.
If nothing else, it could be lifted and translated into your language of
choice. Maybe a big translation job, but the 'wheel' may already be ready
for use rather than needing to be reinvented. But I'm certain that if we
came up with a derivative stats library based on GPL'ed code MS would have a
cow. So this must be a good idea!

Easy enough to check it out if you have a Win32 system.

Oct 5, 2001, 8:04:08 AM10/5/01
to
In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis

I know what S is. Have used S-Plus for years, though not of late, as I
no longer have a platform that will support it. I was trying to ask
what "R" is. Any idea? Also, what is GPL? General Public License?

TIA
Dave

Jerry W. Lewis

Oct 5, 2001, 8:49:28 AM10/5/01
to
Look in the R section on Statlib

http://lib.stat.cmu.edu/R/CRAN/

There appear to be R versions for Mac OS 8, 9, & X

Jerry

Harlan Grove

Oct 5, 2001, 8:53:46 AM10/5/01
to
...
> . . . Also, what is GPL? General Public License?

Yes, though there's an implicit prefix. The full name is the GNU General
Public License. R is a GNU project.

Oct 5, 2001, 2:49:23 PM10/5/01
to
Thanks, Jerry! Will do.

In article <3BBDACB7.2050807@no_e-mail.com>, Jerry W. Lewis

> Look in the R section on Statlib
>
> http://lib.stat.cmu.edu/R/CRAN/
>
> There appear to be R versions for Mac OS 8, 9, & X
>
> Jerry
>
>
> > In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis
> >
> >
> >>
> >>
> >>>Now: what the heck is "R"?!? Sort of a S-Plus like language?
> >>>
> >>
> >>S is a language developed at Bell Labs. S-Plus is a commercial
> >>(Insightful Corp.) implementation of (currently) v4 of the S language.
> >>R is a free implementation of v3 of the S language. Each has its own
> >>extensions to the S language.
> >>
> >> > Compilable to something we can tie into Excel?
> >>
> >>Harlan?
> >>
> >
> > I know what S is. Have used S-Plus for years, though not of late, as I
> > no longer have a platform that will support it. I was trying to ask
> > what "R" is. Any idea? Also, what is GPL? General Public License?
> >
> > TIA
> > Dave
> >
> >

--
E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Oct 5, 2001, 2:53:48 PM10/5/01
to

Jerry,
This is being fostered by Jan. Excellent!!! Check out his own pages
when you get a chance. Mac guy to the core. <vbg>

Dave

In article <3BBDACB7.2050807@no_e-mail.com>, Jerry W. Lewis

> Look in the R section on Statlib
>
> http://lib.stat.cmu.edu/R/CRAN/
>
> There appear to be R versions for Mac OS 8, 9, & X
>
> Jerry
>
>
> > In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis
> >
> >
> >>
> >>
> >>>Now: what the heck is "R"?!? Sort of a S-Plus like language?
> >>>
> >>
> >>S is a language developed at Bell Labs. S-Plus is a commercial
> >>(Insightful Corp.) implementation of (currently) v4 of the S language.
> >>R is a free implementation of v3 of the S language. Each has its own
> >>extensions to the S language.
> >>
> >> > Compilable to something we can tie into Excel?
> >>
> >>Harlan?
> >>
> >
> > I know what S is. Have used S-Plus for years, though not of late, as I
> > no longer have a platform that will support it. I was trying to ask
> > what "R" is. Any idea? Also, what is GPL? General Public License?
> >
> > TIA
> > Dave
> >
> >

--
E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

rno...@mbagolf.com

Oct 5, 2001, 3:14:23 PM10/5/01
to
Since I started this thread, maybe I can finish it.
R2 is a statistical value called the "Coefficient of
Determination" It is one of many provided when using the
MS function LINEST. It tells how good a linear regression
line can be predicted by the original data. So, If R2 is 1
or -1 the relationship is perfect. If the R2 is 0 there is
no relationship. R2 can never be greater than +/- 1. My
original data, using LINEST produced an R2 value of -1.18,
so I started this discussion.
>.
>

Harlan Grove

Oct 5, 2001, 11:35:37 PM10/5/01
to
<rno...@mbagolf.com> wrote...
...

>R2 is a statistical value called the "Coefficient of
>Determination" It is one of many provided when using the
>MS function LINEST. It tells how good a linear regression
>line can be predicted by the original data. So, If R2 is 1
>or -1 the relationship is perfect. If the R2 is 0 there is
>no relationship. R2 can never be greater than +/- 1. My
>original data, using LINEST produced an R2 value of -1.18,
>so I started this discussion.
...

You're confusing R^2 with correlation. R^2 = ESS / TSS, explanatory sum of
squares divided by total sum of squares. The other piece is RSS, residual
sum of squares, and it fits in TSS = ESS + RSS. As a ratio of sums of
squares of real numbers (I'm assuming you're not performing linear
regression on complex data), R^2 _CAN'T_ be negative. Mathematically
impossible. An R^2 of 1.0 means you've got nonstochastic linear data. An R^2
of 0 means your data, stochastic or not, is perfectly nonlinear (e.g.,
fitting a full period sine wave).

What you've got is a very bad case of truncation error. How many data
points? What are the largest and smallest X values?

And the fundamental point of the thread is: Excel does stats poorly. If you
want to do stats well, you'll either need to roll your own stats procedures
in Excel or you'll have to use something else.

Jerry W. Lewis

Oct 6, 2001, 9:01:34 AM10/6/01
to

Harlan Grove wrote:

> <rno...@mbagolf.com> wrote...
> ...

> And the fundamental point of the thread is: Excel does stats poorly. If you
> want to do stats well, you'll either need to roll your own stats procedures
> in Excel or you'll have to use something else.

That is overstating sligtly, but not much. The general user has a right
to expect that RSQ(), SLOPE(), LINEST(), etc. will give the answers that
the functions claim to give. As you correctly point out, there are
circumstances where their answers are completely wrong, and the average
user has no basis for recognizing such circumstances.

My original post in this thread noted that DEVSQ(), CORREL() and COVAR()
do a very good job of controlling rounding error. If MS had based their
VAR, STD, and regression function series on DEVSQ() CORREL() and COVAR()
(cf. simple relations in post) then this thread (and far too many
similar ones in the past) would never have occurred.

Hopefully MS will finally commit to producing a quality product and will
eliminate the use if algorithms that were known to be inferior long
before Excel was a gleam in Bill's eye.

Jerry

Harlan Grove

Oct 8, 2001, 3:00:44 AM10/8/01
to
...

>Hopefully MS will finally commit to producing a quality product and will
>eliminate the use if algorithms that were known to be inferior long
>before Excel was a gleam in Bill's eye.

How long have the inferior Excel functions been around? How long have
articles in the statistical journals cataloged their failings?

The nice people in Redmond have thick hides and very large egos indeed. They
don't admit mistakes. Against the corporate culture. Indeed, I don't think
resulting from QUOTIENT(-1,5) and not comparing equal to (positive) zero has
confirmed their thinking (at least for me). They want benchmark figures.
Benchmark figures don't include accuracy of calculation results. No points
for deducing that fast inaccurate algorithms would be favored over slower
accurate ones.

I just assumed that the Excel programmers were a bunch of CS degree holders
with no clue about statistics, so they understood that, e.g., E[X^2] -
E[X]^2 would take only one loop while E[(X-E[X])^2] would take two, so it
was nothing more than understanding how to count op codes while simply being
ignorant of precision loss.

The negative zero problem shows either willful ignorance (they bury their
precision/accuracy if it affects speed. Intel FPU hardware is ANSI/IEEE
754-compliant with regard to negative zero and positive zero testing equal.
Bizarrely, Excel isn't, which means that Excel, at least at times, isn't
using floating point hardware for numerical comparisons. Why? To shave
execution time. Do I think they know about the +0 <> -0 problem? No, I'll be
required - ignoring the sign bits when all other bits are zero - would make
it more execution-efficient simply to use FPU compares. That would, however,
harm Excel's benchmarks, so I'll bet it's never fixed.

Actually, what really opened my eyes was QUOTIENT(-1,5) producing negative
zero in the first place. The formula =TRUNC(numerator/denominator) gives a
better result, giving positive zero when ABS(numerator) < ABS(denominator)
no matter what the respective signs are. So MS isn't using this rather
breathtakingly simple approach. No, they're almost certainly using a series
of low-level bit-wise operations. This goes beyond ignorance. This enters
the realm of gross negligence. At least they've covered their butts with
their disclaimers of any warranties. They really do mean it when they say
that Excel isn't warranted to be good for anything.

Oct 9, 2001, 7:10:08 PM10/9/01
to
In article <OcjVBW8TBHA.1880@tkmsftngp07>, Harlan Grove
<Hrl...@aol.com> wrote:

> Jerry W. Lewis <post_a_reply@no_e-mail.com> wrote...
> ...
> >Hopefully MS will finally commit to producing a quality product and will
> >eliminate the use if algorithms that were known to be inferior long
> >before Excel was a gleam in Bill's eye.
>
> How long have the inferior Excel functions been around? How long have
> articles in the statistical journals cataloged their failings?

Well, from the beginning; and every new stat function introduced has
been poorly done. Not sure about how far back the stat journal articles
go, but I alerted them in 1985 for the first problem. With Version 5
we got a slew of new ones; again alerted them, to no avail. Similar
experience for Jerry Lewis.

>
> The nice people in Redmond have thick hides and very large egos indeed.

Actually Harlan, they *are* nice. This isn't the result of a single
person's will; it is an organizational issue; perplexing, to be sure,
but I encounter a tremendous amount of good will within MS.

> They don't admit mistakes. Against the corporate culture. Indeed, I
> don't think they'd acknowledge these as mistakes.

I tend to agree with you as far as the Excel math functions go. For
other aspects of the product line, they are superb; security patches
come to mind.

<<snip big rant>>

> At least they've covered their butts with
> their disclaimers of any warranties. They really do mean it when they say
> that Excel isn't warranted to be good for anything.

Sheesh Harlan, lighten up. This isn't a "dark force" to deal with; it's
a huge (dis-)organization we are bringing around. I strongly urge you
to consider where things can go from here, rather than focus so
bitterly on how they have gone.

We *will* see changes; their people are now alerted to the issues that
we raise, and are moving forward to rectifying them. I strongly urge
you to save the "me versus Big Brother" rant for another venue; let's
just work on cataloging problems we find, so we can get the job done.

OK by you?