
Sep 25, 2001, 5:43:57 PM

Having used LINEST for linear regression successfully a few times (I think), I now have a data set that returns an r2 value of -1.18, which is not possible. I can look at the data and tell that r2 must be positive, not negative.


Are there two logic problems with LINEST? (a) r2 can never be greater than 1 in magnitude, and (b) I can see the relationship is positive, not negative.

Sep 25, 2001, 4:39:28 PM

Richard Nolan <rno...@mbagolf.com> wrote...

Well, r2 stands for 'R-squared', and any real number squared is perforce

nonnegative (0^2 = 0, so not necessarily positive). This isn't just a wrong

answer in context, it's a mathematically impossible result. Unfortunately,

mathematically impossible results are not infrequent results from Excel's

statistical functions.

Given your LINEST X and Y values, you could calculate r2 directly as

=(MMULT(
  MMULT(
    MMULT(TRANSPOSE(Y),X),
    MINVERSE(MMULT(TRANSPOSE(X),X))
  ),
  MMULT(TRANSPOSE(X),Y)
)-COUNT(Y)*AVERAGE(Y)^2)/SUMPRODUCT(Y,Y)

If this also gives a negative result, show us your data (if it's not too

large).

Sep 26, 2001, 9:39:44 AM

LINEST() (also SLOPE(), INTERCEPT(), VAR(), STDEV(), LOGEST(), TREND(),

FORECAST(), etc.) uses a numerically unstable algorithm. With

challenging data (such as yours), rounding error has accumulated to the

point that none of its calculations (slope, intercept, etc.) can be

believed. In your case, you were lucky enough to get an impossible R^2,

so that it was obvious that there was a problem. There may still be a

problem even with data that give more reasonable R^2 values. These

problems with Excel's algorithms have been well documented for years

(cf. Sawitzki, 1994, "Report on the reliability of data analysis

systems" Comput. Statist. Data Anal. 18:289-301) yet Microsoft continues

to ignore them.


Harlan Grove's matrix formulation simply recreates the same problem.

DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in

Excel that are numerically reliable. For simple linear regression, use

the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),

STEYX(), etc.

slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)

intercept = AVERAGE(y) - slope*AVERAGE(x)

rsq = CORREL(y,x)^2

SSreg = rsq*DEVSQ(y)

SSresid = (1-rsq)*DEVSQ(y)

df = COUNT(y)-2

F = SSreg/SSresid*df

steyx = SQRT(SSresid/df)

se1 = steyx/SQRT(DEVSQ(x))

seb = steyx*SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))

This approach has the added advantage over LINEST that it allows missing values in the data range. However, that cuts both ways: these formulas will give a wrong answer if there are data pairs where only x or y (but not both) is missing.
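The identities above are easy to sanity-check outside Excel. Below is a minimal Python sketch of the same centered-sums approach (my own translation for illustration, not anything posted in this thread; note that Excel's COVAR() is the population covariance, i.e. divided by n, which is why the slope formula multiplies by COUNT(y)):

```python
# Stable simple-linear-regression statistics via centered sums,
# mirroring the DEVSQ/COVAR/CORREL-based formulas above.
def simple_regression(x, y):
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    devsq_x = sum((xi - mx) ** 2 for xi in x)          # DEVSQ(x)
    devsq_y = sum((yi - my) ** 2 for yi in y)          # DEVSQ(y)
    covar = sum((xi - mx) * (yi - my)
                for xi, yi in zip(x, y)) / n           # COVAR(y,x), population form
    slope = covar / devsq_x * n
    intercept = my - slope * mx
    rsq = (covar * n) ** 2 / (devsq_x * devsq_y)       # CORREL(y,x)^2
    ss_resid = (1 - rsq) * devsq_y
    steyx = (ss_resid / (n - 2)) ** 0.5                # STEYX equivalent
    return slope, intercept, rsq, steyx
```

On a perfectly linear data set such as x = 1..4, y = 2x, this returns slope 2, intercept 0, rsq 1, and steyx 0.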

Similarly, for univariate statistics use the following formulas instead

of VAR(), VARP(), STDEV(), and STDEVP()

var = DEVSQ(x)/(COUNT(x)-1)

varp = DEVSQ(x)/COUNT(x)

stdev = SQRT(var)

stdevp = SQRT(varp)

Since Microsoft has already programmed routines that would be superior to their unstable routines, it is puzzling why they continue to maintain redundant inferior code. The unstable formulas that Excel programmed are mathematically exact (with infinite precision), so my formulas will agree with the Excel functions for non-challenging data sets. When they disagree, the dedicated Excel functions are wrong.

There is no DEVSQA function, so there is no help for VARA(), VARPA(), STDEVA(), or STDEVPA() other than doing those calculations manually.

If you are wedded to using LINEST(), then test to see if

STDEV(x) = SQRT(DEVSQ(x)/(COUNT(x)-1))

STDEV(y) = SQRT(DEVSQ(y)/(COUNT(y)-1))

PEARSON(y,x) = CORREL(y,x)

If all three of these are approximately true (say to at least 12 figures

each), then LINEST() can probably be believed for simple linear

regression. Figuring out when LINEST() can be believed for more complex

models is not so simple.

Jerry

Sep 26, 2001, 5:14:41 PM

Jerry W. Lewis <JWLe...@mediaone.net> wrote...

...

>Harlan Grove's matrix formulation simply recreates the same problem.

>

>DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in

>Excel that are numerically reliable. For simple linear regression, use

>the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),

>STEYX(), etc.

>

>slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)

>intercept = AVERAGE(y) - slope*AVERAGE(x)

>rsq = CORREL(y,x)^2

>SSreg = rsq*DEVSQ(y)

>SSresid = (1-rsq)*DEVSQ(y)

>df = COUNT(y)-2

>F = SSreg/SSresid*df

>steyx = SQRT(SSresid/df)

>se1 = steyx/SQRT(DEVSQ(x))

>seb = steyx*SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))

...


I won't dispute that the direct calculation may have a problem with

precision. The matrix multiplies should be no big deal, but the matrix

inversion could very well be a big problem.

The formulas above (and the others offered) are OK for univariate analysis,

but they're no good for multivariate analysis. If there's more than one

independent variable, there's no practical alternative to matrix methods.

The only realistic conclusion to be drawn is that Excel just isn't adequate

for serious statistical analysis. No doubt there are superior ways to

calculate (X' X)^(-1) X', the heart of matrix-based multivariate linear

regression, than just plodding through the matrix operations one at a time.

This also calls into question the accuracy of ANY calculations using Excel's

MINVERSE on large matrices, especially sparse ones. Maybe Longre's

MINVERSE.EXT add-in function performs better with respect to loss of

precision.

Still, I'm curious to see what the OP comes up with as a result from the

direct matrix formula for R-squared.

Sep 26, 2001, 9:10:24 PM

The problem is summing squared values before the final subtraction in

Sum(x^2)-Sum(x)^2/n

for VAR(), STDEV(), etc., and in

Sum(x*y)-Sum(x)*Sum(y)/n

for the numerator of SLOPE(), PEARSON(), RSQ(), etc. The matrix

formulation still saves that subtraction for last, so my criticism is of

the algorithm, not the accuracy of MINVERSE().


With large data sets, or with large numbers that only change in low

order bits, most or even all of the relevant information may be lost

beyond the 15 figure accuracy of Excel's IEEE double-precision floating

point arithmetic. By first calculating the means and then squaring the

deviations from means in DEVSQ(), COVAR(), and CORREL(), rounding

problems are nearly eliminated.

For example, VAR(x,x+1) should be 0.5 regardless of x, but if x = 10^8,

then VAR() and the matrix approach must carry at least 18 figures to

give the correct answer while DEVSQ() only needs 9 figures to do it.
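That 10^8 example is easy to reproduce in any IEEE double-precision environment. A Python sketch for illustration (not Excel's actual code, but the same one-pass formula it reportedly uses):

```python
# Two ways to compute the sample variance of {x, x+1} at x = 10^8.
x = [1e8, 1e8 + 1]
n = len(x)

# One-pass textbook formula: (Sum(x^2) - Sum(x)^2/n) / (n-1).
# x^2 is ~1e16, right at the edge of a double's 15-16 significant
# digits, so the final subtraction cancels nearly every retained digit.
one_pass = (sum(xi * xi for xi in x) - sum(x) ** 2 / n) / (n - 1)

# Two-pass, DEVSQ-style: subtract the mean first, then square the
# small deviations (+/- 0.5), which are represented exactly.
mean = sum(x) / n
two_pass = sum((xi - mean) ** 2 for xi in x) / (n - 1)

print(one_pass, two_pass)  # two_pass is exactly 0.5; one_pass is not
```

The deviations from the mean need only a few significant figures, so the two-pass route is exact here while the one-pass route has nothing left after the cancellation.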

Jerry

Sep 26, 2001, 11:32:26 PM

"Jerry W. Lewis" <JWLe...@mediaone.net> wrote...

>The problem is summing squared values before the final subtraction in

> Sum(x^2)-Sum(x)^2/n

>for VAR(), STDEV(), etc., and in

> Sum(x*y)-Sum(x)*Sum(y)/n

>for the numerator of SLOPE(), PEARSON(), RSQ(), etc. The matrix

>formulation still saves that subtraction for last, so my criticism is of

>the algorithm, not the accuracy of MINVERSE().


Truncation error! OK, the matrix formula I gave shortens to

R^2 = (Y' X (X' X)^(-1) X' Y - n y_bar^2) / Y' Y

Let U be a column vector of length n consisting of all ones. Then

y_bar^2 = Y' U U' Y / n^2

and

n y_bar^2 = Y' U U' Y / n

and letting W = U / n^(1/2),

n y_bar^2 = Y' W W' Y

so

R^2 = (Y' X (X' X)^(-1) X' Y - Y' W W' Y) / Y' Y

Last time I checked, matrix multiplication was distributive over matrix

addition, so

R^2 = (Y' [X (X' X)^(-1) X' - W W'] Y) / Y' Y

putting the subtraction first rather than last. So create a defined name W

referring to the formula =SIGN(ROW(INDIRECT("1:"&COUNT(Y))))/SQRT(COUNT(Y)).

Then use the formula

=MMULT(

MMULT(

TRANSPOSE(Y),

MMULT(

MMULT(

X,

MINVERSE(MMULT(TRANSPOSE(X),X))

),

TRANSPOSE(X)

)-W*TRANSPOSE(W)

),

Y

)/SUMPRODUCT(Y,Y)

Is everyone happy now?

I just love people who point out flaws without offering to fix 'em.

Sep 27, 2001, 2:40:41 AM

"Harlan Grove" <hrl...@aol.com> wrote...

...

> Then use the formula

>

>=MMULT(

> MMULT(

> TRANSPOSE(Y),

> MMULT(

> MMULT(

> X,

> MINVERSE(MMULT(TRANSPOSE(X),X))

> ),

> TRANSPOSE(X)

> )-W*TRANSPOSE(W)

> ),

> Y

>)/SUMPRODUCT(Y,Y)

...


Oops! Not just typos, I also needed to remove the mean from the denominator

term. Might as well simplify things. Eliminate the defined name W -

unnecessary - and use the array formula

=MMULT(

TRANSPOSE(y),

MMULT(

MMULT(

x,

MMULT(

MINVERSE(MMULT(TRANSPOSE(x),x)),

TRANSPOSE(x)

)

)-1/COUNT(y),

y

)

)/DEVSQ(y)

This _doesn't_ handle missing values, but it _should_ handle truncation

error.
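For well-behaved data this centered matrix form agrees with CORREL(y,x)^2. Here is a pure-Python spot check of the identity R^2 = y'(H - J/n)y / DEVSQ(y), with H the hat matrix and J the all-ones matrix (a sketch under my own naming; MMULT/MINVERSE are replaced by explicit loops and a hand-coded 2x2 inverse, so it only covers the one-regressor-plus-constant case):

```python
# Spot check: R^2 = y'(H - J/n) y / DEVSQ(y), with H = X (X'X)^(-1) X'
# the hat matrix and J the all-ones matrix.  X = [1 | x], so (X'X)^(-1)
# is a 2x2 inverse done by hand.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
n = len(xs)
X = [[1.0, xi] for xi in xs]

sx = sum(xs)
sxx = sum(xi * xi for xi in xs)
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det],
       [-sx / det, n / det]]                       # (X'X)^(-1)

H = [[sum(X[i][a] * inv[a][b] * X[j][b]
          for a in range(2) for b in range(2))
      for j in range(n)] for i in range(n)]        # hat matrix

ymean = sum(ys) / n
devsq_y = sum((yi - ymean) ** 2 for yi in ys)
quad = sum(ys[i] * (H[i][j] - 1.0 / n) * ys[j]
           for i in range(n) for j in range(n))
r2_matrix = quad / devsq_y

# Cross-check against the CORREL(y,x)^2 closed form
xmean = sum(xs) / n
sxy = sum((xi - xmean) * (yi - ymean) for xi, yi in zip(xs, ys))
devsq_x = sum((xi - xmean) ** 2 for xi in xs)
r2_correl = sxy * sxy / (devsq_x * devsq_y)

print(r2_matrix, r2_correl)  # agree to machine precision
```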

Sep 27, 2001, 2:45:49 AM

Man, what a refreshing thread! Nice to know I'm not alone in

frustration over the basic calcs.


It gets worse, sad to say. The binomial distribution is in bad shape;

take a simple case where we want the density (with respect to counting

measure) for a single flip of a coin coming up "heads":

=BINOMDIST(0,1,0.5,FALSE)

returns 0.5, as it should. Replace the first parameter with 1, and

again we get 0.5, as we should. Replace it with 0.3, say, and we

*should* get 0, not 0.5; replace it with any number less than 0, or

greater than 1, and we *should* get 0, not #NUM!

Now consider the distribution

=BINOMDIST(x,1,0.5,TRUE)

All is OK for 0 <= x <= 1, but the above should return 0 for x < 0, and

1 for x >= 1, not #NUM!.

Now try this:

=BINOMDIST(600,1200,0.5,TRUE)

We should get 0.5, not #NUM; seems MS's algorithm is trying to do a

combinatoric straight out, instead of using the incomplete beta

(already implemented) to do the calcs.

All of the other discrete distributions suffer similar problems; the

continuous ones have their own problems as well, if you look at them a

bit.

I can't help but wonder what software these folks use for financial

analysis. Surely it can't be Excel as it comes out of the box.

Good news: the NORMSINV approximation is fixed with the latest versions

of Excel (Mac is in beta still, but XP is out). So these folks may

well respond to concerns, albeit slowly.

regards,

dave braden

In article <3BB27CC7...@mediaone.net>, Jerry W. Lewis

<JWLe...@mediaone.net> wrote:

--

E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Sep 27, 2001, 4:54:52 AM

"David J. Braden" <no...@ugotta.bekidding.com> wrote...

...

>I can't help but wonder what software these folks use for financial

>analysis. Surely it can't be Excel as it comes out of the box.


<cynical rant!>

Once they own EVERYTHING, they won't need financial software.

Why do you think they wanted to get Quicken?

</cynical rant!>

>Good news: the NORMSINV approximation is fixed with the latest versions

>of Excel (Mac is in beta still, but XP is out). So these folks may

>well respond to concerns, albeit slowly.

Added to colored worksheet tabs and full paths in headers/footers - a

compelling reason to spend a few hundred bucks on the upgrade. [Am I being

too sarcastic?]

Sep 27, 2001, 8:06:24 AM

For linear regression, the X'X matrix is

n Sum(x)

Sum(x) Sum(x^2)

and the X'y vector is

Sum(y)

Sum(xy)

With challenging data sets, the truncation occurs before you attempt to

invert anything.

Jerry

Sep 27, 2001, 11:15:05 AM

Jerry W. Lewis <post_a_reply@no_e-mail.com> wrote...

>For linear regression, the X'X matrix is

>

> n Sum(x)

> Sum(x) Sum(x^2)


It's only 2 by 2 - invert it.

| Sum(x^2) -Sum(x) |

| -Sum(x) n |

--------------------------------

n Sum(x^2) - Sum(x)^2

Yes, that nasty denominator term has the truncation error. OK, now you've

got me going. I'm still irritated that you don't seem to think of offering

any sort of fix for this. I'd suspect the answer is analogous to replacing

the univariate variance formula

Sum(x^2)/n-Average(x)^2

with

Sum((x-Average(x))^2)/n

If X is an n by k matrix of the independent variables plus a column of 1s

for the constant term, then let Z be the n by k matrix in which each column

on all rows is the average of all rows in the corresponding column of X. In

other words, for U an n by 1 column vector of all 1s, Z = (U U' X) / n, or

in Excel terms, Z could be given by

=SIGN(ROW(x))*MMULT(TRANSPOSE(SIGN(ROW(x))),x)/ROWS(x)

Then use [(X - Z) + Z] in place of X.

(X' X)^(-1) X' y

= ([(X - Z) + Z]' [(X - Z) + Z])^(-1) [(X - Z) + Z]' Y

= ((X - Z)' (X - Z) + Z' (X - Z) + (X - Z)' Z + Z' Z)^(-1) ((X - Z)' Y + Z' Y)

= ((X - Z)' (X - Z) + 2 X' Z - Z' Z)^(-1) ((X - Z)' Y + Z' Y)

At this point the Excel formula would get a bit long, but it'd make more

sense to wrap it in a udf anyway.

If this doesn't mitigate truncation error, what does?
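As a proof of concept, the centering idea can be sketched without matrix libraries: regress y on the mean-centered columns of X and recover the intercept from the means afterwards. This is my own minimal rendering of the approach, not Harlan's exact formula; it forms the centered normal equations Xc'Xc directly and solves them by Gaussian elimination, which is reasonable for small numbers of regressors:

```python
# Least squares with centering: regress y on the mean-centered columns
# of X (no constant column), then recover the intercept from the means.
def regress(X, y):
    n, k = len(X), len(X[0])
    colmean = [sum(row[j] for row in X) / n for j in range(k)]
    ymean = sum(y) / n
    Xc = [[row[j] - colmean[j] for j in range(k)] for row in X]
    yc = [yi - ymean for yi in y]
    A = [[sum(Xc[i][p] * Xc[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]                                 # Xc'Xc
    b = [sum(Xc[i][p] * yc[i] for i in range(n)) for p in range(k)]  # Xc'yc
    for p in range(k):                                      # elimination w/ pivoting
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for c in range(p, k):
                A[r][c] -= f * A[p][c]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in range(k - 1, -1, -1):                          # back substitution
        beta[p] = (b[p] - sum(A[p][c] * beta[c]
                              for c in range(p + 1, k))) / A[p][p]
    intercept = ymean - sum(beta[j] * colmean[j] for j in range(k))
    return intercept, beta
```

For data generated as y = 1 + 2*x1 + 3*x2, this recovers the coefficients to machine precision.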

Sep 27, 2001, 4:20:12 PM

Oof, a bit too cynical, even for you. <g>

Nope, I have a glimmer of hope that things might be changing. I have

been writing them for many years now about the same persistent errors;

started in 1988 with STDEV and RANDBETWEEN (latter eventually fixed).

Seems a lot depends on who happens to be awake when we raise the

issues. And I don't know the best folks to target concerns to, though

I am sure as heck trying now through a current relationship with

Microsoft.


I'd suggest we just keep things alive in a polite way; Jerry mentioned

an article I didn't know of, but will have read by this weekend. There

must be a lot more out there like that; seems there was something

similar a few years back in "American Statistician". Why there are so

many distinct algorithms within Excel to accomplish the same thing *in

principle* is puzzling (I have the routines for linear regression in

mind, among others); you'd think they would just do a single module

correctly.

Anyway, keep the faith. No question they have very talented coders, MS

just needs a willingness to acknowledge they made, and make, mistakes,

and are ready to act on feedback to correct them. *That* would be so

cool, and for me, show MS to be more attuned to customers than they

have been as yet. They are still aloof when it comes to Excel, but

may be changing, at least if the Mac side of the business is any

indication.

regards,

dave braden

In article <OVBFnHzRBHA.1528@tkmsftngp05>, Harlan Grove

<hrl...@aol.com> wrote:

--

E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Sep 27, 2001, 4:26:14 PM

Jerry,

Lots of hot air so far; what have you got? Spell it out for us, please;

I'm admittedly slow, and would appreciate less abstruse comments in

regards to this stuff.


dave braden

Sep 28, 2001, 9:04:17 AM

Here are some references (including for completeness the one I cited before):

- Sawitzki (1994) "Report on the reliability of data analysis

systems" Comput. Statist. Data Anal. 18:289-301

- Knüsel (1998) "On the accuracy of statistical distributions in

Microsoft Excel 97" Comput. Statist. Data Anal. 26:375-377

- McCullough & Wilson (1999) "On the accuracy of statistical

procedures in Microsoft Excel 97" Comput. Statist. Data Anal. 31:27-37.

These are the only things that Current Index to Statistics turns up for

the keyword "Excel" other than books and their reviews on doing

statistics with Excel (the books generally ignore the computational

problems; I can't say that I have read many of the reviews).

I don't recall the American Statistician article that you mention, but

when I get to the office I will start by checking out

- McCullough (1998) "Assessing the reliability of statistical

software: Part I" Amer. Statist. 52:358-366.

- McCullough (1999) "Assessing the reliability of statistical

software: Part II" Amer. Statist. 53:149-159.

John McKenzie of Babson College gave the following talk last month at

the 53rd ISI in Seoul, Korea

- McKenzie (2001) "A critical review of how well reference manuals and

textbooks inform the reader about concerns using Microsoft Excel 2000

for statistical analysis"

Proceedings will be published, but details are sketchy on the meeting

web site

These references deal specifically with Excel. The algorithms and

criticisms of their numerical properties are much older than Microsoft.

I would be very interested in receiving other references.

Jerry

David J. Braden wrote:

> ...

Sep 28, 2001, 11:18:30 AM

Jerry,

I can't get through to you by private e-mail. Could you please contact me at

bra...@rochester.rr.com?


TIA

Dave braden

in article 3BB4759E.3080707@no_e-mail.com, apparently Jerry W. Lewis at

post_a_reply@no_e-mail.com posted on 9/28/01 9:04 AM:

Sep 28, 2001, 12:31:40 PM

Jerry W. Lewis <post_a_reply@no_e-mail.com> wrote...

...
>I don't recall the American Statistician article that you mention, but

>when I get to the office I will start by checking out

>

> - McCullough (1998) "Assessing the reliability of statistical

>software: Part I" Amer. Statist. 52:358-366.

>

> - McCullough (1999) "Assessing the reliability of statistical

>software: Part II" Amer. Statist. 53:149-159.

>These references deal specifically with Excel. The algorithms and

>criticisms of their numerical properties are much older than Microsoft.

Being blunt and directly to the point, what is the preferred method for

calculating multivariate linear regression coefficients? There must be some

best method to minimize truncation error while still using matrix methods.

What is it?

Criticism is fine. Saying 'Excel's built-in functions do that poorly' is

fine. But at some point reciting what Excel can't do ceases to be valuable.

At that point it's time to address work-arounds. You haven't. Either you

don't want to help the OP _solve_ the problem, or you don't know how to

solve it. Which is it?

Sep 28, 2001, 6:16:16 PM

Harlan,

You crack me up. For *you*, of all people, to preface a remark as being

"blunt" is hilarious <vbg>. This thread has three oversized egos at work;

so, let's put our heads to it. I really admire your spirit and talent.


I'm going to start a new thread on distributions; I'll take them one-by-one,

down to the algorithm most appropriate for the distribution. First pass will

be on handling arguments; I would *really* appreciate your comments.

This is going to be, in a sense, "open-source" coding. There is a strong

engineering community here, as well as the math riff-raff like you <vbg>.

Let's just take the bull by the horns, OK?

Regards,

Dave Braden

See new thread shortly entitled: "Discrete Distributions in Excel"

in article OQmEOSFSBHA.1972@tkmsftngp05, apparently Harlan Grove at

Hrl...@aol.com posted on 9/28/01 12:31 PM:

Sep 28, 2001, 4:53:25 PM

David J. Braden <dbr...@rochester.rr.com> wrote...

>You crack me up. For *you*, of all people, to preface a remark as being

>"blunt" is hilarious <vbg>. . . .


OK, I take your point.

However, I'm not too concerned with discrete distributions since they're not

relevant to the residuals from linear regression.

So . . . do you know the maximum precision algorithm for calculating

regression coefficients?

Sep 28, 2001, 8:05:27 PM

Nope, I sure don't. But it's certainly worth searching for; I'm not sure, actually, that I would seek "maximum precision" so much as a solution that does better than what we currently have; it could also be that one would have to carry around extra doubles for synthetic division. Clueless on this one. But I certainly *do* wish there was internal consistency among all the regression routines in Excel.


Regards,

Dave Braden

in article uA9oUkHSBHA.1368@tkmsftngp04, apparently Harlan Grove at

Hrl...@aol.com posted on 9/28/01 4:53 PM:

Sep 29, 2001, 2:43:55 PM

Harlan Grove wrote:

> Being blunt and directly to the point, what is the preferred method for

> calculating multivariate linear regression coefficients? There must be some

> best method to minimize truncation error while still using matrix methods.

> What is it?

I would have to research the topic to give an informed answer. For the

background you might start with Kennedy and Gentle's book (Statistical

Computing, 1980, Marcel Dekker). For code, you might start with

algorithms published in Applied Statistics, most of which can be found

on Statlib.

I have looked at the basic statistical calculations in Excel because

there is value added in teaching my clients to do some of that for

themselves. For more challenging work I use dedicated statistical

packages where numerical issues have already been addressed by the

developers. Since David seems to think that MS might actually listen

this time, I might be willing to spend some time on it.

> Criticism is fine. Saying 'Excel's built-in functions do that poorly' is

> fine. But at some point reciting what Excel can't do ceases to be valuable.

> At that point it's time to address work-arounds. You haven't.

Actually this thread began on simple linear regression, and suggesting

workarounds by choosing Excel functions that do work well is exactly how

I entered this thread.

Jerry

Sep 29, 2001, 3:15:48 PM

>Actually this thread began on simple linear regression, and suggesting

>workarounds by choosing Excel functions that do work well is exactly how

>I entered this thread.


Yes, and I responded that they were only appropriate for problems with a single independent variable, and from that point the discussion focused on matrix methods, and you haven't contributed anything positive since then. If the OP has a single independent variable, this is a moot discussion - your alternative formulas suffice. On the other hand, if the OP has multiple

independent variables, then something more is needed. If truncation error is

the primary problem, I'd expect the refinement to the standard matrix

formula I offered a day or two ago would suffice for all but the most

pathological data.

Oct 1, 2001, 12:30:29 PM

>Now try this:

>=BINOMDIST(600,1200,0.5,TRUE)

>

>We should get 0.5, not #NUM; seems MS's algorithm is trying to do a

>combinatoric straight out, instead of using the incomplete beta

>(already implemented) to do the calcs.


OK, I give up on the rest, but you're wrong here. It shouldn't be #NUM!, but

it's not 0.5. You need an odd n parameter and x = INT(n / 2) to get a

cumulative value of 0.5, e.g., BINOMDIST(5,11,0.5,True) = 0.5,

BINOMDIST(6,12,0.5,True) > 0.5. x = 0 has positive probability, and that

means that for even n the rv x can take on n+1 (an odd number of) states.

You get to be right in the large, and I get to poke holes in your examples.

Happy?

BTW, R has no problem with your example: pbinom(600,1200,0.5) = 0.5115141

and pbinom(600,1201,0.5) = 0.5. Which leads me to wonder whether a set of

wrapper functions around the R binary wouldn't be expedient. For that

matter, R's code is GPL'ed, so freely available.
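The spot checks above are easy to reproduce without R by working in exact rational arithmetic. A Python sketch (`pbinom` is my stand-in named after R's function, not Excel code; for p = 1/2 every term is a dyadic rational, so the symmetric cases come out exactly 0.5):

```python
from fractions import Fraction
from math import comb

def pbinom(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p): what BINOMDIST(x, n, p, TRUE)
    should return.  Exact rational arithmetic avoids the intermediate
    overflow that apparently produces Excel's #NUM! at large n."""
    p = Fraction(p)  # 0.5 converts exactly to 1/2
    return float(sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
                     for k in range(int(x) + 1)))

print(pbinom(5, 11, 0.5))      # exactly 0.5 (odd n)
print(pbinom(600, 1201, 0.5))  # exactly 0.5 (odd n)
print(pbinom(600, 1200, 0.5))  # slightly above 0.5, with no #NUM!
```

Note `comb(1200, 600)` is an integer near 10^360, far beyond a double, which is why a naive float evaluation of the terms overflows.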

Oct 1, 2001, 7:33:30 PM

Harlan.

You are right; it shouldn't be 0.5. I meant to write "about 0.5". Sorry.


Now: what the heck is "R"?!? Sort of an S-Plus-like language? Compilable to

something we can tie into Excel?

Keep poking the holes, please. I want to get this *right*.

Did you try what I posted? It's simple, seems to work over all longs, but I

haven't pushed it that much; it always dominates Excel's approach, but may

need help in pathologic cases.

Regards, and many, many thanks.

Dave Braden

in article uGkjc$qSBHA.460@tkmsftngp04, apparently Harlan Grove at

Hrl...@aol.com posted on 10/1/01 12:30 PM:

Oct 2, 2001, 2:13:26 AM

David J. Braden wrote:

> Now: what the heck is "R"?!? Sort of a S-Plus like language?

S is a language developed at Bell Labs. S-Plus is a commercial

(Insightful Corp.) implementation of (currently) v4 of the S language.

R is a free implementation of v3 of the S language. Each has its own

extensions to the S language.

> Compilable to something we can tie into Excel?

Harlan?

Jerry

Oct 2, 2001, 2:30:21 AM

"Jerry W. Lewis" <post_a_reply@no_e-mail.com> wrote...

>David J. Braden wrote:

...


>>Compilable to something we can tie into Excel?

>

>Harlan?


I thought I stated that it was GPL'ed. The source code is freely available.

If nothing else, it could be lifted and translated into your language of

choice. Maybe a big translation job, but the 'wheel' may already be ready

for use rather than needing to be reinvented. But I'm certain that if we

came up with a derivative stats library based on GPL'ed code MS would have a

cow. So this must be a good idea!

Easy enough to check it out if you have a Win32 system.

Oct 5, 2001, 8:04:08 AM

In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis

<post_a_reply@no_e-mail.com> wrote:


I know what S is. Have used S-Plus for years, though not of late, as I

no longer have a platform that will support it. I was trying to ask

what "R" is. Any idea? Also, what is GPL? General Public License?

TIA

Dave

Oct 5, 2001, 8:49:28 AM

Look in the R section on Statlib

http://lib.stat.cmu.edu/R/CRAN/

There appear to be R versions for Mac OS 8, 9, & X

Jerry

Oct 5, 2001, 8:53:46 AM

> . . . Also, what is GPL? General Public License?

Yes, though there's an implicit prefix. The full name is the GNU General

Public License. R is a GNU project.

Oct 5, 2001, 2:49:23 PM

Thanks, Jerry! Will do.

In article <3BBDACB7.2050807@no_e-mail.com>, Jerry W. Lewis

<post_a_reply@no_e-mail.com> wrote:

> Look in the R section on Statlib

>

> http://lib.stat.cmu.edu/R/CRAN/

>

> There appear to be R versions for Mac OS 8, 9, & X

>

> Jerry

>

> David J. Braden wrote:

>

> > In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis

> > <post_a_reply@no_e-mail.com> wrote:

> >

> >

> >>David J. Braden wrote:

> >>

> >>

> >>>Now: what the heck is "R"?!? Sort of a S-Plus like language?

> >>>

> >>

> >>S is a language developed at Bell Labs. S-Plus is a commercial

> >>(Insightful Corp.) implementation of (currently) v4 of the S language.

> >>R is a free implementation of v3 of the S language. Each has its own

> >>extensions to the S language.

> >>

> >> > Compilable to something we can tie into Excel?

> >>

> >>Harlan?

> >>

> >

> > I know what S is. Have used S-Plus for years, though not of late, as I

> > no longer have a platform that will support it. I was trying to ask

> > what "R" is. Any idea? Also, what is GPL? General Public License?

> >

> > TIA

> > Dave

> >

> >

--

E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Oct 5, 2001, 2:53:48 PM

Jerry,

This is being fostered by Jan. Excellent!!! Check out his own pages

when you get a chance. Mac guy to the core. <vbg>

Dave

In article <3BBDACB7.2050807@no_e-mail.com>, Jerry W. Lewis

<post_a_reply@no_e-mail.com> wrote:

> Look in the R section on Statlib

>

> http://lib.stat.cmu.edu/R/CRAN/

>

> There appear to be R versions for Mac OS 8, 9, & X

>

> Jerry

>

> David J. Braden wrote:

>

> > In article <3BB95B5E.4090706@no_e-mail.com>, Jerry W. Lewis

> > <post_a_reply@no_e-mail.com> wrote:

> >

> >

> >>David J. Braden wrote:

> >>

> >>

> >>>Now: what the heck is "R"?!? Sort of a S-Plus like language?

> >>>

> >>

> >>S is a language developed at Bell Labs. S-Plus is a commercial

> >>(Insightful Corp.) implementation of (currently) v4 of the S language.

> >>R is a free implementation of v3 of the S language. Each has its own

> >>extensions to the S language.

> >>

> >> > Compilable to something we can tie into Excel?

> >>

> >>Harlan?

> >>

> >

> > I know what S is. Have used S-Plus for years, though not of late, as I

> > no longer have a platform that will support it. I was trying to ask

> > what "R" is. Any idea? Also, what is GPL? General Public License?

> >

> > TIA

> > Dave

> >

> >

--

E-mail (ROT-13'ed): qoe...@ebpurfgre.ee.pbz

Oct 5, 2001, 3:14:23 PM

Since I started this thread, maybe I can finish it.

R2 is a statistical value called the "Coefficient of

Determination" It is one of many provided when using the

MS function LINEST. It tells how good a linear regression

line can be predicted by the original data. So, If R2 is 1

or -1 the relationship is perfect. If the R2 is 0 there is

no relationship. R2 can never be greater than +/- 1. My

original data, using LINEST produced an R2 value of -1.18,

so I started this discussion.


Oct 5, 2001, 11:35:37 PM

to

<rno...@mbagolf.com> wrote...

...

>R2 is a statistical value called the "Coefficient of

>Determination" It is one of many provided when using the

>MS function LINEST. It tells how good a linear regression

>line can be predicted by the original data. So, If R2 is 1

>or -1 the relationship is perfect. If the R2 is 0 there is

>no relationship. R2 can never be greater than +/- 1. My

>original data, using LINEST produced an R2 value of -1.18,

>so I started this discussion.

...


You're confusing R^2 with correlation. R^2 = ESS / TSS, explanatory sum of

squares divided by total sum of squares. The other piece is RSS, residual

sum of squares, and it fits in TSS = ESS + RSS. As a ratio of sums of

squares of real numbers (I'm assuming you're not performing linear

regression on complex data), R^2 _CAN'T_ be negative. Mathematically

impossible. An R^2 of 1.0 means you've got nonstochastic linear data. An R^2

of 0 means your data, stochastic or not, is perfectly nonlinear (e.g.,

fitting a full period sine wave).
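[Editor's aside: Harlan's decomposition is easy to check numerically. A minimal Python sketch — not anything Excel runs, with made-up data — that fits the line via centered, DEVSQ-style sums and confirms that R^2 = ESS/TSS stays in [0, 1]:]

```python
# Sketch (not Excel's code): simple linear regression via centered sums,
# verifying the decomposition above: TSS = ESS + RSS, so R^2 = ESS/TSS
# can never be negative. Example data is made up.

def r_squared(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)                       # DEVSQ-style
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    tss = sum((yi - my) ** 2 for yi in y)                       # total SS
    rss = sum((yi - (slope * xi + intercept)) ** 2
              for xi, yi in zip(x, y))                          # residual SS
    ess = tss - rss                                             # explanatory SS
    return ess / tss

# Exactly linear data: R^2 is 1 (nonstochastic linear, as noted above).
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))   # -> 1.0
```

[Since ESS and RSS are both sums of squares of reals, the ratio is pinned to [0, 1] however noisy the data — which is exactly why LINEST's −1.18 signals a broken algorithm rather than a property of the data.]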

What you've got is a very bad case of truncation error. How many data

points? What are the largest and smallest X values?

And the fundamental point of the thread is: Excel does stats poorly. If you

want to do stats well, you'll either need to roll your own stats procedures

in Excel or you'll have to use something else.

Oct 6, 2001, 9:01:34 AM

to

Harlan Grove wrote:

> <rno...@mbagolf.com> wrote...

> ...

> And the fundamental point of the thread is: Excel does stats poorly. If you

> want to do stats well, you'll either need to roll your own stats procedures

> in Excel or you'll have to use something else.

That is overstating slightly, but not much. The general user has a right

to expect that RSQ(), SLOPE(), LINEST(), etc. will give the answers that

the functions claim to give. As you correctly point out, there are

circumstances where their answers are completely wrong, and the average

user has no basis for recognizing such circumstances.

My original post in this thread noted that DEVSQ(), CORREL() and COVAR()

do a very good job of controlling rounding error. If MS had based their

VAR, STD, and regression function series on DEVSQ(), CORREL(), and COVAR()

(cf. simple relations in post) then this thread (and far too many

similar ones in the past) would never have occurred.

Hopefully MS will finally commit to producing a quality product and will

eliminate the use of algorithms that were known to be inferior long

before Excel was a gleam in Bill's eye.

Jerry

Oct 8, 2001, 3:00:44 AM

to

Jerry W. Lewis <post_a_reply@no_e-mail.com> wrote...

...
>Hopefully MS will finally commit to producing a quality product and will

>eliminate the use of algorithms that were known to be inferior long

>before Excel was a gleam in Bill's eye.

How long have the inferior Excel functions been around? How long have

articles in the statistical journals cataloged their failings?

The nice people in Redmond have thick hides and very large egos indeed. They

don't admit mistakes. Against the corporate culture. Indeed, I don't think

they'd acknowledge these as mistakes. Another thread about negative zero

resulting from QUOTIENT(-1,5) and not comparing equal to (positive) zero has

confirmed their thinking (at least for me). They want benchmark figures.

Benchmark figures don't include accuracy of calculation results. No points

for deducing that fast inaccurate algorithms would be favored over slower

accurate ones.

I just assumed that the Excel programmers were a bunch of CS degree holders

with no clue about statistics, so they understood that, e.g., E[X^2] -

E[X]^2 would take only one loop while E[(X-E[X])^2] would take two, so it

was nothing more than understanding how to count op codes while simply being

ignorant of precision loss.
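[Editor's aside: the one-loop versus two-loop trade-off described above is easy to demonstrate. A Python sketch, with data contrived (huge mean, tiny spread) so the one-pass formula cancels catastrophically:]

```python
# Sketch of the precision-loss trade-off: the one-loop formula
# E[X^2] - E[X]^2 versus the two-loop formula E[(X - E[X])^2].
# The dataset is contrived to force cancellation: the variance signal
# (1/150) is far below one ulp of x^2 near 1e16.

def var_one_pass(xs):            # fast, numerically unstable
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def var_two_pass(xs):            # one extra loop, stable
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

data = [1e8 + 0.1, 1e8 + 0.2, 1e8 + 0.3]   # true population variance = 1/150
print(var_one_pass(data))   # garbage near 0 -- all significant digits cancel
print(var_two_pass(data))   # ~0.006666...
```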

The negative zero problem shows either willful ignorance (they bury their

heads when anyone mentions precision) or lack of concern about

precision/accuracy if it affects speed. Intel FPU hardware is ANSI/IEEE

754-compliant with regard to negative zero and positive zero testing equal.

Bizarrely, Excel isn't, which means that Excel, at least at times, isn't

using floating point hardware for numerical comparisons. Why? To shave

execution time. Do I think they know about the +0 <> -0 problem? No, I'll be

charitable and assume they were legitimately ignorant about this. The fix

required - ignoring the sign bits when all other bits are zero - would make

it more execution-efficient simply to use FPU compares. That would, however,

harm Excel's benchmarks, so I'll bet it's never fixed.

Actually, what really opened my eyes was QUOTIENT(-1,5) producing negative

zero in the first place. The formula =TRUNC(numerator/denominator) gives a

better result, giving positive zero when ABS(numerator) < ABS(denominator)

no matter what the respective signs are. So MS isn't using this rather

breathtakingly simple approach. No, they're almost certainly using a series

of low-level bit-wise operations. This goes beyond ignorance. This enters

the realm of gross negligence. At least they've covered their butts with

their disclaimers of any warranties. They really do mean it when they say

that Excel isn't warranted to be good for anything.
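[Editor's aside: the 754 behavior described above is easy to confirm in any language whose floats are IEEE doubles. A Python sketch:]

```python
# Sketch of the signed-zero semantics discussed above, using Python floats
# (IEEE 754 doubles, as in Excel). Negative zero exists as a bit pattern,
# but 754 comparison treats it as equal to positive zero -- the behavior
# the post says Excel's own comparisons fail to honor.
import math

neg_zero = -0.0
print(neg_zero == 0.0)                 # True: 754 says +0 and -0 compare equal
print(math.copysign(1.0, neg_zero))    # -1.0: the sign bit is still there
print(math.trunc(-1 / 5))              # 0: truncation toward zero yields a
                                       # plain zero, mirroring =TRUNC(-1/5)
```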

Oct 9, 2001, 7:10:08 PM

to

In article <OcjVBW8TBHA.1880@tkmsftngp07>, Harlan Grove

<Hrl...@aol.com> wrote:

> Jerry W. Lewis <post_a_reply@no_e-mail.com> wrote...

> ...

> >Hopefully MS will finally commit to producing a quality product and will

> >eliminate the use of algorithms that were known to be inferior long

> >before Excel was a gleam in Bill's eye.

>

> How long have the inferior Excel functions been around? How long have

> articles in the statistical journals cataloged their failings?

Well, from the beginning; and every new stat function introduced has

been poorly done. Not sure about how far back the stat journal articles

go, but I alerted them in 1985 for the first problem. With Version 5

we got a slew of new ones; again alerted them, to no avail. Similar

experience for Jerry Lewis.

>

> The nice people in Redmond have thick hides and very large egos indeed.

Actually Harlan, they *are* nice. This isn't the result of a single

person's will; it is an organizational issue; perplexing, to be sure,

but I encounter a tremendous amount of good will within MS.

> They don't admit mistakes. Against the corporate culture. Indeed, I

> don't think they'd acknowledge these as mistakes.

I tend to agree with you as far as the Excel math functions go. For

other aspects of the product line, they are superb; security patches

come to mind.

<<snip big rant>>

> At least they've covered their butts with

> their disclaimers of any warranties. They really do mean it when they say

> that Excel isn't warranted to be good for anything.

Sheesh Harlan, lighten up. This isn't a "dark force" to deal with; it's

a huge (dis-)organization we are bringing around. I strongly urge you

to consider where things can go from here, rather than focus so

bitterly on how they have gone.

We *will* see changes; their people are now alerted to the issues that

we raise, and are moving forward to rectifying them. I strongly urge

you to save the "me versus Big Brother" rant for another venue; let's

just work on cataloging problems we find, so we can get the job done.

OK by you?

Dave Braden

Dec 20, 2016, 3:13:20 AM

to

Hi all

this is funny. Missing values are a problem with LINEST.

However, the very Excel that does not allow for missing values ignores them when plotting a polynomial trendline.

Since I need exactly the regression parameters from a polynomial regression of the form y=a*X^3+b*X^2+c*X+d, the question becomes how I can *automatically* get the parameters from the equation in the plot into a set of cells.

I would also need t-test and significance but that can be solved later

Thank you!
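[Editor's aside: within Excel, the usual workaround for polynomial coefficients is an array formula along the lines of =LINEST(Y, X^{1,2,3}), entered over a 1x4 range after filtering out the missing rows — subject to the accuracy caveats of this whole thread. Outside Excel the same fit is a few lines; below is a pure-Python sketch (assumed data, not the poster's) that drops missing values the way the chart trendline does, then solves the cubic least-squares problem:]

```python
# Sketch (assumed workflow, not an Excel feature): fit y = a*x^3 + b*x^2 + c*x + d
# by least squares after dropping missing values -- the same filtering the chart
# trendline performs silently. Pure Python via the normal equations.

def cubic_fit(xs, ys):
    # Drop pairs where y is missing (None), as the trendline does.
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Normal equations (X^T X) beta = X^T y for the basis [x^3, x^2, x, 1].
    rows = [[x ** 3, x ** 2, x, 1.0] for x, _ in pairs]
    n = 4
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, (_, y) in zip(rows, pairs)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(ata[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (aty[r] - s) / ata[r][r]
    return beta  # [a, b, c, d]

# Exact cubic with one missing observation:
xs = list(range(8))
ys = [2 * x ** 3 - x ** 2 + 3 * x + 5 for x in xs]
ys[4] = None   # "missing value"
print(cubic_fit(xs, ys))   # -> approximately [2, -1, 3, 5]
```

[Note the normal equations are themselves poorly conditioned for wide x ranges — the very theme of this thread — so for real work a QR-based routine such as numpy.polyfit(x, y, 3) is the safer choice.]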

