
Statistics Functions


Jerry W. Lewis

Oct 2, 2001, 12:26:33 AM
This is an outgrowth of the thread
  "LINEST with r2 = -1.18 ???"
from microsoft.public.excel.worksheet.functions and of the resulting
project that David Braden started in this group under the subjects
  "Discrete Distributions"
  "Normal Distribution"
  "Binomial Distribution"
where we are attempting to collect better algorithms for Excel's math
and stat functions, which we hope MS will incorporate into the product.

As was noted in "LINEST with r2 = -1.18 ???", VAR, VARA, DVAR, VARP,
VARPA, DVARP, STDEV, STDEVA, and STDEVP are all numerically unstable
because they compute SUM(x^2), which can rapidly lose precision
(truncation error) for large n or for values that carry many
significant figures. The LINEST ... thread proposed workarounds based
on DEVSQ(), which are mathematically equivalent to, but much more
numerically stable than, the other functions.
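
To see the instability concretely, here is a small illustration (my
own, in Python rather than Excel, since doubles behave the same way)
comparing the one-pass SUM(x^2) formula against the DEVSQ-style
computation on data with a large common offset:

```python
# Hypothetical demonstration, not Excel itself: the one-pass
# "computational" formula for the sample variance cancels two huge,
# nearly equal quantities, while the DEVSQ-based formula does not.
data = [1e8, 1e8 + 1, 1e8 + 2]   # true sample variance is exactly 1.0
n = len(data)
mean = sum(data) / n

# One-pass formula of the SUM(x^2) type that VAR() effectively uses;
# the subtraction destroys nearly all significant figures here.
naive = (sum(x * x for x in data) - n * mean * mean) / (n - 1)

# Two-pass formula via the sum of squared deviations (what DEVSQ computes).
devsq = sum((x - mean) ** 2 for x in data)
stable = devsq / (n - 1)

print(naive, stable)   # stable is 1.0; naive is badly wrong
```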

Similarly the LINEST ... thread noted that SLOPE, INTERCEPT, RSQ,
PEARSON, FORECAST, STEYX, LINEST, TREND, etc. have similar problems for
similar reasons. Workarounds based on COVAR(), CORREL(), and DEVSQ()
were proposed that were mathematically equivalent but much more
numerically stable.

It would be a refreshing change if MS would simply implement the
proposed workarounds (which have been known for years) in place of their
numerically unstable approaches (which have been criticized for years).
That would be an almost totally free improvement, since they could
simply replace poor existing code with calls to other existing code --
almost zero cost for coding, validation, etc.

If MS is really considering a deeper commitment to improvement, to the
point of rewriting functions rather than just replacing inferior ones
with calls to existing superior equivalents, then the following should
be considered:

While DEVSQ(), COVAR(), and CORREL() are far superior to the other
functions mentioned above, they are not as good as they could be.
DEVSQ() first calculates SUM(x) and divides it by COUNT(x) to get
AVERAGE(x). Much more rarely, SUM(x) itself can suffer truncation
error, particularly with large data sets.

The supposed advantage of VAR(), etc. is that they require only one
pass through the data, whereas DEVSQ() requires two. That can
translate into longer computing time for DEVSQ() with large data sets.
But the time advantage is overshadowed by the fact that the answer
from VAR(), etc. can be completely wrong for large data sets due to
uncontrolled rounding/truncation error.

You can have the best of both worlds by taking an entirely different
approach. In the following pseudo-VBA code, the mean (instead of the
sum) and devsq (instead of sum x^2) are accumulated/updated for each
observation. That gives the DEVSQ() result with only one pass through
the data and with reduced potential for the truncation problems that can
occur with SUM(x).

n = 0
mean = 0
devsq = 0
For Each x In data
    n = n + 1
    dev = x - mean                      ' deviation from the old mean
    mean = mean + dev / n               ' update the running mean
    devsq = devsq + dev * (x - mean)    ' update the running DEVSQ
Next
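
For anyone who wants to experiment with the update rule outside of
VBA, here is a direct Python translation of the loop above, with a
small worked check appended (the data set and expected values are my
own, chosen so the answers are easy to verify by hand):

```python
def devsq_one_pass(data):
    """One-pass accumulation of the mean and the DEVSQ-style
    sum of squared deviations, updating both per observation."""
    n = 0
    mean = 0.0
    devsq = 0.0
    for x in data:
        n += 1
        dev = x - mean             # deviation from the old mean
        mean += dev / n            # update the running mean
        devsq += dev * (x - mean)  # update the running DEVSQ
    return n, mean, devsq

# Hand-checkable example: mean is 5, sum of squared deviations is 32.
n, mean, devsq = devsq_one_pass([2, 4, 4, 4, 5, 5, 7, 9])
print(n, mean, devsq)
```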

For bivariate data (correlation, regression, etc.) you can do the same
thing:

n = 0
meanX = 0
meanY = 0
devsqX = 0
devsqY = 0
scXY = 0
For Each {x,y} In data    ' pseudo-syntax: iterate over (x,y) pairs
    n = n + 1
    devX = x - meanX
    meanX = meanX + devX / n
    devsqX = devsqX + devX * (x - meanX)
    devY = y - meanY
    meanY = meanY + devY / n
    devsqY = devsqY + devY * (y - meanY)
    scXY = scXY + devX * (y - meanY)
Next

After processing all the data, scXY = COVAR(x,y)*n but it was obtained
in one pass and with less potential for truncation error.
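
To illustrate how the accumulated quantities turn into the worksheet
functions (my sketch, using the standard identities slope =
scXY/devsqX, intercept = meanY - slope*meanX, and r =
scXY/SQRT(devsqX*devsqY), which are not spelled out above):

```python
from math import sqrt

def bivariate_stats(pairs):
    """One-pass accumulation of means, DEVSQs, and the cross-product
    scXY, then conversion to SLOPE, INTERCEPT, and PEARSON."""
    n = 0
    mean_x = mean_y = 0.0
    devsq_x = devsq_y = 0.0
    sc_xy = 0.0
    for x, y in pairs:
        n += 1
        dev_x = x - mean_x
        mean_x += dev_x / n
        devsq_x += dev_x * (x - mean_x)
        dev_y = y - mean_y
        mean_y += dev_y / n
        devsq_y += dev_y * (y - mean_y)
        sc_xy += dev_x * (y - mean_y)   # note: old devX, new meanY
    slope = sc_xy / devsq_x                 # SLOPE(y, x)
    intercept = mean_y - slope * mean_x     # INTERCEPT(y, x)
    r = sc_xy / sqrt(devsq_x * devsq_y)     # PEARSON / CORREL
    return slope, intercept, r

# The points lie exactly on y = 2x + 1, so slope 2, intercept 1, r = 1.
print(bivariate_stats([(1, 3), (2, 5), (3, 7), (4, 9)]))
```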

LINEST() (also LOGEST()) not only does simple linear regression, it can
also handle multiple regression (including polynomial regression). It
does so by calculating what are called the "normal equations",
X'X b = X'y
This is the matrix equivalent of calculating SUM(x^2) etc. on the way to
computing SLOPE(), etc. Just as you can solve simple linear regression
without computing SUM(x^2), so you can get least squares estimates for
the more general LINEST() problems without first computing X'X. The
method is called "orthogonal triangularization of X" and is discussed
starting on p.314 of the book "Statistical Computing" by Kennedy and
Gentle (and is probably in most other books on the subject as well).
Will someone take a stab at writing a general LINEST() replacement using
this approach?
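
As a starting point, here is a minimal sketch of the idea in Python
(my own toy version, not the Kennedy & Gentle algorithm verbatim, and
with none of the pivoting or tolerance checks a production LINEST()
would need): factor X = QR by modified Gram-Schmidt, then solve
R b = Q'y by back-substitution, so that X'X is never formed.

```python
def lstsq_qr(X, y):
    """Least squares via orthogonal triangularization of X.
    X is an n-by-p list of rows; returns the coefficient vector b."""
    n, p = len(X), len(X[0])
    Q = [row[:] for row in X]           # columns of Q built in place
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for i in range(j):              # remove components along earlier q_i
            R[i][j] = sum(Q[k][i] * Q[k][j] for k in range(n))
            for k in range(n):
                Q[k][j] -= R[i][j] * Q[k][i]
        R[j][j] = sum(Q[k][j] ** 2 for k in range(n)) ** 0.5
        for k in range(n):
            Q[k][j] /= R[j][j]
    # Form Q'y, then back-substitute on the triangular system R b = Q'y.
    qty = [sum(Q[k][j] * y[k] for k in range(n)) for j in range(p)]
    b = [0.0] * p
    for j in range(p - 1, -1, -1):
        b[j] = (qty[j] - sum(R[j][i] * b[i] for i in range(j + 1, p))) / R[j][j]
    return b

# Fit y = 1 + 2x: first design column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
print(lstsq_qr(X, y))   # approximately [1.0, 2.0]
```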

While we are on the subject of LINEST(), we should probably look at its
statistics when const=FALSE. My vague recollection is that none of the
statistics output is right when you ask for a no-intercept model.

Jerry
