single pass Pearson Correlation alg?

Greg Makowski

unread,

Nov 6, 2002, 3:54:02 PM11/6/02

to

Hello,
I thought I remembered that a single pass Pearson Correlation
algorithm existed, but I am having trouble finding it. Perhaps it
doesn't exist, and I was thinking of some other alg that was single
pass. I have checked my Numerical Recipes in C and a few other old
numerical methods texts that I have.

Q1) Where can I find pseudo-code or an alg specification for a single
pass of the data to calculate a Pearson Correlation?

Q2) If you know that a single pass version of this does not exist,
please let me know.

It would be great if any responses could be emailed to me at
greg.m...@jda.com.

Thank you for your time, Greg

Stan Brown

unread,

Nov 6, 2002, 4:33:46 PM11/6/02

to

[cc'd to previous poster; follow-ups in newsgroup suggested]

Greg Makowski <greg.m...@jda.com> wrote in sci.stat.math:

>Hello,
>I thought I remembered that a single pass Pearson Correlation
>algorithm existed, but I am having trouble finding it. Perhaps it
>doesn't exist, and I was thinking of some other alg that was single
>pass. I have checked my Numerical Recipes in C and a few other old
>numerical methods texts that I have.

If you are talking about the coefficient of linear correlation, it
is SS(xy) / sqrt(SS(x)*SS(y)). That is equivalent to
n*SUM(xy) - SUM(x)*SUM(y)
divided by the square root of
[n*SUM(x^2) - (SUM(x))^2] * [n*SUM(y^2) - (SUM(y))^2]

So your algorithm would be fairly simple:
SUMXY = SUMX = SUMX2 = SUMY = SUMY2 = 0.0
n = 0
while (read x, y) {
++n
SUMX += x
SUMX2 += x*x
SUMY += y
SUMY2 += y*y
SUMXY += x*y
}
Then compute the above formula and print the result.

>It would be great if any responses could be emailed to me at
>greg.m...@jda.com.

Is there some reason why you can't read responses in the group?
Other people than you might be interested, you know.

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com
"Thoroughness. I always tell my students, but they are
constitutionally averse to painstaking work."
-- Emma Thompson, in /Wit/ (2000)

Clint Cummins

unread,

Nov 6, 2002, 5:38:12 PM11/6/02

to

Stan Brown <qx1...@bigfoot.com> wrote:
>SS(xy) ... is equivalent to
>n*SUM(xy) - SUM(x)*SUM(y) ...
Mathematically, they are exactly the same, but
numerically, if the mean is large relative the the variance,
the second form of the calculation will not be as accurate
as the first. A classic example of this is to compute
the variance of
X = 10000000001 10000000002 10000000003 using double precision.
If you use the X*X form, the last digits get lost in the rounding
error.
A recursive type calculation can be used to update the
means, variances and covariances. It would avoid problems
when the mean is large, but it can still be inaccurate if the
data is bimodal and the observations appear in an unfavorable sequence.
It's simpler and generally more accurate to do the
computation in 2 passes. It isn't all that slow with modern hardware,
even for extremely large amounts of data.

Clint Cummins

David Jones

unread,

Nov 7, 2002, 5:32:26 AM11/7/02

to

"Clint Cummins" <cl...@Stanford.EDU> wrote in message
news:aqc5kk$of4$1...@news.Stanford.EDU...

> It's simpler and generally more accurate to do the
> computation in 2 passes. It isn't all that slow with modern
hardware,
> even for extremely large amounts of data.
>
> Clint Cummins

... but it may not suit the context in which the thing is required,
particularly for statistics of a large number of simulations where
individual values are not required for other purposes. A simple
old-style Fortran code follows. Call with NK=0 to initialise, then
NK=1,2,3, etc., finally with NK=negative of sample size to get usual
unbiased est of population covariance matrix. This version assumes
the sum of squares won't become too large to hold. Revise final step
to get whatever means, standard deviations, correlations are required.
If you only have a single pair, it would be worth simplifying the code
to remove the do-loops.
SUBROUTINE ACC0(X,NX,XM,XS,D,NK,NQX)
REAL X(NX),XM(NX),XS(NQX,NQX),D(NX)
C
IF(NK)1,2,3
1 FN=-NK-1
DO 10 J=1,NX
DO 10 I=1,J
XS(I,J)=XS(I,J)/FN
10 XS(J,I)=XS(I,J)
RETURN
2 DO 20 J=1,NX
XM(J)=0.0
DO 20 I=1,J
20 XS(I,J)=0.0
RETURN
3 FN=NK
FN1=FLOAT(NK-1)/FN
DO 31 J=1,NX
D(J)=X(J)-XM(J)
XM(J)=XM(J)+D(J)/FN
DO 31 I=1,J
31 XS(I,J)=XS(I,J)+FN1*D(J)*D(I)
RETURN
END

Alan Miller

unread,

Nov 7, 2002, 7:25:23 PM11/7/02

to

"Greg Makowski" <greg.m...@jda.com> wrote in message
news:cb942b6d.02110...@posting.google.com...

Greg,
Here is the single-pass algorithm for updating means and sums of squares and
products of deviations from the mean. It is fairly well-conditioned. You
should be able to translate this into Fortran, Pascal, Basic or whatever
language you choose.

n = 0
xbar = 0, ybar = 0
sxx = 0, sxy = 0, syy = 0
repeat
read x, y
n = n + 1
devx = x - xbar, devy = y - ybar
xbar = xbar + devx/n, ybar = ybar + devy/n
sxx = sxx + devx*(x - xbar)
sxy = sxy + devx*(y - ybar)
syy = syy + devy*(y - ybar)
until end of data

The earliest reference to this algorithm which I have found is:
Jennrich, R.I. (1977) `Stepwise regression', pages 58-75 in the book
`Statistical methods for digital computers' edited by Enslein, Ralston &
Wilf, and published by Wiley.

Robert Jennrich was one of the principal people behind the BMDP statistical
package. I think he was at UCLA.

Cheers

G Robin Edwards

unread,

Nov 10, 2002, 3:39:08 PM11/10/02

to

In article <TvDy9.25077$5u4....@news-server.bigpond.net.au>,

I have always used this type of technique for computing means and
standard deviations, not necessarily because it saves computation time
but because it is a relatively accurate method too. It is particularly
valuable when the data are "stiff" - that is when they vary only in the
far right places of the observation. I thought that everyone would be
using similar updating methods, even the manufacturers of small
calculators!

--
Robin Edwards ZFC W Serious Statistical Software
REAL Statistics with Graphics for RISC OS machines
Please email S...@argonet.co.uk for details of our loan software.

Elliot Cramer

unread,

Nov 12, 2002, 12:32:49 AM11/12/02

to

In sci.stat.math Alan Miller <ami...@bigpond.net.au> wrote:
: "Greg Makowski" <greg.m...@jda.com> wrote in message

: news:cb942b6d.02110...@posting.google.com...
:> Hello,
:> I thought I remembered that a single pass Pearson Correlation

see
Youngs, E. A., & Cramer, E. M. (1971). Some results relevant to choice of
sum and sum-of-product algorithms. Technometrics, 13, 657-665.

Elliot Cramer

unread,

Nov 12, 2002, 12:56:42 AM11/12/02

to

In sci.stat.math Elliot Cramer <cra...@unc.edu> wrote:

this is the Fortran implementation

OBS = 0.0
DO 330 J = 1,NVART
CELLM(J) = 0.0
DO 330 K = 1,J
330 SSWITH(K,J) = 0.
C
C
DO 360 K = 1,NVART
CELLM(K) = CELLM(K) + X(K)
IF (OBS .EQ. 1.0) GO TO 360
DIFF1 = OBS*X(K)-CELLM(K)
DIFF2 = DIFF1/(OBS*(OBS-1.))
DO 355 L = 1,K
C COMPUTE DIRECT SUM OF DEVIATIONS
PROD = (OBS*X(L)-CELLM(L))*DIFF2
355 SSWITH(L,K) = SSWITH(L,K) + PROD
360 CONTINUE

Clint Cummins

unread,

Nov 12, 2002, 6:06:59 PM11/12/02

to

The above code cannot work, because OBS is
never incremented.
It would simply yield a divide by zero....

Clint Cummins

Elliot Cramer

unread,

Nov 12, 2002, 11:48:44 PM11/12/02

to

In sci.stat.math Clint Cummins <cl...@stanford.edu> wrote:

:>
:>
:>: see

:>: Youngs, E. A., & Cramer, E. M. (1971). Some results relevant to choice of
:>: sum and sum-of-product algorithms. Technometrics, 13, 657-665.
:>
:>this is the Fortran implementation

sorry, the was an OBS = OSb+1 that got left out and an goto and test for
end. (+ alot of other stuff that is not relevant here)

:>
:> OBS = 0.0