Q1) Where can I find pseudo-code or an alg specification for a single
pass of the data to calculate a Pearson Correlation?
Q2) If you know that a single pass version of this does not exist,
please let me know.
It would be great if any responses could be emailed to me at
greg.m...@jda.com.
Thank you for your time, Greg
Greg Makowski <greg.m...@jda.com> wrote in sci.stat.math:
>Hello,
>I thought I remembered that a single pass Pearson Correlation
>algorithm existed, but I am having trouble finding it. Perhaps it
>doesn't exist, and I was thinking of some other alg that was single
>pass. I have checked my Numerical Recipes in C and a few other old
>numerical methods texts that I have.
If you are talking about the coefficient of linear correlation, it
is SS(xy) / sqrt(SS(x)*SS(y)). That is equivalent to
n*SUM(xy) - SUM(x)*SUM(y)
divided by the square root of
[n*SUM(x^2) - (SUM(x))^2] * [n*SUM(y^2) - (SUM(y))^2]
So your algorithm would be fairly simple:
SUMXY = SUMX = SUMX2 = SUMY = SUMY2 = 0.0
n = 0
while (read x, y) {
++n
SUMX += x
SUMX2 += x*x
SUMY += y
SUMY2 += y*y
SUMXY += x*y
}
Then compute the above formula and print the result.
>It would be great if any responses could be emailed to me at
>greg.m...@jda.com.
Is there some reason why you can't read responses in the group?
Other people than you might be interested, you know.
--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com
"Thoroughness. I always tell my students, but they are
constitutionally averse to painstaking work."
-- Emma Thompson, in /Wit/ (2000)
Clint Cummins
... but it may not suit the context in which the thing is required,
particularly for statistics of a large number of simulations where
individual values are not required for other purposes. A simple
old-style Fortran code follows. Call with NK=0 to initialise, then
NK=1,2,3, etc., finally with NK=negative of sample size to get usual
unbiased est of population covariance matrix. This version assumes
the sum of squares won't become too large to hold. Revise final step
to get whatever means, standard deviations, correlations are required.
If you only have a single pair, it would be worth simplifying the code
to remove the do-loops.
SUBROUTINE ACC0(X,NX,XM,XS,D,NK,NQX)
REAL X(NX),XM(NX),XS(NQX,NQX),D(NX)
C
IF(NK)1,2,3
1 FN=-NK-1
DO 10 J=1,NX
DO 10 I=1,J
XS(I,J)=XS(I,J)/FN
10 XS(J,I)=XS(I,J)
RETURN
2 DO 20 J=1,NX
XM(J)=0.0
DO 20 I=1,J
20 XS(I,J)=0.0
RETURN
3 FN=NK
FN1=FLOAT(NK-1)/FN
DO 31 J=1,NX
D(J)=X(J)-XM(J)
XM(J)=XM(J)+D(J)/FN
DO 31 I=1,J
31 XS(I,J)=XS(I,J)+FN1*D(J)*D(I)
RETURN
END
Greg,
Here is the single-pass algorithm for updating means and sums of squares and
products of deviations from the mean. It is fairly well-conditioned. You
should be able to translate this into Fortran, Pascal, Basic or whatever
language you choose.
n = 0
xbar = 0, ybar = 0
sxx = 0, sxy = 0, syy = 0
repeat
read x, y
n = n + 1
devx = x - xbar, devy = y - ybar
xbar = xbar + devx/n, ybar = ybar + devy/n
sxx = sxx + devx*(x - xbar)
sxy = sxy + devx*(y - ybar)
syy = syy + devy*(y - ybar)
until end of data
The earliest reference to this algorithm which I have found is:
Jennrich, R.I. (1977) `Stepwise regression', pages 58-75 in the book
`Statistical methods for digital computers' edited by Enslein, Ralston &
Wilf, and published by Wiley.
Robert Jennrich was one of the principal people behind the BMDP statistical
package. I think he was at UCLA.
Cheers
I have always used this type of technique for computing means and
standard deviations, not necessarily because it saves computation time
but because it is a relatively accurate method too. It is particularly
valuable when the data are "stiff" - that is when they vary only in the
far right places of the observation. I thought that everyone would be
using similar updating methods, even the manufacturers of small
calculators!
--
Robin Edwards ZFC W Serious Statistical Software
REAL Statistics with Graphics for RISC OS machines
Please email S...@argonet.co.uk for details of our loan software.
see
Youngs, E. A., & Cramer, E. M. (1971). Some results relevant to choice of
sum and sum-of-product algorithms. Technometrics, 13, 657-665.
this is the Fortran implementation
OBS = 0.0
DO 330 J = 1,NVART
CELLM(J) = 0.0
DO 330 K = 1,J
330 SSWITH(K,J) = 0.
C
C
DO 360 K = 1,NVART
CELLM(K) = CELLM(K) + X(K)
IF (OBS .EQ. 1.0) GO TO 360
DIFF1 = OBS*X(K)-CELLM(K)
DIFF2 = DIFF1/(OBS*(OBS-1.))
DO 355 L = 1,K
C COMPUTE DIRECT SUM OF DEVIATIONS
PROD = (OBS*X(L)-CELLM(L))*DIFF2
355 SSWITH(L,K) = SSWITH(L,K) + PROD
360 CONTINUE
Clint Cummins
:>
:>
:>: see
:>: Youngs, E. A., & Cramer, E. M. (1971). Some results relevant to choice of
:>: sum and sum-of-product algorithms. Technometrics, 13, 657-665.
:>
:>this is the Fortran implementation
sorry, the was an OBS = OSb+1 that got left out and an goto and test for
end. (+ alot of other stuff that is not relevant here)
:>
:> OBS = 0.0