from scipy import stats
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
5.3935773611970186gradient
-16.281127993087829intercept
0.72443514211849758r_value
0.52480627513624778r_value**2
3.6290901222878866std_err
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
For what it’s worth, using by the definition of standard error of the estimate in Crow, Davis, and Maxfield, 1960, Statistics Manual: Dover Publications (p. 156), the Excel function provides the “correct” standard error of the estimate. Using notation from Crow, Davis, and Maxfield:
import numpy as np
n = 4.0
x = np.array([5.05, 6.75, 3.21, 2.66])
y = np.array([1.65, 26.5, -5.93, 7.96])
x2 = x*x
y2 = y*y
s2x = (4.0*x2.sum()-x.sum()*x.sum())/(n*(n-1.0))
s2y = (4.0*y2.sum()-y.sum()*y.sum())/(n*(n-1.0))
xy = x * y
b = (4.0*xy.sum()-x.sum()*y.sum())/(4.0*x2.sum()-x.sum()*x.sum())
a = (y.sum()-b*x.sum())/n
s2xy = ((n-1.0)/(n-2.0))*(s2y-b*b*s2x)
ste = np.sqrt(s2xy)
r=b*np.sqrt(s2x)/np.sqrt(s2y)
print "intercept: ",a
print "gradient (slope): ",b
print "correlation coefficient, r: ",r
print "std err est: ",ste
Produces the output :
intercept: -16.2811279931
gradient (slope): 5.3935773612
correlation coefficient, r: 0.724435142118
std err est: 11.6964144616
This same value for the standard error of the estimate is reported with the sample x,y data at the VassarStats, Statistical Computation Web Site, http://faculty.vassar.edu/lowry/VassarStats.html.
Brandon Nuttall, KRPG-1364
Kentucky Geological Survey
bnut...@uky.edu (KGS, Mo-We)
Brandon...@ky.gov (EEC, Th-Fr)
859-257-5500 ext 30544 (main)
859-323-0544 (direct)
859-684-7473 (cell)
859-257-1147 (FAX)
OK, I think I’ve figured it out.
The numpy covariance function doesn’t seem to return the actual sample variances (it returns a population variance?). What this means is that for the linregress() function in the stats.py source file, the quantity sterrest is not calculated correctly and needs to be adjusted to the sample variance. In addition, it includes the quantity ssxm, sum of squares for x (?) and I can’t find documentation for its inclusion.
# as implemented
# sterrest = np.sqrt((1-r*r)*ssym / ssxm / df)
# should be corrected to
sterrest = np.sqrt((1-r*r)*(ssym*n)/df)
Having made this correction, both the example provided and the example in Crow, Davis, and Maxfield (Table 6.1, p. 154) provide the same value for the standard error of the estimate and the value matches what is calculated by Excel.
I don’t know anything about SVN or submitting a correction, so someone will have to help me out or do it for me.
Thanks.
Brandon
Brandon Nuttall, KRPG-1364
Kentucky Geological Survey
bnut...@uky.edu (KGS, Mo-We)
Brandon...@ky.gov (EEC, Th-Fr)
859-257-5500 ext 30544 (main)
859-323-0544 (direct)
859-684-7473 (cell)
859-257-1147 (FAX)
From:
scipy-use...@scipy.org [mailto:scipy-use...@scipy.org] On Behalf
Of josef...@gmail.com
Sent: Sunday, January 10, 2010 8:41 PM
To: SciPy Users List
Subject: Re: [SciPy-User] StdErr Problem with Gary Strangman's
linregress function
On Sun, Jan 10, 2010 at 8:21 PM, Bruce Southey <bsou...@gmail.com> wrote:
This same value for the standard error of the estimate is reported with the sample x,y data at the VassarStats, Statistical Computation Web Site,http://faculty.vassar.edu/lowry/VassarStats.html.
<image001.png>
Anybody knows what's going on? Any alternative way of getting the standard error without going to R?
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
The Excel help is rather cryptic by :"Returns the standard error of the predicted y-value for each x in the regression. The standard error is a measure of the amount of error in the prediction of y for an individual x." But clearly this is not the same as the standard error of the 'gradient' (slope) returned by linregress. Without checking the formula, STEYX appears returns the square root what most people call the mean square error (MSE).
Bruce
If the old version of linregress matched excel, as you say, then I
unintentionally changed the meaning of this value in response to a
previous bug report (see http://projects.scipy.org/scipy/ticket/874 )
It's sometimes difficult to figure out what a value is supposed to be,
if there are neither sufficient documentation nor tests for it. I had
the numbers of linregress verified against statsmodels, but the
standard error just means something different than the definition in
excel.
But as Skipper said, for all but the simplest regression case,
scikits.statsmodels is much more general and produces more results.
Josef
Tom