[SciPy-User] StdErr Problem with Gary Strangman's linregress function

308 views
Skip to first unread message

tota...@mac.com

unread,
Jan 10, 2010, 4:35:35 PM1/10/10
to SciPy Users List

Hello, Excel and scipy.stats.linregress are disagreeing on the standard error of a regression. 

I need to find the standard errors of a bunch of regressions, and prefer to use pure Python than RPy. So I am going to scipy.stats.linregress, as advised at:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/lin_reg/#linregress

from scipy import stats
x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
gradient
5.3935773611970186
intercept
-16.281127993087829
r_value
0.72443514211849758
r_value**2
0.52480627513624778
std_err
3.6290901222878866


The problem is that the std error calculation does not agree with what is returned in Microsoft Excel's STEYX function (whereas all the other output does). From Excel:




Anybody knows what's going on? Any alternative way of getting the standard error without going to R?



Skipper Seabold

unread,
Jan 10, 2010, 4:59:43 PM1/10/10
to SciPy Users List

'std_err' is the standard error of 'gradient' above, not the standard error of the regression as reported in Excel.

You might want to have a look at the statsmodels scikit as a possible alternative to R.  I recommend getting the trunk source until the next release, which should be soon.

http://statsmodels.sourceforge.net/

Skipper



PastedGraphic-1.tiff

Bruce Southey

unread,
Jan 10, 2010, 8:21:17 PM1/10/10
to SciPy Users List
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user


The Excel help is rather cryptic by   :"Returns the standard error of the predicted y-value for each x in the regression. The standard error is a measure of the amount of error in the prediction of y for an individual x." But clearly this is not the same as the standard error of the 'gradient' (slope) returned by linregress. Without checking the formula, STEYX appears returns the square root what most people call the mean square error (MSE).

Bruce
PastedGraphic-1.tiff

josef...@gmail.com

unread,
Jan 10, 2010, 8:41:29 PM1/10/10
to SciPy Users List
>>> ((y-intercept-np.array(x)*gradient)**2).sum()/(4.-2.)
136.80611125682617
>>> np.sqrt(_)
11.6964144615701

I think this should be the estimate of the standard deviation of the noise/error term.

Josef
PastedGraphic-1.tiff

Nuttall, Brandon C

unread,
Jan 11, 2010, 10:47:44 AM1/11/10
to SciPy Users List

For what it’s worth, using by the definition of standard error of the estimate in Crow, Davis, and Maxfield, 1960, Statistics Manual: Dover Publications (p. 156), the Excel function provides the “correct” standard error of the estimate.  Using notation from Crow, Davis, and Maxfield:

 

import numpy as np

n = 4.0

x = np.array([5.05, 6.75, 3.21, 2.66])

y = np.array([1.65, 26.5, -5.93, 7.96])

x2 = x*x

y2 = y*y

s2x = (4.0*x2.sum()-x.sum()*x.sum())/(n*(n-1.0))

s2y = (4.0*y2.sum()-y.sum()*y.sum())/(n*(n-1.0))

xy = x * y

b = (4.0*xy.sum()-x.sum()*y.sum())/(4.0*x2.sum()-x.sum()*x.sum())

a = (y.sum()-b*x.sum())/n

s2xy = ((n-1.0)/(n-2.0))*(s2y-b*b*s2x)

ste = np.sqrt(s2xy)

r=b*np.sqrt(s2x)/np.sqrt(s2y)

print "intercept: ",a

print "gradient (slope): ",b

print "correlation coefficient, r: ",r

print "std err est: ",ste

 

Produces the output :

 

intercept:  -16.2811279931

gradient (slope):  5.3935773612

correlation coefficient, r:  0.724435142118

std err est:  11.6964144616

 

This same value for the standard error of the estimate is reported with the sample x,y data at the VassarStats, Statistical Computation Web Site, http://faculty.vassar.edu/lowry/VassarStats.html.

 

Brandon Nuttall, KRPG-1364

Kentucky Geological Survey

www.uky.edu/kgs

bnut...@uky.edu (KGS, Mo-We)

Brandon...@ky.gov (EEC, Th-Fr)

859-257-5500 ext 30544 (main)

859-323-0544 (direct)

859-684-7473 (cell)

859-257-1147 (FAX)

Nuttall, Brandon C

unread,
Jan 11, 2010, 3:07:02 PM1/11/10
to SciPy Users List

OK, I think I’ve figured it out.

 

The numpy covariance function doesn’t seem to return the actual sample variances (it returns a population variance?). What this means is that for the linregress() function in the stats.py source file, the quantity sterrest is not calculated correctly and needs to be adjusted to the sample variance. In addition, it includes the quantity ssxm, sum of squares for x (?) and I can’t find documentation for its inclusion.

 

# as implemented

# sterrest = np.sqrt((1-r*r)*ssym / ssxm / df)

# should be corrected to

sterrest = np.sqrt((1-r*r)*(ssym*n)/df)

 

Having made this correction, both the example provided and the example in Crow, Davis, and Maxfield (Table 6.1, p. 154) provide the same value for the standard error of the estimate and the value matches what is calculated by Excel.

 

I don’t know anything about SVN or submitting a correction, so someone will have to help me out or do it for me.

 

Thanks.

 

Brandon

 

Brandon Nuttall, KRPG-1364

Kentucky Geological Survey

www.uky.edu/kgs

bnut...@uky.edu (KGS, Mo-We)

Brandon...@ky.gov (EEC, Th-Fr)

859-257-5500 ext 30544 (main)

859-323-0544 (direct)

859-684-7473 (cell)

859-257-1147 (FAX)

 

From: scipy-use...@scipy.org [mailto:scipy-use...@scipy.org] On Behalf Of josef...@gmail.com
Sent: Sunday, January 10, 2010 8:41 PM
To: SciPy Users List
Subject: Re: [SciPy-User] StdErr Problem with Gary Strangman's linregress function

 

 

On Sun, Jan 10, 2010 at 8:21 PM, Bruce Southey <bsou...@gmail.com> wrote:

tota...@mac.com

unread,
Jan 11, 2010, 3:08:46 PM1/11/10
to SciPy Users List
Thanks very much to all who have helped with this. 

I am going to go with the first-principles formulae as per below. 

Otherwise I also asked on Stack Overflow and one person answered with a scikits example:



This same value for the standard error of the estimate is reported with the sample x,y data at the VassarStats, Statistical Computation Web Site,http://faculty.vassar.edu/lowry/VassarStats.html.

<image001.png>




Anybody knows what's going on? Any alternative way of getting the standard error without going to R?



_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user


The Excel help is rather cryptic by   :"Returns the standard error of the predicted y-value for each x in the regression. The standard error is a measure of the amount of error in the prediction of y for an individual x." But clearly this is not the same as the standard error of the 'gradient' (slope) returned by linregress. Without checking the formula, STEYX appears returns the square root what most people call the mean square error (MSE).

Bruce

josef...@gmail.com

unread,
Jan 11, 2010, 4:19:48 PM1/11/10
to SciPy Users List
On Mon, Jan 11, 2010 at 3:08 PM, <tota...@mac.com> wrote:
> Thanks very much to all who have helped with this.
> I am going to go with the first-principles formulae as per below.
> Otherwise I also asked on Stack Overflow and one person answered with a
> scikits example:
> http://stackoverflow.com/questions/2038667/scipy-linregress-function-erroneous-standard-error-return

If the old version of linregress matched excel, as you say, then I
unintentionally changed the meaning of this value in response to a
previous bug report (see http://projects.scipy.org/scipy/ticket/874 )

It's sometimes difficult to figure out what a value is supposed to be,
if there are neither sufficient documentation nor tests for it. I had
the numbers of linregress verified against statsmodels, but the
standard error just means something different than the definition in
excel.

But as Skipper said, for all but the simplest regression case,
scikits.statsmodels is much more general and produces more results.

Josef

tota...@mac.com

unread,
Jan 11, 2010, 4:34:19 PM1/11/10
to SciPy Users List
not a problem Josef. The new output was luckily wildly different enough so that "that something had happened" was easy to spot. Again thanks for all the help.

Tom

Reply all
Reply to author
Forward
0 new messages