[SciPy-User] Least Square fit and goodness of fit


Benedikt Riedel

May 14, 2010, 3:01:06 PM
to scipy...@scipy.org
Hey,

I am fairly new to SciPy and am trying to do a least-squares fit to a set of data. Currently, I am using the following code:

from numpy import exp
from scipy.optimize import leastsq

fitfunc = lambda p, x: p[0] + p[1]*exp(-x)
errfunc = lambda p, x, y: y - fitfunc(p, x)
pinit = [20., 20.]
out = leastsq(errfunc, pinit, args=(tau, R4ctsdataselect), full_output=1)

I am now trying to get the goodness of fit out of this data. I am sort of running into a brick wall because I found a lot of conflicting ways of how to calculate it.

I am aware of the chisquare function in scipy.stats, but the documentation seems a little confusing to me. Any help would be greatly appreciated.

Thanks very much in advance.

Cheers,

Ben


josef...@gmail.com

May 14, 2010, 3:51:29 PM
to SciPy Users List
On Fri, May 14, 2010 at 3:01 PM, Benedikt Riedel <bri...@wisc.edu> wrote:
> Hey,
>
> I am fairly new Scipy and am trying to do a least square fit to a set of
> data. Currently, I am using following code:
>
> fitfunc = lambda p,x: p[0]+ p[1]*exp(-x)
> errfunc = lambda p, x, y: (y-fitfunc(p,x))
> pinit = [20,20.]
> out = leastsq(errfunc, pinit, args=(tau,R4ctsdataselect), full_output=1)
>
> I am now trying to get the goodness of fit out of this data. I am sort of
> running into a brick wall because I found a lot of conflicting ways of how
> to calculate it.

For regression, the usual measure is the coefficient of determination:
http://en.wikipedia.org/wiki/Coefficient_of_determination

   R^2 = 1 - SS_err / SS_tot

Note your fitfunc is linear in the parameters, so it can be estimated
better by linear least squares (OLS).
Linear regression is handled in statsmodels, and you can get lots of
statistics without worrying about the formulas.
If you only have one slope parameter, then scipy.stats.linregress also works.

scipy.optimize.curve_fit (scipy 0.8) can also give you the covariance
of the parameter estimates.
http://docs.scipy.org/scipy/docs/scipy.optimize.minpack.curve_fit
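(Editor's sketch, not part of the original message: the R^2 computation described above, applied to a leastsq fit of the same model form. The arrays `x` and `y` are invented stand-ins for the poster's `tau` and `R4ctsdataselect`.)

```python
# Computing the coefficient of determination R^2 = 1 - SS_err/SS_tot
# for a least-squares fit of y = a + b*exp(-x).  Data are made up.
import numpy as np
from scipy.optimize import leastsq

x = np.linspace(0.0, 5.0, 50)
y = 20.0 + 20.0 * np.exp(-x) + 0.1 * np.sin(7.0 * x)  # fake "data" with a wiggle

fitfunc = lambda p, x: p[0] + p[1] * np.exp(-x)
errfunc = lambda p, x, y: y - fitfunc(p, x)

p_final, ier = leastsq(errfunc, [1.0, 1.0], args=(x, y))

residuals = errfunc(p_final, x, y)
ss_err = np.sum(residuals**2)           # sum of squared residuals
ss_tot = np.sum((y - y.mean())**2)      # total (mean-corrected) sum of squares
r_squared = 1.0 - ss_err / ss_tot
print(r_squared)                        # close to 1 for a good fit
```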

> I am aware of the chisquare function in stats function, but the
> documentation seems a little confusing to me. Any help would be greatly
> appreciates.

chisquare and others like Kolmogorov-Smirnov are more for testing the
goodness of fit of entire distributions, not for how well a curve or
line fits the data.

Josef

>
> Thanks very much in advance.
>
> Cheers,
>
> Ben
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy...@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user


Benedikt Riedel

May 16, 2010, 12:12:21 AM
to SciPy Users List
On Fri, May 14, 2010 at 14:51, <josef...@gmail.com> wrote:
On Fri, May 14, 2010 at 3:01 PM, Benedikt Riedel <bri...@wisc.edu> wrote:
> Hey,
>
> I am fairly new Scipy and am trying to do a least square fit to a set of
> data. Currently, I am using following code:
>
> fitfunc = lambda p,x: p[0]+ p[1]*exp(-x)
> errfunc = lambda p, x, y: (y-fitfunc(p,x))
> pinit = [20,20.]
> out = leastsq(errfunc, pinit, args=(tau,R4ctsdataselect), full_output=1)
>
> I am now trying to get the goodness of fit out of this data. I am sort of
> running into a brick wall because I found a lot of conflicting ways of how
> to calculate it.

For regression the usual is
http://en.wikipedia.org/wiki/Coefficient_of_determination
coefficient of determination is

   R^2 = 1 - {SS_{err} / SS_{tot}}

Note your fitfunc is linear in parameters and can be better estimated
by linear least squares, OLS.
linear regression is handled in statsmodels and you can get lot's of
statistics without worrying about the formulas.
If you only have one slope parameter, then scipy.stats.linregress also works


Thanks for the information. I am still not quite sure if this is what my boss wants, because there should not be an average y value.
 
scipy.optimize.curve_fit (scipy 0.8) can also give you the covariance
of the parameter estimates.
http://docs.scipy.org/scipy/docs/scipy.optimize.minpack.curve_fit

I have been trying this out, but the fit just looks horrid compared to using the leastsq method, even though they call the same function according to the documentation.


> I am aware of the chisquare function in stats function, but the
> documentation seems a little confusing to me. Any help would be greatly
> appreciates.

chisquare and others like kolmogorov-smirnov are more for testing the
goodness-of-fit of entire distributions, not for how well a curve or
line fits the data.


That is what I thought, which is what caused my confusion when I asked other people and they told me to use it.
 
Josef

>
> Thanks very much in advance.
>
> Cheers,
>
> Ben
>
>
>



--
Benedikt Riedel
Graduate Student University of Wisconsin-Madison
Department of Physics
Office: 2304 Chamberlin Hall
Lab: 6247 Chamberlin Hall
Tel:  (608) 301-5736
Cell: (213) 519-1771
Lab: (608) 262-5916

josef...@gmail.com

May 16, 2010, 6:50:29 AM
to SciPy Users List
The definition of R-squared with the y.mean() correction is pretty
uncontroversial if there is a constant in the regression (although I
mainly know the linear case).

If there is no constant in the regression, the definition of R-squared
is ambiguous, but it is usually computed without the mean correction
of y.

Josef
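(Editor's illustration, not part of the original message: a quick numeric comparison of the mean-corrected and uncorrected R-squared definitions for a regression through the origin. All numbers are made up.)

```python
# R^2 with and without the y.mean() correction, for a no-constant
# regression y = b*x fitted by least squares.
import numpy as np

x = np.arange(1.0, 11.0)
noise = np.array([0.5, -0.3, 0.2, -0.1, 0.4, -0.2, 0.1, -0.4, 0.3, -0.5])
y = 3.0 * x + noise

b = np.sum(x * y) / np.sum(x**2)        # least-squares slope, no intercept
resid = y - b * x

r2_centered = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
r2_uncentered = 1.0 - np.sum(resid**2) / np.sum(y**2)
print(r2_centered, r2_uncentered)       # the two definitions disagree
```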

Benedikt Riedel

May 16, 2010, 9:05:54 PM
to SciPy Users List
What I still do not understand is that curve_fit gives me a different output than leastsq, even though curve_fit calls leastsq.

I tried to get the chi-squared because we want to plot contours of chi-squared from the minimum to the maximum. I used the following code:


fitfunc = lambda p,x: p[0]+ p[1]*exp(-x)
errfunc = lambda p, x, y: (y-fitfunc(p,x))
pinit = [20,20.]

def func(x, a, b):
     return a*exp(-x) + b

pfinal, covar = curve_fit(func,tau, R4ctsdataselect, p0=pinit, sigma=R4errctsdataselect)
print pfinal
print covar
dof=size(tau)-size(pinit)
print dof
chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau), 2)/fitfunc(pinit, tau)))/dof
print chi2

I am not 100% sure I am doing the degrees of freedom calculation right. I got the chi-square formula from the Pearson chi-squared test.

Thank you very much for the help so far.

Cheers,

Ben

josef...@gmail.com

May 16, 2010, 11:33:31 PM
to SciPy Users List
On Sun, May 16, 2010 at 9:05 PM, Benedikt Riedel <bri...@wisc.edu> wrote:
> What I still do not understand is the fact that curve_fit gives me a
> different output then leastsq, even though curve_fit calls leastsq.
>
> I tried to get the chi-squared because we want to plot contours of
> chi-square from the minimum to the maximum. I used following code:
>
> fitfunc = lambda p,x: p[0]+ p[1]*exp(-x)
> errfunc = lambda p, x, y: (y-fitfunc(p,x))
> pinit = [20,20.]
>
> def func(x, a, b):
>      return a*exp(-x) + b
>
> pfinal, covar = curve_fit(func,tau, R4ctsdataselect, p0=pinit,
> sigma=R4errctsdataselect)

this uses weighted least squares
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares problem

In your initial example with leastsq you don't have any weighting,
it's just ordinary least squares

maybe that's the difference.
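(Editor's sketch, not part of the original message: a minimal demonstration that passing `sigma` to curve_fit changes the fitted parameters, since the problem becomes weighted least squares. The data and error arrays are invented.)

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * np.exp(-x) + b

x = np.linspace(0.0, 4.0, 30)
rng = np.random.default_rng(0)
y = 20.0 * np.exp(-x) + 20.0 + rng.normal(0.0, 0.5, x.size)
errors = np.linspace(0.1, 2.0, x.size)  # hypothetical per-point std deviations

p_plain, _ = curve_fit(func, x, y, p0=[20.0, 20.0])     # ordinary least squares
p_weighted, _ = curve_fit(func, x, y, p0=[20.0, 20.0],
                          sigma=errors)                 # weighted least squares
print(p_plain, p_weighted)  # both near (20, 20), but not identical
```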



> print pfinal
> print covar
> dof=size(tau)-size(pinit)
> print dof
> chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau), 2)/fitfunc(pinit,
> tau)))/dof
> print chi2
>
> I am not 100% sure I am doing the degrees of freedom calculation right. I
> got the chi-square formula from the Pearson chi-squared test.

I don't recognize your formula for chi2, and I don't see the
connection to the Pearson chi-squared test.

Do you have a reference?

Josef

Benedikt Riedel

May 17, 2010, 12:18:00 AM
to SciPy Users List
On Sun, May 16, 2010 at 22:33, <josef...@gmail.com> wrote:
On Sun, May 16, 2010 at 9:05 PM, Benedikt Riedel <bri...@wisc.edu> wrote:
> What I still do not understand is the fact that curve_fit gives me a
> different output then leastsq, even though curve_fit calls leastsq.
>
> I tried to get the chi-squared because we want to plot contours of
> chi-square from the minimum to the maximum. I used following code:
>
> fitfunc = lambda p,x: p[0]+ p[1]*exp(-x)
> errfunc = lambda p, x, y: (y-fitfunc(p,x))
> pinit = [20,20.]
>
> def func(x, a, b):
>      return a*exp(-x) + b
>
> pfinal, covar = curve_fit(func,tau, R4ctsdataselect, p0=pinit,
> sigma=R4errctsdataselect)

this uses weighted least squares
sigma : None or N-length sequence
   If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares problem

In your initial example with leastsq you don't have any weighting,
it's just ordinary least squares

maybe that's the difference.



Yeah I guess that will be it.
 

> print pfinal
> print covar
> dof=size(tau)-size(pinit)
> print dof
> chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau), 2)/fitfunc(pinit,
> tau)))/dof
> print chi2
>
> I am not 100% sure I am doing the degrees of freedom calculation right. I
> got the chi-square formula from the Pearson chi-squared test.

I don't recognize your formula for chi2, and I don't see the
connection to Pearson chi-squared test .

Do you have a reference?


I based my use of the Pearson test on what I read in an Econometrics book, but the wiki has a pretty good description. I basically based it on the example there, where the expected values would be what comes out of the fit and the observed values would be "R4ctsdataselect" at those specific points.

http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test

 
Josef


Thanks again

Ben

 

josef...@gmail.com

May 17, 2010, 1:20:59 AM
to SciPy Users List
I looked at that, but it's a completely different case: the values in
the formulas are frequencies

Oi = an observed frequency;
Ei = an expected (theoretical) frequency, asserted by the null hypothesis;

not points on a regression curve

Josef

Benedikt Riedel

May 17, 2010, 2:01:07 AM
to SciPy Users List
Thanks for the clarification. I am still not sure how to get the chi-squared value of my regression, though. When I use the formula under "Regression Analysis" here

http://en.wikipedia.org/wiki/Goodness_of_fit

I get a chi-squared somewhere around 19, which seems way too large compared to the value of 3.2 I get for the same data set when I fit it using gnuplot, where gnuplot supposedly uses the weighted sum of squares of residuals. I do not fully understand this because of the results I get.

Here is the python code I used:

chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau), 2)/pow(R4errctsdataselect,2)))/dof

Sorry for being so thick-headed; statistics is just beyond me at times.

Cheers,

Ben

josef...@gmail.com

May 17, 2010, 7:35:27 AM
to SciPy Users List
On Mon, May 17, 2010 at 2:01 AM, Benedikt Riedel <bri...@wisc.edu> wrote:
> Thanks for the clarification. I am still not sure how to get the chi-squared
> value of my regression though. When I use the formula under "Regression
> Analysis" here
>
> http://en.wikipedia.org/wiki/Goodness_of_fit
>
> I get a chi-square somewhere around 19, which seems way to large compared to
> the value of 3.2 I get for the same data set when I fit it using gnuplot.
> Where gnuplot supposedly used the weighted sum of squares of residuals. I do
> not fully this because of the results I get.
>
> Here is the python code I used:
>
> chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau),
> 2)/pow(R4errctsdataselect,2)))/dof


From a gnuplot help page, it looks like what they call chisquare is WSSR/dof,

which would be something like

chi2=(sum( ( R4ctsdataselect-fitfunc(pinit, tau)) /
sqrt(R4errctsdataselect) )**2 )/dof

I'm not sure whether the sqrt is in there or not, because I don't
remember the normalization that is used, weights or weights squared

Josef

josef...@gmail.com

May 17, 2010, 11:20:13 AM
to SciPy Users List
On Mon, May 17, 2010 at 7:35 AM, <josef...@gmail.com> wrote:
> On Mon, May 17, 2010 at 2:01 AM, Benedikt Riedel <bri...@wisc.edu> wrote:
>> Thanks for the clarification. I am still not sure how to get the chi-squared
>> value of my regression though. When I use the formula under "Regression
>> Analysis" here
>>
>> http://en.wikipedia.org/wiki/Goodness_of_fit
>>
>> I get a chi-square somewhere around 19, which seems way to large compared to
>> the value of 3.2 I get for the same data set when I fit it using gnuplot.
>> Where gnuplot supposedly used the weighted sum of squares of residuals. I do
>> not fully this because of the results I get.
>>
>> Here is the python code I used:
>>
>> chi2=(sum(pow(R4ctsdataselect-fitfunc(pinit, tau),
>> 2)/pow(R4errctsdataselect,2)))/dof
>
>
> from some gnuplot help page it looks like what they call chisquare is WSSR/dof
>
> which would be something like
>
> chi2=(sum(  ( R4ctsdataselect-fitfunc(pinit, tau)) /
> sqrt(R4errctsdataselect) )**2  )/dof
>
> I'm not sure whether the sqrt is in there or not, because I don't
> remember the normalization that is used, weights or weights squared

(for reference)
gnuplot is pretty vague about the denominator
http://theochem.ki.ku.dk/on_line_docs/gnuplot/gnuplot_21.html#SEC81

a bit better explanation of the terminology
http://www.graphpad.com/faq/viewfaq.cfm?faq=926

any more explicit reference has sigma in the denominator

Josef

Benedikt Riedel

May 17, 2010, 1:26:59 PM
to SciPy Users List
Thanks for the references. I have adjusted the code so that R4errctsdataselect is sigma and not sigma squared. Oddly enough, I made the stupid mistake of using the original parameter guess rather than the final fitted parameters in my chi-squared calculation, which of course threw me off.

Thanks again for the help.

Cheers,

Ben
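(Editor's summary sketch, not part of the original thread: the reduced chi-squared as finally settled on above, i.e. WSSR/dof computed with the final fitted parameters and per-point standard deviations sigma. All arrays are invented stand-ins for tau, R4ctsdataselect, and R4errctsdataselect.)

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * np.exp(-x) + b

x = np.linspace(0.0, 4.0, 30)
sigma = np.full(x.size, 0.5)            # per-point standard deviations
rng = np.random.default_rng(1)
y = 20.0 * np.exp(-x) + 20.0 + rng.normal(0.0, sigma)

p_final, covar = curve_fit(func, x, y, p0=[20.0, 20.0], sigma=sigma)

dof = x.size - p_final.size             # N data points minus N fit parameters
chi2_red = np.sum(((y - func(x, *p_final)) / sigma)**2) / dof
print(chi2_red)                         # near 1 when the errors are estimated well
```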