[SciPy-User] How to estimate error in polynomial coefficients from scipy.polyfit?


Jeremy Conlin

Mar 25, 2010, 4:32:39 PM
to SciPy Users List
I am using scipy.polyfit to fit a curve to my data. Members of this
list have been integral in my understanding of how to use this
function. Now I would like to know how I can get the uncertainties
(standard deviations) of polynomial coefficients from the returned
values from scipy.polyfit. If I understand correctly, the residuals
are sometimes called the R^2 error, right? That gives an estimate of
how well we have fit the data. I don't know how to use rank or any of
the other returned values to get the uncertainties.

Can someone please help?
Thanks,
Jeremy
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

David Warde-Farley

Mar 25, 2010, 4:49:51 PM
to SciPy Users List
On 25-Mar-10, at 4:32 PM, Jeremy Conlin wrote:

> Now I would like to know how I can get the uncertainties
> (standard deviations) of polynomial coefficients from the returned
> values from scipy.polyfit. If I understand correctly, the residuals
> are sometimes called the R^2 error, right?

No, that's not what R^2 is. See http://en.wikipedia.org/wiki/Coefficient_of_determination

The residuals plus some assumptions about the kind of noise present in
your data can tell you something about the standard deviation of a
given predicted function value, but I'm not sure how to turn that into
a measure of uncertainty on the coefficient itself. My first instinct
would be to do it with bootstrap samples, i.e. if you have N data
points, sample with replacement from that data to generate, say, 100
sets of size N, fit the coefficients on each of them, and take the
empirical standard deviation. I'm not sure how polyfit responds to
duplicated points, though it should just reweight your least-squares
estimate a bit.

David

Charles R Harris

Mar 25, 2010, 4:52:32 PM
to SciPy Users List
On Thu, Mar 25, 2010 at 2:32 PM, Jeremy Conlin <jlco...@gmail.com> wrote:
> [snip]
>
> Can someone please help?

You want the covariance of the coefficients, var*(A.T * A)^-1, where A is the design matrix and var is the residual variance. I'd have to see what the scipy fit returns to tell you more. In any case, from that you can plot curves at +/- sigma to show the error bounds on the result. I can be more explicit if you want.

Chuck

josef...@gmail.com

Mar 25, 2010, 4:58:32 PM
to SciPy Users List

the easiest way to get the full statistical results for the fit is to
use statsmodels

something like

import numpy as np
import scikits.statsmodels as sm
results = sm.OLS(y, np.vander(x, order + 1)).fit()
dir(results)

with standard deviations on the parameter estimates, ...

results.params should be the same as np.polyfit
results.model.predict(xnew)
and the prediction errors are in there somewhere

without statsmodels it is a few lines of code, but requires "thinking"
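[Editor's sketch of those few lines, on made-up data; it assumes normal errors and more data points than coefficients:]

```python
import numpy as np

# made-up data for illustration: a noisy quadratic
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 30)
y = 1.5 * x**2 - 0.5 * x + 2.0 + rng.normal(scale=0.2, size=x.size)

deg = 2
A = np.vander(x, deg + 1)                # design (Vandermonde) matrix
coef, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)

sigma2 = res[0] / (x.size - (deg + 1))   # residual variance estimate
cov = sigma2 * np.linalg.inv(A.T @ A)    # covariance of the coefficients
stderr = np.sqrt(np.diag(cov))           # standard deviations of the coefficients
```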

Josef

Jeremy Conlin

Mar 25, 2010, 5:20:24 PM
to SciPy Users List

Thanks, Chuck. That seems to get closer to what I need. I am just
fitting data to a polynomial, nothing too fancy. I would like the
variance (not the covariance) of the coefficients. As far as what is
returned from scipy.polyfit, this is what the documentation says:

residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.

I don't think any of these is the "design matrix" you indicated I
need. Unfortunately, the documentation for linalg.lstsq does not say
what rcond is. Any ideas?

Anne Archibald

Mar 25, 2010, 6:21:31 PM
to SciPy Users List
On 25 March 2010 17:20, Jeremy Conlin <jlco...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 2:52 PM, Charles R Harris
> <charles...@gmail.com> wrote:
>>
>>
>> On Thu, Mar 25, 2010 at 2:32 PM, Jeremy Conlin <jlco...@gmail.com> wrote:
>>> [snip]
>>
>> You want the covariance of the coefficients, var*(A.T * A)^-1, where A is
>> the design matrix. I'd have to see what the scipy fit returns to tell you
>> more. In any case, from that you can plot curves at +/- sigma to show the
>> error bounds on the result. I can be more explicit if you want.
>>
>> Chuck
>
> Thanks Chuck.  That seems to get closer to what I need.  I am just
> fitting data to a polynomial, nothing too fancy.  I would like the
> variance (not the covariance) of the coefficients.  As far as what is
> returned from scipy.polyfit this is what the documentation says:

Just an important warning: for polynomials, especially high degree
polynomials, the coefficients are an awful way to specify them. If
what you really need is the coefficients, then you're stuck with it,
but if what you need is to estimate interpolated values, there are
better approaches.

I mention this in particular because you talk about needing "variance
rather than covariance". When you do a linear fit of a
several-parameter model to data with normal errors, the uncertainty in
the fitted parameters is a multidimensional Gaussian - that is, the
(say) one-sigma error region is a multidimensional ellipsoid. If you
chose a good parameterization for your model, this ellipsoid will line
up nicely with the axes, and you can describe it by its size along
each axis: that is, you only need a variance in each parameter.

But if your parameterization is not good, then this ellipse will be
some tilted long skinny shape, and taking its size along one axis can
drastically underestimate its size. You also have the problem that
there are correlations in your parameters: if one is high, say, then
the other is also likely to be high. The covariance matrix captures
all this information along with the variances you asked for. As you
might expect, life is vastly simpler if the ellipsoid is aligned with
the axes, so that this matrix is nearly diagonal and you can just read
the variances off the diagonal.

Unfortunately, the parameterization of polynomials by their
coefficients is not "good" in this sense: you almost always get very
long skinny non-aligned error ellipses. An easy example is if you're
fitting positive x,y pairs with a straight line y=mx+b. Then if m is
too low, b is almost certainly too high, since the uncertainty pivots
the line around the middle of the data set. With linear data you can
help this by using y = m(x-x_middle)+b, but when you go to
higher-order polynomials it all becomes messier and you can't usually
cure the problems this easily.
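[Editor's numeric check of that centering trick, on made-up x values; the coefficient covariance is proportional to inv(A.T A), and the proportionality constant cancels out of the correlation:]

```python
import numpy as np

x = np.linspace(10.0, 20.0, 40)   # positive x values, well away from zero

def slope_intercept_corr(xs):
    # correlation between slope and intercept errors for a straight-line fit
    A = np.vander(xs, 2)
    cov = np.linalg.inv(A.T @ A)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

r_raw = slope_intercept_corr(x)                  # close to -1: tilted, skinny ellipse
r_centered = slope_intercept_corr(x - x.mean())  # essentially zero
```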

I would say, look closely at your problem and think about whether you
really need the polynomial coefficients.

Anne

Jeremy Conlin

Mar 25, 2010, 6:40:06 PM
to SciPy Users List

Yikes! This sounds like it may be more trouble than it's worth. I
have a few sets of statistical data that each need to have curves fit
to them. I can see what the parameters are and they are similar
between each set. I want to know if the coefficients are within a few
standard deviations of each other. That's what I was hoping to get
at, but it seems to be more trouble than I'm willing to accept.
Unless anyone has a simpler solution.

Thanks,
Jeremy

Anne Archibald

Mar 25, 2010, 6:45:56 PM
to SciPy Users List

You could try using the Chebyshev polynomials; they're in the new
numpy, or you can just use the formulas (the degree-n one is cos(n
arccos(x)), and they're sensible on [-1,1]). They're not perfect, but
they're a much better representation of polynomials that gets rid of
most of the problems I described above (the covariances tend to be
much smaller).
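[Editor's sketch, on made-up data, using the numpy.polynomial.Chebyshev class, which maps the data window onto [-1, 1] internally before fitting:]

```python
import numpy as np
from numpy.polynomial import Chebyshev

# synthetic data: a noisy quadratic on an arbitrary interval
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 0.3 * x**2 - x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Chebyshev.fit rescales x onto [-1, 1], then fits in the Chebyshev
# basis; the fitted object is callable
cfit = Chebyshev.fit(x, y, deg=2)
y_pred = cfit(x)
```

The fitted values agree with an ordinary degree-2 polyfit (same least-squares problem, better-conditioned basis).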

Anne

Charles R Harris

Mar 25, 2010, 8:17:02 PM
to SciPy Users List

lstsq is a bit deficient in that regard; it doesn't return all the info you need. The design matrix is the matrix A in the equation Ac = y that you want to solve using least squares. In the case of polynomial fits it is the Vandermonde matrix of the desired degree given by vander(x, ndeg + 1), where x is the vector of the sample points.

To get error estimates you can either just use the package Josef mentioned or you can roll your own, which is what I generally do. If your sample points are reasonably distributed and the fit is of low degree, you can explicitly compute inv(dot(A.T, A)) and multiply it by the residual/(number of data points - number of coefficients). That gives the covariance; the square root of its diagonal is the estimated standard deviation of the coefficients. Note that this method of estimating the variance from the residual isn't very accurate unless you have a lot of data points; it's better to have independent estimates of the measurement errors if you can manage that.
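[Editor's sketch of that recipe on synthetic data, including the +/- sigma band on the fitted curve, computed as var(yhat) = diag(Af cov Af.T):]

```python
import numpy as np

# synthetic data: a noisy quadratic
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 25)
y = 4.0 * x**2 + 1.0 + rng.normal(scale=0.3, size=x.size)

deg = 2
A = np.vander(x, deg + 1)
c, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# covariance = inv(A.T A) * residual / (points - coefficients)
cov = np.linalg.inv(A.T @ A) * (res[0] / (x.size - (deg + 1)))
c_std = np.sqrt(np.diag(cov))            # std dev of each coefficient

# one-sigma band on the fitted curve
xf = np.linspace(0.0, 1.0, 100)
Af = np.vander(xf, deg + 1)
yf = Af @ c
yf_std = np.sqrt(np.sum((Af @ cov) * Af, axis=1))
upper, lower = yf + yf_std, yf - yf_std
```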

Chuck

josef...@gmail.com

Mar 25, 2010, 8:23:51 PM
to SciPy Users List
On Thu, Mar 25, 2010 at 6:45 PM, Anne Archibald

I don't think that multicollinearity matters so much when all
coefficients are included in the null hypothesis, because the test is
based on the entire covariance of the parameter estimates and not just
the variances, e.g. with an F-test.

One of the first references for testing equality of regression
coefficients across regressions that google found is

http://www.stata.com/support/faqs/stat/testing.html

Josef
(google is swallowing my sent emails today, so it will come up double
or not at all)

David Goldsmith

Mar 25, 2010, 11:40:46 PM
to SciPy Users List
On Thu, Mar 25, 2010 at 3:40 PM, Jeremy Conlin <jlco...@gmail.com> wrote:
> Yikes!  This sounds like it may be more trouble than it's worth.  I
> have a few sets of statistical data that each need to have curves fit
> to them.

That's an awfully generic need - it may be obvious from examination of the data that a line is inappropriate, but besides polynomials there are many other non-linear models (which can be linearly fit to data by means of data transformation) which possess fewer parameters (and thus are simpler from a parameter analysis perspective).  So, the question is: why are you fitting to polynomials?  If it's just to get a good fit to the data, you might be getting "more fit" than your data warrants (and even if that isn't a problem, you probably want to use a polynomial form different from "standard form," e.g., Lagrange interpolators).  Are you sure something like an exponential growth/decay or power law model (both of which are "more natural," linearizable, two-parameter models) wouldn't be more appropriate - it would almost certainly be simpler to analyze (and perhaps easier to justify to a referee).

On this note, perhaps some of our experts might care to comment: what "physics" (in a generalized sense) gives rise to a polynomial dependency of degree higher than two?  The only generic thing I can think of is something where third or higher order derivatives proportional to the independent variable are important, and those are pretty uncommon.

DG
 

josef...@gmail.com

Mar 26, 2010, 12:10:37 AM
to SciPy Users List
On Thu, Mar 25, 2010 at 11:40 PM, David Goldsmith
<d.l.go...@gmail.com> wrote:
> [snip]
>
> On this note, perhaps some of our experts might care to comment: what
> "physics" (in a generalized sense) gives rise to a polynomial dependency of
> degree higher than two?  The only generic thing I can think of is something
> where third or higher order derivatives proportional to the independent
> variable are important, and those are pretty uncommon.

Taylor series expansion is a "natural" source for higher order
polynomials, although as Anne pointed out, it's not a "nice" base for
the space of functions. In most of the examples I have seen, 3rd order
was the highest, after that non-parametric, Fourier, Chebychev or
Hermite.

I'm attaching the statsmodels equivalent of most of the example in the
Stata FAQ. A few parts are (still) a bit more complicated than the
Stata equivalent.
Final result: F-test rejects null of equal coefficients (small
p-value) if the coefficients are far enough apart, and does not reject
if the two regression equations are samples from the same data
generating process.

Josef

try_2regress.py

Charles R Harris

Mar 26, 2010, 12:14:25 AM
to SciPy Users List
On Thu, Mar 25, 2010 at 9:40 PM, David Goldsmith <d.l.go...@gmail.com> wrote:
> [snip]
>
> On this note, perhaps some of our experts might care to comment: what "physics" (in a generalized sense) gives rise to a polynomial dependency of degree higher than two?


Oh, there are lots of things. Detector responses, platinum resistance thermometers, optical aberrations, so on and so forth. Polynomials aren't always appropriate and they are usually useless for extrapolation, but they are often invaluable for getting a compact representation of the data.

Chuck 

PHo...@geosyntec.com

Mar 26, 2010, 1:33:36 PM
to scipy...@scipy.org
I'm no expert, but the power required to overcome aerodynamic drag varies with the cube of speed -- though the physics behind that is pretty well understood.

I guess if you were doing a bulk estimate of all of the other factors (fluid density & viscosity, drag coeff, etc) this would be an applicable use case.

Like I said, I'm hardly an expert...
-paul h.

# -------------------------
From: scipy-use...@scipy.org [mailto:scipy-use...@scipy.org] On Behalf Of David Goldsmith
Sent: Thursday, March 25, 2010 8:41 PM
To: SciPy Users List
Subject: Re: [SciPy-User] How to estimate error in polynomial coefficients from scipy.polyfit?

[snip]

On this note, perhaps some of our experts might care to comment: what "physics" (in a generalized sense) gives rise to a polynomial dependency of degree higher than two?  The only generic thing I can think of is something where third or higher order derivatives proportional to the independent variable are important, and those are pretty uncommon.


al...@ajackson.org

Mar 28, 2010, 2:36:53 PM
to scipy...@scipy.org
Crack permeability goes like the third power of the opening (that is,
fluid flow through cracks - think gas or oil in a fractured rock).

And for many problems, the Taylor expansion might appropriately be taken out to
third order. Sometimes the even orders cancel out, so to get the first
non-linear effects you have to go to third order. I can't think of a specific
example offhand, but I've seen it quite a few times.


--
-----------------------------------------------------------------------
| Alan K. Jackson | To see a World in a Grain of Sand |
| al...@ajackson.org | And a Heaven in a Wild Flower, |
| www.ajackson.org | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake |
-----------------------------------------------------------------------

Anne Archibald

Mar 28, 2010, 7:46:29 PM
to SciPy Users List
On 25 March 2010 23:40, David Goldsmith <d.l.go...@gmail.com> wrote:
> [snip]
>
> On this note, perhaps some of our experts might care to comment: what
> "physics" (in a generalized sense) gives rise to a polynomial dependency of
> degree higher than two?  The only generic thing I can think of is something
> where third or higher order derivatives proportional to the independent
> variable are important, and those are pretty uncommon.

In pulsar timing, what you measure is phase (actually pulse arrival
time); pulse period/frequency is its derivative, and spin-down is the
derivative of period and measures the power available for producing
radiation. These two parameters are also used in standard formulas to
infer magnetic field and age. The second derivative of period (third
derivative of the measured quantity) is generally too small to be
measured, but when it can be it tells you something about the
mechanism by which the spin-down is happening.

Beyond the plain physics, there is typically some low-frequency noise
(nobody's sure where it comes from, possibly the turbulence in the
superfluid layers of the neutron star) which is typically managed by
subtracting a best-fit polynomial (polynomial subtraction has the
property of killing off noise spectra that diverge at zero frequency,
which this noise appears to).

I suppose power laws, though ubiquitous, do not exactly count as
polynomials; generally their exponents are not integers.

Anne

David Goldsmith

Mar 28, 2010, 11:26:17 PM
to SciPy Users List
On Sun, Mar 28, 2010 at 11:36 AM, <al...@ajackson.org> wrote:
> Crack permeability goes like the third power of the opening (that is,
> fluid flow through cracks - think gas or oil in a fractured rock).

Power law or polynomial: from a regression standpoint, there's quite a big difference.

> And for many problems, the Taylor expansion might appropriately be taken out to
> third order. Sometimes the even orders cancel out so to get the first
> non-linear effects you have to go to third order. Can't think of a specific
> offhand, but I've seen it quite a few times.

A slightly less-idealized pendulum.
 
> I'm no expert, but the power required to overcome aerodynamic drag varies with the cube of speed

Same comment as above.

In my experience, compared to multi-term polynomial models, power law models are comparatively common (though I acknowledge that, as the power is one of the fit parameters, there's no a priori guarantee that it'll be an integer; indeed, since the integers have measure zero rel. the reals, the odds of getting an integer are zero, *unless* there is some "physical" reason why it should be an integer).

DG
 

Robert Kern

Mar 29, 2010, 10:34:15 AM
to SciPy Users List
On Sun, Mar 28, 2010 at 22:26, David Goldsmith <d.l.go...@gmail.com> wrote:
> On Sun, Mar 28, 2010 at 11:36 AM, <al...@ajackson.org> wrote:
>>
>> Crack permeability goes like the third power of the opening (that is,
>> fluid flow through cracks - think gas or oil in a fractured rock).
>
> Power law or polynomial: from a regression stand point, there's quite a big
> difference.

"Third power" == "x**3". He's not talking about a power law.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco

Jeremy Conlin

Mar 29, 2010, 11:08:48 AM
to SciPy Users List
On Thu, Mar 25, 2010 at 9:40 PM, David Goldsmith
<d.l.go...@gmail.com> wrote:
> [snip] So, the question is: why are you fitting
> to polynomials?  If it's just to get a good fit to the data, you might be
> getting "more fit" than your data warrants [snip]

I will only be fitting data to a first or second degree polynomial.
Eventually I would like to fit my data to an exponential or a power
law, just to see how it compares to a low-order polynomial. Choosing
these functions was based on qualitative analysis (i.e. "it looks
quadratic").

The best case scenario would be that I take what I learn from this
"simple" example and apply it to more difficult problems as they come
along down the road. It appears, however, that it's not so simple to
apply it to other problems. I wish I had more time to learn about
fitting data to curves. I'm sure there are a lot of powerful tools
that can help.

Anne Archibald

Mar 29, 2010, 12:24:33 PM
to SciPy Users List

I should say, then, that most of my dire-sounding comments can be
ignored in this case as long as you rescale the data to (-1,1) before
fitting your quadratic. (Which, come to think of it, gives you very
nearly the Chebyshev basis, but never mind.)
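[Editor's sketch of that rescaling, with made-up numbers: map x onto (-1, 1), then fit the quadratic in the rescaled variable.]

```python
import numpy as np

# made-up data on a raw scale far from (-1, 1)
rng = np.random.default_rng(5)
x = np.linspace(100.0, 200.0, 60)
y = 0.002 * x**2 - 0.1 * x + 5.0 + rng.normal(scale=0.4, size=x.size)

# map x onto [-1, 1] before fitting
t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
c = np.polyfit(t, y, 2)

# predictions go through the same rescaling
y_pred = np.polyval(c, t)
```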

Exponentials and power laws are even easier, since they're linear when
you take logs of the appropriate coordinates. (Though here too, it
helps to center your data before fitting.)
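[Editor's sketch of the log-transform for a power law y = a*x**p, which becomes a straight line in log-log coordinates; the data are synthetic, with multiplicative (log-normal) noise assumed:]

```python
import numpy as np

# synthetic power-law data with multiplicative noise
rng = np.random.default_rng(7)
x = np.linspace(1.0, 10.0, 50)
y = 3.0 * x**1.7 * np.exp(rng.normal(scale=0.05, size=x.size))

# taking logs makes the model linear: ln(y) = p*ln(x) + ln(a)
p_fit, loga = np.polyfit(np.log(x), np.log(y), 1)
a_fit = np.exp(loga)
```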

> The best case scenario would be that I take what I learn from this
> "simple" example and apply it to more difficult problems as they come
> along down the road.  It appears, however, that it's not so simple to
> apply it to other problems.  I wish I had more time to learn about
> fitting data to curves.  I'm sure there are a lot of powerful tools
> that can help.

This is a fine approach; in fact even if you end up doing non-linear
fitting (i.e. fits where the function is not a linear combination of
the parameters), the standard approach to getting errors is to just
use the derivatives of the fitting function at the best fit to treat
the problem as locally linear. (There are more elaborate alternatives
as well, including Monte Carlo approaches, which may be worth looking
into.)

Anne

David Goldsmith

Mar 29, 2010, 1:13:41 PM
to SciPy Users List

This is a good approach if applied to the right data: the residuals, i.e., the (signed, not absolute value) differences between the predicted and corresponding measured dependent variable values.  For example, if you fit a line to quadratic data and then plot the residuals v. the independent variable, you'll see systematic error: most of the residuals at the extremities will be on one side of zero, while most of those around the center will be on the other side, i.e., the residuals will look like a parabola.  If you then fit the original data to a quadratic model (and a quadratic model is "physically" appropriate) and plot the residuals, these should appear to be randomly (and rather symmetrically) distributed around zero.

This generalizes: systematic behavior of residuals reveals deficiencies in the assumed form of your model - the "best" (most "physical") model produces residuals randomly (though not necessarily uniformly) distributed around zero, because the "best" model captures all the non-randomness in your data.  Moral: when doing regression where you're not a priori certain of your model, always plot the residuals.
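[Editor's numeric illustration of that parabola-shaped residual pattern, using noise-free quadratic data so the pattern is exact:]

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)
y = x**2                       # genuinely quadratic, noise-free data

# residuals of the wrong (linear) model: parabola-shaped pattern
r_lin = y - np.polyval(np.polyfit(x, y, 1), x)

# residuals of the right (quadratic) model: zero up to round-off
r_quad = y - np.polyval(np.polyfit(x, y, 2), x)

ends_one_side = (r_lin[0] > 0) and (r_lin[-1] > 0)   # extremities above zero
centre_other_side = r_lin[x.size // 2] < 0           # center below zero
```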

DG


 

David Goldsmith

Mar 29, 2010, 1:31:52 PM
to SciPy Users List
On Mon, Mar 29, 2010 at 7:34 AM, Robert Kern <rober...@gmail.com> wrote:
> On Sun, Mar 28, 2010 at 22:26, David Goldsmith <d.l.go...@gmail.com> wrote:
>> On Sun, Mar 28, 2010 at 11:36 AM, <al...@ajackson.org> wrote:
>>> Crack permeability goes like the third power of the opening (that is,
>>> fluid flow through cracks - think gas or oil in a fractured rock).
>>
>> Power law or polynomial: from a regression stand point, there's quite a big
>> difference.
>
> "Third power" == "x**3". He's not talking about a power law.

Yes, I know that, but from a regression standpoint, unless there's an offset (constant) term (in which case a two-parameter polynomial fit is what you'll be doing), if your model is simply y = ax**3, aren't you better off doing the regression as if you were doing a power law, albeit w/ a fixed power (i.e., log transforming the data first, fixing the slope parameter at three, and then regressing to find the constant term, i.e., log(a))?

In other words, I was soliciting examples of situations where a true polynomial (as opposed to a monomial) model was appropriate - I maintain that if your model is a monomial (integer power law model, w/ only one term), then, as far as regression is concerned, it is more appropriate to think of it as a power law model w/ a fixed parameter, not as a polynomial model.  From this perspective, the number of "naturally occurring" polynomial models is greatly reduced.

DG
 

josef...@gmail.com

Mar 29, 2010, 1:47:15 PM
to SciPy Users List
On Mon, Mar 29, 2010 at 1:31 PM, David Goldsmith

<d.l.go...@gmail.com> wrote:
> On Mon, Mar 29, 2010 at 7:34 AM, Robert Kern <rober...@gmail.com> wrote:
>>
>> On Sun, Mar 28, 2010 at 22:26, David Goldsmith <d.l.go...@gmail.com>
>> wrote:
>> > On Sun, Mar 28, 2010 at 11:36 AM, <al...@ajackson.org> wrote:
>> >>
>> >> Crack permeability goes like the third power of the opening (that is,
>> >> fluid flow through cracks - think gas or oil in a fractured rock).
>> >
>> > Power law or polynomial: from a regression stand point, there's quite a
>> > big
>> > difference.
>>
>> "Third power" == "x**3". He's not talking about a power law.
>
> Yes, I know that, but from a regression stand point, unless there's an
> offset (constant) term (in which case a two parameter polynomial fit is what
> you'll be doing) if your model is simply y = ax**3, aren't you better off
> doing the regression as if you were doing a power law, albeit w/ a fixed
> power (i.e., log transforming the data first, fixing the slope parameter at
> three, and then regressing to find the constant term, i.e., log(a))?
>
> In other words, I was soliciting examples of situations where a true
> polynomial (as opposed to a monomial) model was appropriate - I maintain
> that if your model is a monomial (integer power law model, w/ only one
> term), then, as far as regression is concerned, it is more appropriate to
> think of it as a power law model w/ a fixed parameter, not as a polynomial
> model.  From this perspective, the number of "naturally occurring"
> polynomial models is greatly reduced.

That looks to me like splitting hairs.

It depends on the statistical model for the regression error, e.g.

y = a*x**3 + u, where u is normal, additive noise,
or
y = a*x**3 * z, where z is log-normal, multiplicative noise, so that
ln(y) = ln(a) + 3*ln(x) + u, with u = ln(z)

I would do it if I wanted to estimate or test whether 3 is the correct power,
but not if 3 is known.

Josef

David Goldsmith

Mar 29, 2010, 1:55:52 PM
to SciPy Users List

Do what?

DG
 

josef...@gmail.com

Mar 29, 2010, 1:59:58 PM
to SciPy Users List
On Mon, Mar 29, 2010 at 1:55 PM, David Goldsmith

"it" = (i.e., log transforming the data first, fixing the slope parameter at


three, and then regressing to find the constant term, i.e., log(a))?

Josef

Anne Archibald

Mar 29, 2010, 2:02:09 PM
to SciPy Users List

I would shorten that to "always plot the residuals". In fact, that's
one of the problems with unbinned maximum-likelihood methods: there
often isn't any feasible notion of "residual".

As a quality-of-fit measure, the standard tool is the chi-squared
distribution: if you (linearly) fit an n-parameter model to an m-point
data set, and the model captures the true behaviour of your system,
then the sum of the squared errors divided by the variances is
distributed according to a chi-squared distribution with m-n degrees
of freedom. So you can calculate this number and evaluate the
likelihood that the Gaussian noise would produce a quality-of-fit this
bad. If you get a tiny probability, you can be pretty sure your model
is not fitting the data adequately.
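[Editor's sketch of that chi-squared check, on synthetic data with a known, assumed measurement error sigma; scipy.stats.chi2 supplies the tail probability:]

```python
import numpy as np
from scipy import stats

# synthetic data: straight line with known measurement error sigma
rng = np.random.default_rng(6)
sigma = 0.1
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

deg = 1
resid = y - np.polyval(np.polyfit(x, y, deg), x)

chi2 = np.sum((resid / sigma) ** 2)   # sum of squared, variance-scaled errors
dof = x.size - (deg + 1)              # m data points minus n parameters
p = stats.chi2.sf(chi2, dof)          # chance of noise producing a fit this bad
```

A tiny p would indicate the model does not fit the data adequately; here the model is correct, so p should be unremarkable.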

Also worth noting is the highly annoying situation where you have what
purports to be a good fit: the chi-squared gives you a decent
probability of noise producing a fit this bad, and the error bars on
the residuals all nicely include zero, but there's *still* a
discernible trend in the residuals. I interpret this to mean that not
only does my model not capture the system's behaviour, but I have also
overestimated my error bars: the fact that there's a good fit in spite
of the trend means that there is not as much scatter in the residuals
as there should be.

Anne
