> Now I would like to know how I can get the uncertainties
> (standard deviations) of polynomial coefficients from the returned
> values from scipy.polyfit. If I understand correctly, the residuals
> are sometimes called the R^2 error, right?
No, that's not what R^2 is. See http://en.wikipedia.org/wiki/Coefficient_of_determination
The residuals plus some assumptions about the kind of noise present in
your data can tell you something about the standard deviation of a
given predicted function value, but I'm not sure how to turn that into
a measure of uncertainty on the coefficient itself. My first instinct
would be to do it with bootstrap samples, i.e. if you have N data
points, sample with replacement from that data to generate, say, 100
sets of size N, fit the coefficients on each of them, and take the
empirical standard deviation. I am not sure how polyfit handles
duplicated points, though; they should just reweight your least-squares
estimate a bit.
David
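For concreteness, a minimal sketch of the bootstrap David describes, with
made-up data (swap in your own x, y and degree):

import numpy as np

x = np.linspace(0., 10., 50)
y = 1.0 + 2.0 * x - 0.1 * x**2 + np.random.normal(scale=0.5, size=x.size)
deg = 2

nboot = 100
boot_coeffs = np.empty((nboot, deg + 1))
n = len(x)
for i in range(nboot):
    idx = np.random.randint(0, n, size=n)            # resample with replacement
    boot_coeffs[i] = np.polyfit(x[idx], y[idx], deg)

coeff_std = boot_coeffs.std(axis=0)   # empirical std dev of each coefficient

More resamples give a smoother estimate; 100 is just the number suggested above.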
I am using scipy.polyfit to fit a curve to my data. Members of this
list have been integral in my understanding of how to use this
function. Now I would like to know how I can get the uncertainties
(standard deviations) of polynomial coefficients from the returned
values from scipy.polyfit. If I understand correctly, the residuals
are sometimes called the R^2 error, right? That gives an estimate of
how well we have fit the data. I don't know how to use rank or any of
the other returned values to get the uncertainties.
Can someone please help?
the easiest way to get the full statistical results for the fit is to
use statsmodels
something like
import scikits.statsmodels as sm
results = sm.OLS(y, np.vander(x, order + 1)).fit()
dir(results)
with standard deviations on parameter estimates, ...
results.params should be the same as np.polyfit
results.model.predict(xnew)
and the prediction errors are in there somewhere
without statsmodels it is a few lines of code, but requires "thinking"
Josef
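The "few lines of code" version, as a sketch under the usual iid
Gaussian-noise assumption: the coefficient covariance is sigma^2 * inv(X'X),
where X is the Vandermonde design matrix that polyfit builds internally (in
statsmodels, results.bse holds the same standard errors, at least on current
versions). Made-up data below:

import numpy as np

x = np.linspace(0., 10., 50)
y = 2.0 + 0.5 * x - 0.3 * x**2 + np.random.normal(scale=0.5, size=x.size)
deg = 2

X = np.vander(x, deg + 1)                        # design matrix, columns x**2, x, 1
coeffs, res, rank, sv = np.linalg.lstsq(X, y)    # same coefficients as np.polyfit
resid = y - np.dot(X, coeffs)
sigma2 = np.dot(resid, resid) / (len(y) - (deg + 1))   # estimated noise variance
cov = sigma2 * np.linalg.inv(np.dot(X.T, X))           # covariance of the coefficients
std_err = np.sqrt(np.diag(cov))                        # the standard deviations asked for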
Thanks Chuck. That seems to get closer to what I need. I am just
fitting data to a polynomial, nothing too fancy. I would like the
variance (not the covariance) of the coefficients. As for what is
returned from scipy.polyfit, this is what the documentation says:
residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.
I don't think any of these things is the "design matrix" you have
indicated I need. Unfortunately, the documentation for linalg.lstsq
does not say what rcond is. Any ideas?
Just an important warning: for polynomials, especially high degree
polynomials, the coefficients are an awful way to specify them. If
what you really need is the coefficients, then you're stuck with it,
but if what you need is to estimate interpolated values, there are
better approaches.
I mention this in particular because you talk about needing "variance
rather than covariance". When you do a linear fit of a
several-parameter model to data with normal errors, the uncertainty in
the fitted parameters is a multidimensional Gaussian - that is, the
(say) one-sigma error region is a multidimensional ellipsoid. If you
chose a good parameterization for your model, this ellipsoid will line
up nicely with the axes, and you can describe it by its size along
each axis: that is, you only need a variance in each parameter.
But if your parameterization is not good, then this ellipse will be
some tilted long skinny shape, and taking its size along one axis can
drastically underestimate its size. You also have the problem that
there are correlations in your parameters: if one is high, say, then
the other is also likely to be high. The covariance matrix captures
all this information along with the variances you asked for. As you
might expect, life is vastly simpler if the ellipsoid is aligned with
the axes, so that this matrix is nearly diagonal and you can just read
the variances off the diagonal.
Unfortunately, the parameterization of polynomials by their
coefficients is not "good" in this sense: you almost always get very
long skinny non-aligned error ellipses. An easy example is if you're
fitting positive x,y pairs with a straight line y=mx+b. Then if m is
too low, b is almost certainly too high, since the uncertainty pivots
the line around the middle of the data set. With linear data you can
help this by using y = m(x-x_middle)+b, but when you go to
higher-order polynomials it all becomes messier and you can't usually
cure the problems this easily.
I would say, look closely at your problem and think about whether you
really need the polynomial coefficients.
Anne
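A toy illustration of Anne's tilted-ellipse point (made-up numbers): fit
y = m*x + b to positive x and compare the m-b covariance before and after
centering x.

import numpy as np

x = np.linspace(1., 10., 20)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.0, size=x.size)

for label, xs in [("raw x", x), ("centered x", x - x.mean())]:
    X = np.vander(xs, 2)                         # columns [x, 1]
    coef, _, _, _ = np.linalg.lstsq(X, y)
    resid = y - np.dot(X, coef)
    sigma2 = np.dot(resid, resid) / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(np.dot(X.T, X))
    # with centered x the off-diagonal term collapses to (numerically) zero
    print("%-11s cov(m, b) = %g" % (label, cov[0, 1]))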
Yikes! This sounds like it may be more trouble than it's worth. I
have a few sets of statistical data that each need to have curves fit
to them. I can see what the parameters are and they are similar
between each set. I want to know if the coefficients are within a few
standard deviations of each other. That's what I was hoping to get
at, but it seems to be more trouble than I'm willing to accept.
Unless anyone has a simpler solution.
Thanks,
Jeremy
You could try using the Chebyshev polynomials; they're in the new
numpy, or you can just use the formula (the degree-n one is
cos(n arccos(x)), and they're well behaved on [-1,1]). They're not
perfect, but they're a much better representation of polynomials that
gets rid of most of the problems I described above (the covariances
tend to be much smaller).
Anne
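A short sketch of what that looks like (made-up data; needs the new numpy,
>= 1.4, for numpy.polynomial): rescale x onto [-1, 1] and fit in the
Chebyshev basis.

import numpy as np
from numpy.polynomial import chebyshev as cheb

x = np.linspace(0., 10., 50)
y = 1.0 + 0.2 * x - 0.05 * x**2 + np.random.normal(scale=0.1, size=x.size)

t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # map the data onto [-1, 1]
c = cheb.chebfit(t, y, 2)      # coefficients in the Chebyshev basis
yfit = cheb.chebval(t, c)      # evaluate the fitted curve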
I don't think that multicollinearity matters so much when all the
coefficients are included in the null hypothesis, because the test is
based on the entire covariance of the parameter estimates and not just
the variances, e.g. with an F-test.
One of the first references for testing equality of regression
coefficients across regressions that google found is
http://www.stata.com/support/faqs/stat/testing.html
Josef
(google is swallowing my sent emails today, so it will come up double
or not at all)
> Yikes! This sounds like it may be more trouble than it's worth. I
> have a few sets of statistical data that each need to have curves fit
> to them.
Taylor series expansion is a "natural" source for higher order
polynomials, although as Anne pointed out, it's not a "nice" basis for
the space of functions. In most of the examples I have seen, 3rd order
was the highest; beyond that people switch to non-parametric methods
or Fourier, Chebyshev or Hermite bases.
I'm attaching the statsmodels equivalent of most of the example in the
Stata FAQ. A few parts are (still) a bit more complicated than the
Stata equivalent.
Final result: F-test rejects null of equal coefficients (small
p-value) if the coefficients are far enough apart, and does not reject
if the two regression equations are samples from the same data
generating process.
Josef
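The attachment isn't reproduced here, but a rough sketch of the same idea
(made-up data, and assuming your statsmodels version provides
results.f_test, as current ones do): stack the two data sets, interact a
group dummy with every regressor, and F-test that all the interaction
coefficients are zero.

import numpy as np
import scikits.statsmodels as sm   # on current versions: import statsmodels.api as sm

x1 = np.linspace(0., 10., 40)
x2 = np.linspace(0., 10., 40)
y1 = 1.0 + 0.5 * x1 - 0.2 * x1**2 + np.random.normal(scale=0.3, size=x1.size)
y2 = 1.1 + 0.5 * x2 - 0.2 * x2**2 + np.random.normal(scale=0.3, size=x2.size)

Xbase = np.vstack([np.vander(x1, 3), np.vander(x2, 3)])        # columns [x**2, x, 1]
group = np.concatenate([np.zeros(len(x1)), np.ones(len(x2))])  # 0/1 group dummy
X = np.column_stack([Xbase, group[:, None] * Xbase])           # base terms + interactions
y = np.concatenate([y1, y2])

res = sm.OLS(y, X).fit()
R = np.hstack([np.zeros((3, 3)), np.eye(3)])   # restrict the 3 interaction terms to zero
print(res.f_test(R))                           # small p-value => coefficients differ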
On Thu, Mar 25, 2010 at 3:40 PM, Jeremy Conlin <jlco...@gmail.com> wrote:
> Yikes! This sounds like it may be more trouble than it's worth. I
> have a few sets of statistical data that each need to have curves fit
> to them.
That's an awfully generic need - it may be obvious from examination of the data that a line is inappropriate, but besides polynomials there are many other non-linear models (which can be fit linearly by means of a data transformation) that have fewer parameters (and thus are simpler from a parameter-analysis perspective).
So the question is: why are you fitting to polynomials? If it's just to get a good fit to the data, you might be getting "more fit" than your data warrants (and even if that isn't a problem, you probably want to use a polynomial form different from "standard form," e.g., Lagrange interpolators).
Are you sure something like an exponential growth/decay or power-law model (both of which are "more natural," linearizable, two-parameter models) wouldn't be more appropriate? It would almost certainly be simpler to analyze (and perhaps easier to justify to a referee).
On this note, perhaps some of our experts might care to comment: what "physics" (in a generalized sense) gives rise to a polynomial dependency of degree higher than two? The only generic thing I can think of is something where third or higher order derivatives proportional to the independent variable are important, and those are pretty uncommon.
I guess if you were doing a bulk estimate of all of the other factors (fluid density & viscosity, drag coefficient, etc.), this would be an applicable use case.
Like I said, I'm hardly an expert...
-paul h.
And for many problems, the Taylor expansion might appropriately be taken out to
third order. Sometimes the even orders cancel out, so to get the first
non-linear effects you have to go to third order. Can't think of a specific
example offhand, but I've seen it quite a few times.
--
-----------------------------------------------------------------------
| Alan K. Jackson | To see a World in a Grain of Sand |
| al...@ajackson.org | And a Heaven in a Wild Flower, |
| www.ajackson.org | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake |
-----------------------------------------------------------------------
In pulsar timing, what you measure is phase (actually pulse arrival
time); pulse period/frequency is its derivative, and spin-down is the
derivative of period and measures the power available for producing
radiation. These two parameters are also used in standard formulas to
infer magnetic field and age. The second derivative of period (third
derivative of the measured quantity) is generally too small to be
measured, but when it can be it tells you something about the
mechanism by which the spin-down is happening.
Beyond the plain physics, there is typically some low-frequency noise
(nobody's sure where it comes from, possibly turbulence in the
superfluid layers of the neutron star) which is usually managed by
subtracting a best-fit polynomial (polynomial subtraction has the
property of killing off noise spectra that diverge at zero frequency,
which this noise appears to do).
I suppose power laws, though ubiquitous, do not exactly count as
polynomials; generally their exponents are not integers.
Anne
Crack permeability goes like the third power of the opening (that is,
fluid flow through cracks - think gas or oil in a fractured rock).
> I'm no expert, but the power required to overcome aerodynamic drag varies with the cube of speed
"Third power" == "x**3". He's not talking about a power law.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
I will only be fitting data to a first or second degree polynomial.
Eventually I would like to fit my data to an exponential or a power
law, just to see how it compares to a low-order polynomial. Choosing
these functions was based on qualitative analysis (i.e. "it looks
quadratic").
The best case scenario would be that I take what I learn from this
"simple" example and apply it to more difficult problems as they come
along down the road. It appears, however, that it's not so simple to
apply it to other problems. I wish I had more time to learn about
fitting data to curves. I'm sure there are a lot of powerful tools
that can help.
I should say, then, that most of my dire-sounding comments can be
ignored in this case as long as you rescale the data to (-1,1) before
fitting your quadratic. (Which, come to think of it, gives you very
nearly the Chebyshev basis, but never mind.)
Exponentials and power laws are even easier, since they're linear when
you take logs of the appropriate coordinates. (Though here too, it
helps to center your data before fitting.)
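For concreteness, a rough sketch of the log-log fit with made-up numbers (a
power law with multiplicative noise; an exponential works the same way,
taking the log of y only):

import numpy as np

x = np.linspace(1., 100., 50)
y = 3.0 * x**1.7 * np.exp(np.random.normal(scale=0.05, size=x.size))

b, ln_a = np.polyfit(np.log(x), np.log(y), 1)   # slope = exponent, intercept = ln(a)
a = np.exp(ln_a)
# Centering log(x) first improves conditioning, but then the intercept is
# ln(a) + b * mean(log(x)) rather than ln(a) itself.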
> The best case scenario would be that I take what I learn from this
> "simple" example and apply it to more difficult problems as they come
> along down the road. It appears, however, that it's not so simple to
> apply it to other problems. I wish I had more time to learn about
> fitting data to curves. I'm sure there are a lot of powerful tools
> that can help.
This is a fine approach; in fact even if you end up doing non-linear
fitting (i.e. fits where the function is not a linear combination of
the parameters), the standard approach to getting errors is to just
use the derivatives of the fitting function at the best fit to treat
the problem as locally linear. (There are more elaborate alternatives
as well, including Monte Carlo approaches, which may be worth looking
into.)
Anne
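A sketch of that locally-linear recipe with scipy.optimize.leastsq and a
made-up exponential-decay model: cov_x from leastsq is based on the Jacobian
at the best fit, and scaling it by the residual variance gives the parameter
covariance (the same convention curve_fit uses).

import numpy as np
from scipy.optimize import leastsq

def model(p, x):
    return p[0] * np.exp(-p[1] * x)

def residuals(p, x, y):
    return y - model(p, x)

x = np.linspace(0., 5., 60)
y = 2.0 * np.exp(-1.3 * x) + np.random.normal(scale=0.05, size=x.size)

p, cov_x, info, msg, ier = leastsq(residuals, [1.0, 1.0], args=(x, y),
                                   full_output=True)
s_sq = np.sum(residuals(p, x, y)**2) / (len(x) - len(p))
pcov = cov_x * s_sq                # covariance of the fitted parameters
perr = np.sqrt(np.diag(pcov))      # one-sigma parameter uncertainties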
On Sun, Mar 28, 2010 at 22:26, David Goldsmith <d.l.go...@gmail.com> wrote:
> "Third power" == "x**3". He's not talking about a power law.
> On Sun, Mar 28, 2010 at 11:36 AM, <al...@ajackson.org> wrote:
>>
>> Crack permeability goes like the third power of the opening (that is,
>> fluid flow through cracks - think gas or oil in a fractured rock).
>
> Power law or polynomial: from a regression standpoint, there's quite a big
> difference.
That looks to me like splitting hairs.
It depends on the statistical model for the regression error, e.g.
y = a*x**3 + u, where u is normal, additive noise,
or
y = a*x**3 * z, where z is log-normal, multiplicative noise, so that
ln(y) = ln(a) + 3*ln(x) + u with u = ln(z).
I would do it if I want to estimate or test if 3 is the correct power,
but not if 3 is known.
Josef
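A tiny sketch of the second (multiplicative-noise) case with made-up data,
where the log-space slope lets you check whether 3 really is the right power:

import numpy as np

x = np.linspace(1., 10., 50)
y = 0.7 * x**3 * np.exp(np.random.normal(scale=0.1, size=x.size))

slope, ln_a = np.polyfit(np.log(x), np.log(y), 1)
# slope should come out close to 3; if the power were known to be exactly 3,
# you would instead regress ln(y) - 3*ln(x) on a constant to get ln(a).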
"it" = (i.e., log transforming the data first, fixing the slope parameter at
three, and then regressing to find the constant term, i.e., log(a))?
Josef
I would shorten that to "always plot the residuals". In fact, that's
one of the problems with unbinned maximum-likelihood methods: there
often isn't any feasible notion of "residual".
As a quality-of-fit measure, the standard tool is the chi-squared
distribution: if you (linearly) fit an n-parameter model to an m-point
data set, and the model captures the true behaviour of your system,
then the sum of the squared errors divided by the variances is
distributed according to a chi-squared distribution with m-n degrees
of freedom. So you can calculate this number and evaluate the
probability that Gaussian noise alone would produce a quality-of-fit this
bad. If you get a tiny probability, you can be pretty sure your model
is not fitting the data adequately.
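A small sketch of that check (made-up data; sigma is the assumed per-point
measurement error):

import numpy as np
from scipy import stats

x = np.linspace(0., 10., 30)
sigma = 0.2
y = 1.0 + 0.5 * x + np.random.normal(scale=sigma, size=x.size)

coeffs = np.polyfit(x, y, 1)                 # n = 2 parameters
resid = y - np.polyval(coeffs, x)
chi2 = np.sum((resid / sigma)**2)
dof = len(x) - len(coeffs)                   # m - n degrees of freedom
p = stats.chi2.sf(chi2, dof)                 # chance of a fit at least this bad
# a tiny p means the model is probably not describing the data adequately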
Also worth noting is the highly annoying situation where you have what
purports to be a good fit: the chi-squared gives you a decent
probability of noise producing a fit this bad, and the error bars on
the residuals all nicely include zero, but there's *still* a
discernible trend in the residuals. I interpret this to mean that not
only does my model not capture the system's behaviour, but I have also
overestimated my error bars: the fact that there's a good fit in spite
of the trend means that there is not as much scatter in the residuals
as there should be.
Anne