x       y
20.0    19209
20.5    20627
21.0    22177
21.5    23877
22.0    25730
22.5    27794
23.0    30165
23.5    32792
24.0    35610
24.5    38738
25.0    42113
25.5    46138
26.0    50652
26.5    55283
27.0    60204
I plot them in an xy-chart, fit a 6th degree polynomial trendline and
display the coefficients in the chart:
coeff.of x^6 = -1.1536
coeff.of x^5 = 160.93
coeff.of x^4 = -9333.8
coeff.of x^3 = 288113
coeff.of x^2 = -5e6
coeff.of x^1 = 5E7
coeff.of x^0 = -2E8 (= intercept)
When I compute the polynomial for the input points, I can't reproduce the
y-values.
Nothing new so far; I have seen this before. So I increase the number of
displayed significant digits to 8, 12, or some other large number.
I still get nonsensical results.
Then I decide to compute the trendline myself using the following array
formula:
=linest(y,x^{1,2,3,4,5,6})
with the result for the above dataset:
coeff.of x^6 = -2.44965793
coeff.of x^5 = 343.7613651
coeff.of x^4 = -20056.29479,
coeff.of x^3 = 622756.8006,
coeff.of x^2 = -10853895.24,
coeff.of x^1 = 100677301.1
coeff.of x^0 = -388274281.1 (= intercept)
When I use these coefficients I get good agreement (in a least squares
sense) with the original data set.
Moreover, they are identical to the coefficients obtained with Excel's
regression tool from the Analysis ToolPak add-in.
But they are also dramatically different from the coefficients displayed on
the chart (see the first set of coefficients above), i.e. this is NOT
something that can be "simply" fixed by increasing the number of significant
digits.
To remove all doubt I ran the same dataset through MATLAB and confirmed the
correctness of the "linest" predictions.
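For anyone following along without MATLAB, a minimal NumPy sketch of the same
cross-check (np.polyfit solves an equivalent least-squares problem; the x and
y arrays are simply the dataset at the top of the thread):

import numpy as np

x = np.linspace(20.0, 27.0, 15)
y = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
              35610, 38738, 42113, 46138, 50652, 55283, 60204], dtype=float)

# 6th-degree least-squares fit; coefficients come back highest power first.
coeffs = np.polyfit(x, y, 6)
print(coeffs)
# Expect coefficients of the same character as the LINEST set above, though
# the exact digits may differ (and NumPy may warn) because the problem is so
# ill-conditioned in the raw variable.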
As a final exercise I "scaled" the independent variable so that
(x - 23.5) / 3.5 ranges between -1 and 1, and repeated the whole procedure.
Now the displayed trendline coefficients are in perfect agreement with the
computed coefficients (from linest, the regression tool and MATLAB alike).
In other words, it seems as though the Excel polynomial trendline coefficients
are only correct when the independent variable is first scaled or normalized
to lie within the range [-1,+1] or [0,1].
BTW: The same problems occur when trying to fit a lower-degree (e.g. 4th or
5th) trendline to this dataset.
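As an illustration of the same trick, a sketch using NumPy's newer polynomial
class (an assumption of this example, not part of the original procedure),
which rescales the x values onto [-1, 1] internally for exactly this reason:

import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(20.0, 27.0, 15)
y = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
              35610, 38738, 42113, 46138, 50652, 55283, 60204], dtype=float)

# Polynomial.fit maps x onto the window [-1, 1] before solving, i.e. it
# applies the same centring/rescaling described above.
p = Polynomial.fit(x, y, 6)
print(p.coef)            # well-behaved coefficients in the scaled variable
print(p.convert().coef)  # same polynomial in raw x: huge, delicate numbers
print(p(24.2))           # evaluating the fitted curve works either way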
If you're still with me: any explanations will be most welcome. But it seems
to me that this is a straightforward bug in Excel, and that the workaround is
not just to increase the number of significant digits, as suggested in some
MSKB articles.
Regards, Job Baar.
Hardly surprising. Fitting a 6th order polynomial is very ill-conditioned for
small, smooth datasets. Your data reaches a nearly perfect fit with a cubic,
and it is probably only justifiable to fit a quadratic!
> Nothing new, so far, I have seen this before. So I increase the number of
> significant digits to 8, 12 or whatever large number.
> I still get nonsensical results.
At least part of the problem is overfitting the data. Subtract the mean values
of x and y and then watch the other coefficients flop about hopelessly.
> Then I decide to compute the trendline myself using the following array
> formula:
>
> =linest(y,x^{1,2,3,4,5,6})
Enable statistics with this and watch what happens as you increase the order of
the polynomial fit. The bulk of your data is explained by a quadratic fit, a
cubic at the outside.
> with the result for the above dataset:
>
> coeff.of x^6 = -2.44965793
> coeff.of x^5 = 343.7613651
> coeff.of x^4 = -20056.29479,
> coeff.of x^3 = 622756.8006,
> coeff.of x^2 = -10853895.24,
> coeff.of x^1 = 100677301.1
> coeff.of x^0 = -388274281.1 (= intercept)
>
> When I use these coefficients I get good agreement (in a least squares
> sense) with the original data set.
Though a quintic would give a technically closer least squares fit. The 6th
order polynomial is pushing the limits of numerical stability with these data
even with linest. Polynomial fitting has pathological behaviour if you try to
overfit your data.
> Moreover, they are identical to the coefficients obtained with Excel's
> regression tool from the Analysis Toolpack add-in.
It seems the polynomial fit code in the graph package is significantly less
numerically stable than the other one(s) in Excel. It does seem perverse of
them to have distinct implementations, one of which is clearly unstable with
respect to the all-too-common problem of users overfitting their data.
Still, most people won't notice, as they seldom if ever check the fit
displayed on their graphs.
> But they are also dramatically different from the coefficients displayed on
> the chart (see the first set of coefficients above)
> (i.e. this NOT something that can be "simply" fixed by increasing the number
> of significant digits).
> To remove all doubt I ran the same dataset through MATLAB and confirmed the
> correctness of the "linest" predictions.
>
> As a final exercise I have "scaled" the independent variable: (x - 23.5) /
> 3.5 ranges between -1 and 1, and repeated the whole procedure.
> Now the displayed trendline coefficients are in perfect agreement with the
> computed coeficients (both from linest, regression and Matlab).
> In other words it seems as though the Excel polynomial trendline coefficents
> are only correct when the independent variable is first scaled or normalized
> to lie within the range [-1,+1] or [0,1].
> BTW: The same problems occur when trying to fit a lower degree (e.g. 4-th or
> 5-th) trendline to this dataset.
It is removing the mean and rescaling that is protecting you from the
instabilities in the trendline fitting code. The problem stems from summing
high powers of numbers of the form A + e:
(A + e)^n ~ A^n + n*e*A^(n-1) + ...
When A >> |e| and n is large, the summation loses precision rapidly and
retains insufficient precision to allow you to reliably fit a high order
polynomial. Arranging that A = 0 and |e| < 1 greatly improves the condition
of the fitting matrix.
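To put a number on this, a sketch comparing the condition number of the
Vandermonde matrix a 6th-degree fit has to solve against, built once from the
raw x values and once from the rescaled values u = (x - 23.5)/3.5 (the NumPy
calls are just one way of doing the check):

import numpy as np

x = np.linspace(20.0, 27.0, 15)
u = (x - 23.5) / 3.5              # the rescaling used earlier: u runs -1 to +1

V_raw = np.vander(x, 7)           # columns x^6 ... x^0
V_scaled = np.vander(u, 7)        # columns u^6 ... u^0

print(np.linalg.cond(V_raw))      # enormous: rounding errors swamp the solution
print(np.linalg.cond(V_scaled))   # modest: the problem is well behaved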
There is no justification for Excel to have two separate polynomial fitting
codes, especially since one of them demonstrates textbook numerical
instabilities!
Moral of the story: be careful when fitting models to data. Do not overfit!
Regards,
Martin Brown
Tony
"Martin Brown" <martin...@pandora.be> wrote in message
news:393DFA03...@pandora.be...
> I don't think that one of the methods from Excel demonstrates
> instabilities... what I think happens is that when the chart fit is used, it
> only gives you one significant digit for the coefficients that are in SCI
> notation; as the numbers are huge, the error introduced by this is great.
Try it with his data and see. The polynomial fit is unstable for 5th & 6th order
polynomials. This *isn't* a cosmetic display fault. Linest gets a different
"solution" (and its 6th order fit is measurably worse in a least squares sense
than the 5th order one).
This is indisputable - the coefficients are radically different. Both
"solutions" have different amounts of a spurious alternating term polynomial
added to them. It is a recognised pathological behaviour of polynomial fitting
routines.
> Job is right in that you can probably find lots of solutions with a very
> similar "least square" error sum; the chart fit is converging to a different
> one than the add-in. Anyway, I'm pretty sure that if you could ask the
> chart fit to give you more significant figures (for the coefficients in SCI
> notation) you would get pretty good results.
Not for these high order fits. The trendline equation on the charts is
different.
It more or less behaves itself when the data are not overfitted.
> That being so, I guess you could say that's a bug, although the answer I'm
> sure is right to that one significant digit.
It is pretty worrying that they have two different algorithms being used which
in extremis give radically different answers for the coefficients (even if the
fitted lines are similar). And that they do not warn users when they overfit
their data. Both are rough least squares solutions.
The chart trendline I got with these data was:
y = -1.15362917189E+00*x^6 + 1.60934413092E+02*x^5 - 9.33381485848E+03*x^4
    + 2.88112860438E+05*x^3 - 4.99222017148E+06*x^2 + 4.60407028653E+07*x
    - 1.76557154080E+08
R^2 = 9.99986682823E-01
I was somewhat surprised that it is a *better* fit than the Linest result.
What is perhaps more interesting is that by my reckoning the trendline equation
is marginally closer to a true least squares solution. I checked the residuals
after computing the model fit.
Trendline coefficients extracted in SCI format to 11 dp gave chi-squared
~ 31822, whereas the solution from Linest was about 2x higher at ~ 78710.
Note that a 5th order polynomial fit with both Linest and Trendline has
chi-squared ~ 68917, i.e. *better* than Linest's 6th order fit!
It seems that Linest is not quite as accurate as Trendline but the results *are*
different.
It is surprising that Matlab gives the same answer as Linest.
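A rough sketch of that residual check, reconstructed from the dataset and the
two 6th-degree coefficient sets quoted above (the exact sums depend on how
many digits of the coefficients are carried):

import numpy as np

x = np.linspace(20.0, 27.0, 15)
y = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
              35610, 38738, 42113, 46138, 50652, 55283, 60204], dtype=float)

# Coefficients in decreasing powers, x^6 ... x^0, as quoted earlier.
linest_c = [-2.44965793, 343.7613651, -20056.29479, 622756.8006,
            -10853895.24, 100677301.1, -388274281.1]
trend_c = [-1.15362917189E+00, 1.60934413092E+02, -9.33381485848E+03,
           2.88112860438E+05, -4.99222017148E+06, 4.60407028653E+07,
           -1.76557154080E+08]

for name, c in (("LINEST", linest_c), ("Trendline", trend_c)):
    resid = y - np.polyval(c, x)                   # misfit at each data point
    print(name, "sum of squared residuals:", np.sum(resid**2))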
Regards,
Martin Brown
Thanks for all the feedback. As Martin correctly observed, this is a nice
textbook problem in the art of numerical analysis.
FYI: the following table contains the R-squared, standard error and ANOVA
F-test parameter and its significance as a function of the degree of the
fitted polynomial (all computed with Excel's built-in regression tool from
the Analysis ToolPak):
degree    R^2            SE        F          significance F
1         0.963748498    2580.7    345.6      9.53E-11
2         0.999489462    318.8     11746.3    1.77E-20
3         0.999956627    97.0      84533.5    2.85E-24
4         0.999959436    98.4      61628.0    6.59E-22
5         0.999971144    87.5      62377.6    3.85E-20
6         0.999967044    99.2      40456.3    1.77E-17
It clearly confirms Martin's point that fitting anything more than a second
or third degree makes little sense. Having said that, I think it is perhaps
too much to expect from Excel to warn the user of the "Chart-Trendline"
against possibly overly zealous attempts to fit improper models. But I
maintain that this is basically a bug: when Excel advertises to return the
least-squares n-th degree polynomial, the returned coefficients should be
identical to those obtained with other identical methods (e.g. linest
formula, or regression from analysis toolpack) - irrespective of whether or
not the fitted model makes statistical sense or not.
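For completeness, a sketch of how the same per-degree statistics can be
reproduced outside Excel (NumPy is used here only for illustration; for
degrees 5 and 6 the exact numbers are sensitive to the conditioning issues
discussed above):

import numpy as np

x = np.linspace(20.0, 27.0, 15)
y = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
              35610, 38738, 42113, 46138, 50652, 55283, 60204], dtype=float)
n = len(y)
ss_tot = np.sum((y - y.mean())**2)

for k in range(1, 7):
    fitted = np.polyval(np.polyfit(x, y, k), x)
    ss_res = np.sum((y - fitted)**2)
    r2 = 1.0 - ss_res / ss_tot                             # R-squared
    se = np.sqrt(ss_res / (n - k - 1))                     # standard error
    f = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))   # ANOVA F statistic
    print(k, r2, se, f)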
Thanks again for your valuable contributions.
BTW: I have tried to find a link on the MS website where I can report this.
Is anybody aware of such a link?
Regards, Job Baar.
> Thanks for all the feedback. As Martin correctly observed, this is a nice
> textbook problem in the art of numerical analysis.
>
> FYI: the following table contains the R-squared, standard error and ANOVA
> F-test parameter and its significance as a function of the degree of the
> fitted polynomial (all computed with Excel's built-in regression tool from
> the analysis toolpack):
>
> degree    R^2            SE        F          significance F
> 1         0.963748498    2580.7    345.6      9.53E-11
> 2         0.999489462    318.8     11746.3    1.77E-20
> 3         0.999956627    97.0      84533.5    2.85E-24
> 4         0.999959436    98.4      61628.0    6.59E-22
> 5         0.999971144    87.5      62377.6    3.85E-20
> 6         0.999967044    99.2      40456.3    1.77E-17
>
> It clearly confirms Martin's point that fitting anything more than a second
> or third degree makes little sense.
And it highlights the problem of the 6th order Linest/ToolPak fit being wrong.
I prefer the sum of squared residuals as a direct measure of quality of fit.
It seems to me that the version in the chart trendline is pretty good, and
that the agreement between Excel and Matlab is a pure fluke (or they share the
same numerical instability).
> Having said that, I think it is perhaps
> too much to expect from Excel to warn the user of the "Chart-Trendline"
> against possibly overly zealous attempts to fit improper models.
Why? Excel has all the statistical evidence, and far too many errors are
caused by people overfitting high order polynomials to small amounts of data.
It should certainly warn when it is on the edge of failing internally due to
numerical instabilities!
> But I maintain that this is basically a bug: when Excel advertises to return
> the
> least-squares n-th degree polynomial, the returned coefficients should be
> identical to those obtained with other identical methods (e.g. linest
> formula, or regression from analysis toolpack) - irrespective of whether or
> not the fitted model makes statistical sense or not.
The problem is deciding which routine is at fault. It is not that the chart
trendline is wrong - at least on your data it is giving a more accurate least
squares solution.
The problem is that the trendline is different from the answer you would get by
using the other functions in Excel which purport to do the same calculation.
They have used two different algorithms; it is hard to say for sure which one
is superior.
On these data at least the trendline is more accurate - it may not always be so.
The KB entry on this is vacuous and tries to paper over the cracks.
> BTW: I have tried to find a link on the MS website where I can report this.
> Is anybody aware of such a link?
You could try
http://support.microsoft.com/support/webresponse.asp?FR=0
I expect they will just point you back at the KB entry you already posted.
The Mathematica 4 solution is a bit better than Linest, but not quite as good
as Trendline, with a chi-squared ~ 64435.
This shows how nasty the resulting matrix problem must be to solve.
I would have expected Mathematica to do better than that.
Regards,
Martin Brown
This data set appears to be from a system that is undergoing exponential growth.
A system of this type can be modeled using the equation:
y = a*EXP(b*x)
To see if a system is governed by an exponential equation, plot the data on an
x-y graph, then change the y-scale to logarithmic (keep the x-scale linear).
If the system is exponential, the data should fall on a straight line.
You can fit a line through this data set using Excel's linear regression
function on the data set [x, LN(y)] (i.e., fill a new column with the LN(y)
values). Linear regression will give you:
LN(a) = intercept coefficient (in other words, a = EXP(intercept coefficient))
b = 'X variable 1' coefficient
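A minimal sketch of that procedure, done with NumPy rather than a spreadsheet
(an illustration only, using the dataset from the top of the thread):

import numpy as np

x = np.linspace(20.0, 27.0, 15)
y = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
              35610, 38738, 42113, 46138, 50652, 55283, 60204], dtype=float)

# Straight-line fit to (x, ln y): the slope is b, the intercept is ln(a).
b, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)
print(a, b)    # roughly a ~ 702 and b ~ 0.164 for this dataset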
I've made these calculations for the data you gave in your example. The
equation:
y = 701.82*EXP(0.1642*x)
gives a very good fit to this data set. Systems that exhibit exponential
growth are very common in nature, including heat transfer problems, bacterial
population growth rates, and chemical reaction rates. You can fit a polynomial
curve to a set of points generated by an exponential system, but the
polynomial curve will not tell you anything about the underlying properties of
the system.
Best of luck
Paul O.
You're quite right in pointing out that I haven't revealed where the data
comes from.
Actually it comes from a ship power prediction program: the independent
variable is speed in knots and the dependent variable is power in kW.
From a physical point of view, there is some support for a quadratic
dependency of ship resistance on speed and, hence, power should be cubically
dependent on speed (but this is somewhat of an oversimplification). Based on
my (not negligible) understanding of the subject, I believe that the good
quality of your exponential fit is somewhat fortuitous, but, hey, it gets
the job done if that's what one is after. Also, a significant part of the
resistance is due to wavemaking and that should depend on speed more or less
like A * exp (-B/V^2) where A and B are constants and V is speed.
Having said all that, the person who sent me the data only had a need to
interpolate (!) the dataset at one specific speed, say 24.2 knots. In order
to do this he had made a spreadsheet which fitted the 6th degree polynomial
(supposedly least squares!) curve and used that equation to "interpolate".
The point is that he didn't really need a numerical model to approximate the
data set, but rather just some effective means to interpolate it at one
particular speed. I have since then advised him to do this by means of low
order polynomial interpolation through 2 to 4 points, or, perhaps even
better, by means of cubic spline interpolation, and have provided him with
an add-in to do so.
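As an illustration of that last suggestion, a sketch using SciPy's spline
interpolator rather than the add-in mentioned above (the function name and
call are SciPy's, not the add-in's):

import numpy as np
from scipy.interpolate import CubicSpline

speed = np.linspace(20.0, 27.0, 15)       # knots
power = np.array([19209, 20627, 22177, 23877, 25730, 27794, 30165, 32792,
                  35610, 38738, 42113, 46138, 50652, 55283, 60204],
                 dtype=float)             # kW

spline = CubicSpline(speed, power)        # piecewise cubic through the points
print(spline(24.2))                       # interpolated power at 24.2 knots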
Regards, Job J.M. Baar
Paul Oosterhout <paul.oo...@noaa.gov> wrote in message
news:3946836B...@noaa.gov...