"Fitting Data to a Multiple Regression Model. A Challenge."
It all started on June 17, in the course of my reply to a
question from Bruce Weaver, who suggested that I might have
missed the "model building" part of George Box's Inductive/
Deductive Loop in his JASA article "Science and Statistics".
After explaining Box's diagram, I wrote,
RF> For another occasion, I can post an article from my Lecture Note
RF> Chapter on "Statistical significance" vs "Practical Significance"
RF> in which I took a data set from the SPSS Manual in the early 1970s
RF> (p. 359 of the 1975 SPSS Manual, to be precise) in which the
RF> regression results (with three independent variables) were shown
RF> by SPSS to be statistically highly significant in every respect,
RF> but the manual failed to notice that the model was completely
RF> on the basis of prediction intervals.
RF> My lesson was to show that when properly analyzed through the
RF> "model building" process facilitated by the IDA package, a SIMPLE
RF> regression model, using only ONE of the three variables and the
RF> same data given in the manual, did a better job (in every respect)
RF> than the multiple regression model shown in the SPSS Manual.
RF> Sounds astounding? It is. And it was in black and white. I know
RF> some of you are curious about this, but you have to wait for
RF> another time and another place for it.
This apparently caught G. Robin Edwards' attention and interest, when
he posted two days later, on June 19,
RE> Totally lacking access to anything to do with SPSS, I would
RE> be very pleased to be told how I might get hold of that
RE> data set. Sounds like an interesting and instructional
RE> exercise to try analysing it from a standing start.
After a week of unsuccessful pleas for help in getting hold of the
dataset in an easily retrievable form, I typed the dataset in by
hand (as I had promised Robin on Day 1 that I would, if I had to),
and the "Model Building" EXERCISE finally got off the ground
on June 27.
Jerry Dallal, three hours after I posted the data at 1:59 AM,
and Russell Martin a few hours later, had already worked through
my intended LESSON 1 -- ALWAYS check the accuracy of the data and
examine it via some graphical displays.
By 10:06 AM June 27, the corrected data was in place:
RF> The early spotting of the gross typo took us to first base --
RF> there were TWO typos in the 1975 SPSS Manual. For subsequent
RF> analysis, change the 1960 and 1961 values of GNP as follows:
RF> 15849 ----> 25849 (good guess by Russell)
RF> 25615 ----> 26515 (common transposition error, "56" for "65")
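In the spirit of LESSON 1, here is a minimal sketch of a numerical
screen that catches exactly this kind of keying error. Everything
here is illustrative: the series is made up (with one planted typo
of the 25849-mis-keyed-as-15849 variety), and `flag_gross_typos` is
a hypothetical helper, not part of SPSS, IDA, or any other package
discussed in this thread.

```python
import numpy as np

def flag_gross_typos(x, z_thresh=3.0):
    """Flag points whose first difference is wildly out of line with
    the rest of the series -- a cheap screen for keying errors."""
    d = np.diff(np.asarray(x, dtype=float))
    med = np.median(d)
    mad = np.median(np.abs(d - med)) or 1.0   # guard against zero MAD
    z = (d - med) / (1.4826 * mad)            # robust z-scores of the jumps
    # A typo produces one huge jump in and one huge jump back out,
    # so both the bad point and its successor tend to get flagged.
    return sorted({i for i, zi in enumerate(z, start=1) if abs(zi) > z_thresh})

# Hypothetical, smoothly growing series with one planted digit error.
series = [20000, 21000, 22100, 23200, 24400, 15849, 26900, 28200]
print(flag_gross_typos(series))  # -> [5, 6]: the typo at index 5 stands out
```

A scatterplot of the raw series would show the same thing at a
glance, which is the real point of the lesson.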
After a few rounds of further discussion and clarification about
the intent and purposes of the EXERCISE -- treating the data as
cross-sectional, as the SPSS Manual had (badly) done -- the race
was on ... so far with only one driver -- Jerry.
Perhaps prompted by my hint that one variable did a better job than
three (the hint that drew Robin's initial interest), Jerry quickly
took a giant step forward, noting:
JD> FWIW, blindly throwing everything, including year, into a
JD> multiple linear regression equation (no interactions, no
JD> transformations) gives an RMS of 622. Using GNP alone, a
JD> linear-linear spline with a knot at 13770 gives an RMS of 335.
which is Jerry's way of saying he could beat the SPSS three-variable
model with just ONE of the variables too -- and substantially so,
reducing the RMS from 622 to 335.
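Jerry's "linear-linear spline with a knot" is just ordinary least
squares on a design matrix with a hinge term. A sketch, using
made-up data (not the SPSS Manual data) generated from a known
spline so the recovered coefficients can be checked; only the knot
value 13770 is taken from Jerry's post:

```python
import numpy as np

def fit_spline_ols(x, y, knot):
    """OLS fit of y = b0 + b1*x + b2*max(x - knot, 0):
    one slope below the knot, a different slope above it."""
    x = np.asarray(x, dtype=float)
    hinge = np.maximum(x - knot, 0.0)           # the "broken stick" term
    X = np.column_stack([np.ones_like(x), x, hinge])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta  # (intercept, slope below knot, EXTRA slope above knot)

# Hypothetical, noise-free data from a known spline, to check recovery.
x = np.linspace(5000, 25000, 40)
y = 180.0 - 0.016 * x + 0.028 * np.maximum(x - 13770, 0.0)
b0, b1, b2 = fit_spline_ols(x, y, knot=13770)
print(round(b0, 3), round(b1, 5), round(b2, 5))  # -> 180.0 -0.016 0.028
```

In practice the knot itself would be chosen by a grid search over
candidate knot values, picking the one with the smallest RMS.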
In a sense, Jerry has turned from critic of the SPSS model to a
sponsor of his knot-and-spline model as an obviously "better one",
even without a fancy layout with a triple somersault and a twist.
I gave Jerry a solid endorsement, but with a little prompt for
analysis and discussion toward the next lessons,
RF> You can't eat RMS and you can't drink it.
Let's pause for a moment and reflect on George Box's notion of the
ITERATIVE cycles of critic and sponsor in a scientific model-building
process.
I am in complete agreement with what George had to say in his 1976
JASA article on "Science and Statistics", and had this to say myself
in my "Interactive Data Analysis" article in the Encyclopedia of
"At the exploratory stage, the analyst does numerical detective work
via graphic and semigraphic displays, and often does a variety of
tasks in data editing, < ... > At the model building stage, the
analyst acts both as a sponsor and a critic of one or more tentative
models. < ... > At the conclusion of an iterative process of
probing, confirmatory analysis, if it's done at all, is seldom a
major part of the entire analysis." (Volume 6, p. 187)
It's been almost 24 hours since I posted the SPSS Manual data (with
two gross errors in it). Jerry has already made much progress along
the lines I described in my encyclopedia article, short of sponsoring
an actual model and scrutinizing its performance.
I hope when I continue with this LESSON 2, there will be models
sponsored by OTHERS, before I show the model I sponsored 30 years
ago, before continuing to LESSON 3, the seldom noticed or discussed
topic of PRACTICAL significance vs STATISTICAL significance.
Goodnight. It's been a good day in statistical discussions.
Reef Fish wrote:
> This is the continuation of the thread
> "Fitting Data to a Multiple Regression Model. A Challenge."
> This apparently caught G. Robin Edwards' attention and interest, when
> he posted two days later, on June 19,
> RE> Totally lacking access to anything to do with SPSS, I would
> RE> be very pleased to be told how I might get hold of that
> RE> data set. Sounds like an interesting and instructional
> RE> exercise to try analysing it from a standing start.
> I hope when I continue with this LESSON 2, there will be models
> sponsored by OTHERS, before I show the model I sponsored 30 years
> ago, before continuing to LESSON 3, the seldom noticed or discussed
> topic of PRACTICAL significance vs STATISTICAL significance.
Robin, having laboriously typed in the data specially for you, I
hesitate to continue any further without getting some input/output
from you.
How are you doing from your standing start, now that you have gotten
some hints from previous posts on the EXERCISE?
entitled "Exploratory Time Series Analysis", with kudos to John Tukey
and G.E.P. Box.
When conducting a regression analysis of time series data one has to
consider a number of relevant issues.
1. Is the mean of the errors from a tentative model zero throughout,
or are there identifiable Pulses, Level Shifts, or Seasonal Pulses
violating the First Commandment? This example of 32 values has a
number of "unusual" values. Unusual values require a determination of
what is usual, i.e. what is signal (the results from the equation).
Routine identification of Interventions immediately points out the
1960 anomaly (point 26 of 32) and a smaller one at point 27.
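The routine identification of interventions in point 1 can be
sketched very crudely: fit a tentative model, then flag any
observation whose standardized residual is extreme as a candidate
one-period PULSE. This is only the basic idea, not AUTOBOX's actual
algorithm; the 32-point series below is made up, with a spike
planted at point 26 to echo the 1960 anomaly.

```python
import numpy as np

def detect_pulses(y, X, z_thresh=3.0):
    """Flag observations whose OLS residual is extreme; each flagged
    point is a candidate one-period PULSE intervention dummy."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    scale = np.std(resid, ddof=X.shape[1])      # residual SD, df-corrected
    return [int(i) for i in np.flatnonzero(np.abs(resid / scale) > z_thresh)]

# Hypothetical trend series of 32 points with one planted spike.
t = np.arange(32.0)
X = np.column_stack([np.ones(32), t])
y = 2.0 + 3.0 * t
y[25] += 50.0               # spike at 0-based index 25, i.e. point 26 of 32
print(detect_pulses(y, X))  # -> [25]
```

A confirmed pulse would then enter the model as a 0/1 dummy input,
exactly as the I~P00026 and I~P00027 series do below.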
2. By pre-whitening, one can use correlative tools to identify
possible relationships between time series. In order to identify a
possible Transfer Function (a regression model with time series
data), we can intelligently pre-select any necessary lags in any of
the input series that might be useful. This led to a model with
significant structure, as follows:
MODEL COMPONENT                     LAG    COEFF      STANDARD    P
#                                  (BOP)              ERROR       VALUE
 1  CONSTANT                              -111.       12.3        .0000
    INPUT SERIES X1  GNP
 2  Omega (input) - Factor # 1       0    .149E-01    .282E-02    .0000
    INPUT SERIES X2  C_PROF
 3  Omega (input) - Factor # 2       0    -.141       .473E-01    .0062
    INPUT SERIES X3  C_DIVD
 4  Omega (input) - Factor # 3       0    .428        .673E-01    .0000
    INPUT SERIES X4  I~P00026   26 PULSE
 5  Omega (input) - Factor # 4       0    145.        32.3        .0001
    INPUT SERIES X5  I~P00027   27 PULSE
 6  Omega (input) - Factor # 5       0    54.8        22.7        .0232
Number of Residuals           (R) = n         32
Number of Degrees of Freedom      = n-m       26
Sum of Squares                    = Sum R**2  15429.9
Variance                          = SOS/(n)   482.183
R Square =
The R Square includes the effect of the two Pulses and is thus
vastly inflated.
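For readers who want to try the pre-whitening of point 2 concretely,
here is a crude AR(1) version of the Box-Jenkins recipe: fit an
autoregression to the INPUT series, filter both series with it, then
read candidate lags off the cross-correlations of the filtered
series. This is a sketch of the general technique, not AUTOBOX's
implementation, and the two series are simulated (y responds to x
with a two-period delay).

```python
import numpy as np

def ar1_coef(x):
    """Lag-1 autoregression coefficient of a mean-removed series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))

def prewhitened_ccf(x, y, max_lag=4):
    """Filter both series by the AR(1) model fit to the INPUT x, then
    return cross-correlations at lags 0..max_lag (lag k: x leads y)."""
    phi = ar1_coef(x)
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    a = x[1:] - phi * x[:-1]        # pre-whitened input
    b = y[1:] - phi * y[:-1]        # output filtered by the SAME model
    out = []
    for k in range(max_lag + 1):
        num = np.dot(a[: len(a) - k], b[k:])
        den = len(a) * np.std(a) * np.std(b)
        out.append(float(num / den))
    return out

# Hypothetical series: y responds to x with a two-period delay.
rng = np.random.default_rng(42)
x = rng.standard_normal(300)
y = np.concatenate([np.zeros(2), 0.9 * x[:-2]]) + 0.1 * rng.standard_normal(300)
ccf = prewhitened_ccf(x, y)
print(int(np.argmax(np.abs(ccf))))  # -> 2, the dominant lag
```

The spike in the cross-correlation function at lag 2 is what tells
the modeler to enter x into the transfer function at that lag.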
So far ... nothing out of the ordinary.
3. One has to verify that the error variance from the proposed model
is invariant over time. Examination of the need for a power transform
(logs), a variance-stabilization transform (GARCH), and the
possibility of a structural change in variance yielded nothing
statistically significant. If one of these tests had come out
positive, the remedy would be either a Box-Cox approach or a
Generalized Least Squares approach, where the H transformation
weights would be gleaned from the GARCH model or from the
significant break points detected by the Tsay variance test.
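The Box-Cox check mentioned in point 3 amounts to a profile
log-likelihood search over the transform parameter lambda. A
pure-NumPy sketch (the grid search stands in for the usual
optimizer, and the data are simulated lognormal values, so the
search should land near lambda = 0, i.e. logs):

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter lam for
    positive y (up to an additive constant)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam
    sigma2 = np.var(z)
    # Jacobian term (lam - 1) * sum(log y) makes lambdas comparable.
    return -0.5 * n * np.log(sigma2) + (lam - 1.0) * np.sum(np.log(y))

def boxcox_lambda(y, grid=np.linspace(-2, 2, 81)):
    """Grid value of lam maximizing the profile log-likelihood."""
    return float(max(grid, key=lambda lam: boxcox_loglik(y, lam)))

# Hypothetical lognormal data: logs are the "right" transform here.
rng = np.random.default_rng(1)
y = np.exp(rng.normal(loc=2.0, scale=0.8, size=500))
print(boxcox_lambda(y))  # should be near 0
```

A lambda near 1 means no transform is needed; near 0 points to
logs; anything clearly away from both would be "statistically
significant" evidence for a power transform.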
4. One now must test for transience or structural break points in
the parameters over time. Just because 32 values exist does not mean
that all 32 should be used to form the model and/or to estimate the
appropriate parameters. By employing an efficient grid search, one
can find that there is a statistically significant difference
between 1935-1944 and 1945-1966, suggesting that one should either
model the transience in the parameters or (as we did) simply use
the most recent 22 years.
The P value for the Chow F test of the regression coefficients for
1935-1944 vs 1945-1966 was .0000256, so we now continue with an
analysis of the last 22 years.
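The Chow F test used above compares the pooled fit against separate
fits on the two sub-periods. A sketch with the standard formula
F = ((SSR_pooled - SSR1 - SSR2)/k) / ((SSR1 + SSR2)/(n - 2k)); the
32-point series is simulated with a deliberate coefficient change,
not the actual thread data:

```python
import numpy as np

def chow_f(y, X, split):
    """Chow test: F statistic and degrees of freedom for H0
    'the OLS coefficients are the same before and after row split'."""
    def ssr(yy, XX):
        beta, *_ = np.linalg.lstsq(XX, yy, rcond=None)
        r = yy - XX @ beta
        return float(r @ r)
    n, k = X.shape
    s_pool = ssr(y, X)                # restricted: one model for all rows
    s1 = ssr(y[:split], X[:split])    # unrestricted: separate fits
    s2 = ssr(y[split:], X[split:])
    f = ((s_pool - s1 - s2) / k) / ((s1 + s2) / (n - 2 * k))
    return f, k, n - 2 * k

# Hypothetical 32-point series whose intercept and slope change at
# row 10, echoing the 1935-1944 vs 1945-1966 split.
rng = np.random.default_rng(0)
t = np.arange(32.0)
X = np.column_stack([np.ones(32), t])
y = np.where(t < 10, 5.0 + 0.1 * t, -15.0 + 2.0 * t) + 0.1 * rng.standard_normal(32)
f, df1, df2 = chow_f(y, X, split=10)
print(f > 100.0, df1, df2)  # a huge F on (2, 28) df: clear structural break
```

The F statistic is then referred to the F(k, n-2k) distribution for
the P value; the .0000256 reported above is that tail probability.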
5. The model for the last 22 years is
 1  CONSTANT                              -179.       18.5        .0000
    INPUT SERIES X1  C_PROF
 2  Omega (input) - Factor # 1       0    .252        .378E-01    .0000
    INPUT SERIES X2  C_DIVD
 3  Omega (input) - Factor # 2       0    .244        .840E-01    .0110
    INPUT SERIES X3  I~P00017   17 PULSE
 4  Omega (input) - Factor # 3       0    68.1        12.0        .0000
    INPUT SERIES X4  I~P00006    6 PULSE
 5  Omega (input) - Factor # 4       0    -56.7       13.0        .0006
    INPUT SERIES X5  I~P00003    3 PULSE
 6  Omega (input) - Factor # 5       0    -47.6       11.9        .0012
    INPUT SERIES X6  I~P00010   10 PULSE
 7  Omega (input) - Factor # 6       0    36.1        12.3        .0105
where time period 17 of 22 periods is 1961; period 6 reflects 1950;
period 3, 1947; and period 10, 1954; with an R-Squared (based on
22 data points) of .9885:
Number of Residuals           (R) = n         22
Number of Degrees of Freedom      = n-m       15
Sum of Squares                    = Sum R**2  2830.47
Variance                          = SOS/(n)   128.658
Adjusted Variance                 = SOS/(n-m) 188.698
R Square                          = .988548
The modeler needs to be aware that he must safeguard and up-armor
the model in order to meet the Gaussian requirements. I hope this
helps, and I submit it as a show of thanks to ReefFish for
attempting to raise our collective standards for analysis. In the
end, the modeler must reasonably "prove" that the model parameters
are invariant over time, that the variance of the errors is
homogeneous over time, and that the mean of the errors is zero
everywhere -- or at least accept the hypotheses that violations of
these conditions are not statistically significant.
AUTOMATIC FORECASTING SYSTEMS
P.S. I have posted all the details of this analysis on our web site ..
> All,
Dave, thanks for your submission of your AUTOBOX analysis for the
discussion of the model-building aspects of the data set.
I was beginning to get the feeling of giving a party and having
no one show up.
> entitled " Exploratory Time Series Analysis " with kudos to John Tukey
> and G.E.P. Box
> When conducting a regression analysis of time series data one has to
> consider a number of relevant issues.
While I had intended the analysis to be the cross-sectional regression
type (mostly for comparison to the SPSS Manual example) even though
the data is more appropriately analyzed as multiple time series, I
do welcome your TIME SERIES analysis, because it'll be a part of
the final discussion on the difference between the two approaches
for the particular data.
While we are still waiting for some non-time-series analyses to be
submitted, I want to ask you to clarify HOW MUCH of what you've
shown was "automatic" (as suggested by 'autobox', a package with
which I am not familiar), as opposed to "manual" choices made by
YOU (or the user) during the process.
E.g., the detection of the two gross outliers in GNP, the choice
of the break point or what segment of the series to analyze, etc.
Some of this might be apparent in the detailed information given
in your webpage -- which I'll examine in greater detail if and
when time permits.
Thanks again for your contribution. I'll address your comments as
well as the results probably at the end of LESSON 3 (or whatever
number), at the conclusion of the SPSS-like (non-time-series)
analysis, before comparing those results to your time-series ones.
Totally automatic .... from beginning to end ...
Whereas you are not familiar with AUTOBOX or FreeFore, you were
familiar with its grandfather .. you purchased and used THE PACK
PROGRAM when you were in your former life at that UnNamed
University. You and others might be interested in a 50-year
perspective on econometric programs.
> Bob et al ..
> Totally automatic .... from beginning to end ...
Thanks for the clarification. Very interesting and quite impressive
in certain parts of the automata.
> Whereas you are not familiar with AUTOBOX or FreeFore, you were
> familiar with its grandfather .. you purchased and used THE PACK
> PROGRAM when you were in your former life at that UnNamed
> University. You and others might be interested in a 50-year
> perspective on econometric programs.
I might have known its great-great-great-great grandfather, through
something Harry Roberts might have done, but I have never USED or
PURCHASED any "PACK PROGRAM" at any of my former universities. :-)
Someone at those universities may have purchased (and paid for)
something in my name without my knowledge. As long as someone else
is PAYING, I usually don't mind. :-) Now I am curious as to WHEN
that happened. Send me a private note to my posting address.
No discredit on AUTOBOX or any of its ancestors. Just want to clarify
your comment, for the record. :)
This LESSON pertains to what I called the "golden goose", in my
follow-up to (and analysis of) Robin Edward's post, found during
the iterative loop of the model-building process.
It's interesting in a way that all FOUR of us found the same
"hockey stick" in the scatterplot of the INVDEX variable vs the
GNP variable. All four of us took DIFFERENT actions!!
In that respect, if we were working as a TEAM, we would put our
heads together on the four TENTATIVE models and decide what model
to sponsor next (if any).
Robin, understandably, let the golden goose slip away, and ended
up with the Daffy Duck -- nevertheless better than the SPSS Manual
result (with two huge errors in the data unnoticed):
INVDEX = -119 + 0.87621*C.DIVD.
This was based on all 32 observations, and would have yielded a
MSE of 1582, which is almost 5 times the MSE (or RMS) of 335 Jerry
got <in his previous model> with GNP alone as the predictor,
which was about HALF the RMS of 622 of the SPSS-like multiple
regression model <data blunders corrected> with the kitchen
sink thrown in.
Jerry ended with his knotty model, estimating the breakpoint between
1941 and 1942 (between the 7th and 8th rows of the data matrix):
invdex = 181.9 - 0.016 gnp
       + 0.028 I(gnp>=13770)*(gnp-13770)
       + 0.114 cprof + 56.7 I(year=1961)
where I(x) = 1 if x is true and 0 otherwise.
AUTOBOX found a breakpoint automatically, as a multiple time-series
model using only the last 22 periods, starting at 1945 (the 11th
row of the data matrix), using 7 degrees of freedom in the fit:
an R-Squared (based on 22 data points) of .9885
Number of Residuals           (R) = n         22
Number of Degrees of Freedom      = n-m       15
Sum of Squares                    = Sum R**2  2830.47
It's time to show my "antique model" of 30 years ago and how I
arrived at it. The "hockey stick" was noticed almost immediately,
in the scatterplot of INVDEX vs GNP. But what to DO with it was
the question.
Then it occurred to me that it made sense for the relation to be
nearly horizontal during the WWII era, with both variables growing
strongly along a linear trend in the post-war years.
Since the object of the fit is for PREDICTION purposes, I took
the "simplicity" approach of fitting the data ONLY between the
years 1943-1965, holding out 1966 for validation purposes.
That was IT! The "golden egg" from the golden goose. The egg
I found 30 years ago was about half way between Jerry's and
the AUTOBOX one. The elbow of the hockey stick was 1941.5
for Jerry, 1943 by me, and 1945 by AUTOBOX.
The SIMPLE regression of INVDEX on GNP for those years produced
well-behaved residuals in normality, independence, and
homoscedasticity, and a fit of
INVDEX = -197.51 + 0.018234 * GNP
Multiple R-sq = 0.9726, MSE = 317.96.
All very impressive and highly statistically significant.
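The reality check previewed in LESSON 3 rests on PREDICTION
intervals: for simple regression, the interval for a new
observation at x0 is yhat +/- t * s * sqrt(1 + 1/n + (x0-xbar)^2/Sxx).
A sketch, with 2.0 as a rough stand-in for the Student-t quantile
and 23 made-up points in place of the 1943-1965 data (the line
1 + 0.5x and the noise level are both invented for illustration):

```python
import numpy as np

def prediction_interval(x, y, x0, t_crit=2.0):
    """Approximate 95% prediction interval for a NEW observation at
    x0, from the simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))   # residual SD
    half = t_crit * s * np.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx)
    yhat = b0 + b1 * x0
    return yhat - half, yhat + half

# Hypothetical data: 23 points on a known line plus noise.
rng = np.random.default_rng(7)
x = np.linspace(100.0, 200.0, 23)
y = 1.0 + 0.5 * x + rng.normal(0.0, 5.0, size=23)
lo, hi = prediction_interval(x, y, x0=210.0)
print(lo < 106.0 < hi)  # interval covers the true line value at x0
```

The width of that interval, in the units the prediction is actually
used in, is what decides PRACTICAL significance, no matter how
glorious the R-squared looks.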
It would have been Utopia for the social scientists who are
accustomed to getting excited over results with R^2 in the 0.4
range.
We now move on to LESSON 3 (later today, after some sleep) on the
Reality Check of the PRACTICAL significance of this result vs its
STATISTICAL significance, which is so high it borders on the
obscene. ;)
Spurious? Tune in later for the Reality Check in LESSON 3.