
Jun 28, 2005, 1:00:22 AM


This is the continuation of the thread

"Fitting Data to a Multiple Regression Model. A Challenge."

It all started on June 17, during my clarification of a question

asked by Bruce Weaver, who suggested that I might have

missed part of George Box's Inductive/Deductive Loop, in the

"model building" aspect of George's JASA article "Science and

Statistics". After explaining Box's diagram, I wrote,

RF> For another occasion, I can post an article from my Lecture Note

RF> Chapter on "Statistical significance" vs "Practical Significance"

RF> in which I took a data set from the SPSS Manual in the early 1970s

RF> (p. 359 of the 1975 SPSS Manual, to be precise) in which the

RF> multiple regression results (with three independent variables) were shown

RF> by SPSS to be statistically highly significant in every respect,

RF> but failed to notice that it was completely useless in PRACTICE,

RF> on the basis of prediction intervals.

RF> My lesson was to show that when properly analyzed through the

RF> "model building" process facilitated by the IDA package, a SIMPLE

RF> regression model, using only ONE of the three variables and the

RF> same data given in the manual, did a better job (in every respect)

RF> than the multiple regression model shown in the SPSS Manual.

RF> Sounds astounding? It is. And it was in black and white. I know

RF> some of you are curious about this, but you have to wait for

RF> another time and another place for it.

This apparently caught G. Robin Edwards' attention and interest, when

he posted two days later, on June 19,

RE> Totally lacking access to anything to do with SPSS, I would

RE> be very pleased to be told how I might get hold of that

RE> data set. Sounds like an interesting and instructional

RE> exercise to try analysing it from a standing start.

After a week of unsuccessful pleas for help in getting hold of the

dataset in an easily retrievable form, I manually typed in the dataset

(as I had promised Robin on Day 1 that I would, if I had to),

and the "Model Building" EXERCISE finally got off the ground

on June 27.

Jerry Dallal, three hours after I posted the data at 1:59 AM,

and a few hours later, Russell Martin, had already solved my

intended LESSON 1 -- ALWAYS check the accuracy of the data and

examine them via some graphical displays.

By 10:06 AM June 27, the corrected data was in place:

RF> The early spotting of the gross typo took us to first base --

RF> there were TWO typos in the 1975 SPSS Manual. For subsequent

RF> analysis, change the 1960 and 1961 values of GNP as follows:

RF> 15849 ----> 25849 (good guess by Russell)

RF> 25615 ----> 26515 (common transposition error, "56" for "65")

After a few rounds of further discussion and clarification about

the intent and purposes of the EXERCISE -- treating the data as if they

were cross-sectional data, as illustrated (badly) in the SPSS Manual --

the race was on ... so far with one driver, Jerry

Dallal. :-)

Perhaps prompted by my hint that one variable did a better job than

three (which prompted Robin's initial interest), Jerry quickly

made a giant step forward, by noting:

JD> FWIW, blindly throwing everything, including year, into a

JD> multiple linear regression equation (no interactions, no

JD> transformation) gives an RMS of 622. Using GNP alone, a

JD> linear-linear spline with a knot at 13770 gives an RMS of 335.

which is Jerry's way of saying he could beat the SPSS 3-variable

model with just ONE of the variables too -- a substantial one,

a reduction of MSE from 622 to 335.
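For readers who want to try Jerry's idea themselves: a linear-linear spline with one knot can be fit by ordinary least squares on a "hinge" basis. The sketch below uses synthetic stand-in data (the SPSS Manual values are not reproduced here); only the knot value 13770 comes from Jerry's post.

```python
import numpy as np

def fit_ls(X, y):
    """Ordinary least squares; returns coefficients and residual mean square."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / (len(y) - X.shape[1])

# Synthetic stand-in data with a bend at GNP = 13770 (the knot Jerry reports).
rng = np.random.default_rng(0)
gnp = np.linspace(5000, 30000, 32)
invdex = (10 + 0.001 * gnp + 0.02 * np.maximum(0.0, gnp - 13770)
          + rng.normal(0, 5, 32))

hinge = np.maximum(0.0, gnp - 13770)          # 0 below the knot, linear above
X_line = np.column_stack([np.ones_like(gnp), gnp])
X_spline = np.column_stack([np.ones_like(gnp), gnp, hinge])

_, rms_line = fit_ls(X_line, invdex)
_, rms_spline = fit_ls(X_spline, invdex)
print(rms_spline < rms_line)  # the spline fits the hockey stick far better
```

The hinge column is exactly Jerry's I(gnp>=13770)*(gnp-13770) term, so the spline stays linear on each side of the knot and continuous at it.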

In a sense, Jerry has turned from critic of the SPSS model to a

sponsor of his knot-and-spline model as an obviously "better one",

even without a fancy layout with a triple somersault and a twist.

I gave Jerry a solid endorsement, but with a little prompt for

analysis and discussion toward the next lessons,

RF> You can't eat RMS and you can't drink it.

Let's pause for a moment and reflect on George Box's view of the ITERATIVE

cycles of critic and sponsor in a scientific model-building process.

I am in complete agreement with what George had to say in his 1976

JASA article on "Science and Statistics", and had this to say myself

in my "Interactive Data Analysis" article in the Encyclopedia of

Statistical Sciences:

"At the exploratory stage, the analyst does numerical detective work

via graphic and semigraphic displays, and often does a variety of

tasks in data editing, < ... > At the model building stage, the

analyst acts both as a sponsor and a critic of one or more tentative

models. < ... > At the conclusion of an iterative process of

probing, confirmatory analysis, if it's done at all, is seldom a

major part of the entire analysis." (Volume 6, p. 187)

It's almost 24 hours since I posted the SPSS Manual data (with

two gross errors in it). Jerry has already made much progress along

the line I described in my encyclopedia article, short of sponsoring

an actual model and scrutinizing its performance.

I hope when I continue with this LESSON 2, there will be models

sponsored by OTHERS, before I show the model I sponsored 30 years

ago, before continuing to LESSON 3, the seldom noticed or discussed

topic of PRACTICAL significance vs STATISTICAL significance.

Goodnight. It's been a good day in statistical discussions.

-- Bob.

Jun 28, 2005, 11:17:01 AM


Reef Fish wrote:

> This is the continuation of the thread

>

> "Fitting Data to a Multiple Regression Model. A Challenge."

>

>

> This apparently caught G. Robin Edwards' attention and interest, when

> he posted two days later, on June 19,

>

> RE> Totally lacking access to anything to do with SPSS, I would

> RE> be very pleased to be told how I might get hold of that

> RE> data set. Sounds like an interesting and instructional

> RE> exercise to try analysing it from a standing start.

>

> I hope when I continue with this LESSON 2, there will be models

> sponsored by OTHERS, before I show the model I sponsored 30 years

> ago, before continuing to LESSON 3, the seldom noticed or discussed

> topic of PRACTICAL significance vs STATISTICAL significance.

Robin, having laboriously typed in the data specially for you, I

hesitate to continue any further without getting some input/output

from you.

How are you doing from your standing start, now that you have gotten

some hints from previous posts on the EXERCISE?

-- Bob.

Jun 28, 2005, 1:05:33 PM


All,

What follows is an analysis entitled "Exploratory Time Series

Analysis", with kudos to John Tukey and G.E.P. Box.

When conducting a regression analysis of time series data, one has to

consider a number of relevant issues.

1. Is the mean of the errors from a tentative model zero throughout,

or are there identifiable Pulses, Level Shifts, or Seasonal Pulses,

thus violating the First Commandment? This example of 32 values has a

number of "unusual" values. Unusual values require a determination of

what is usual, or what is signal (the results from the equation).

Routine identification of Interventions immediately points out the

1960 anomaly (point 26 of 32) and a smaller one at point 27.
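A minimal way to illustrate such Intervention detection is to screen standardized residuals from a tentative fit; AUTOBOX's actual procedure is more elaborate than this. The data below are synthetic, with one pulse injected at point 26 as in the example.

```python
import numpy as np

def flag_pulses(y, x, threshold=3.0):
    """Flag indices whose standardized OLS residuals exceed the threshold."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    z = resid / resid.std(ddof=2)       # standardize; 2 params estimated
    return np.flatnonzero(np.abs(z) > threshold)

# Synthetic trend series with a single injected pulse at index 26.
rng = np.random.default_rng(1)
x = np.arange(32, dtype=float)
y = 2.0 * x + rng.normal(0, 1, 32)
y[26] += 15.0

print(flag_pulses(y, x))  # should flag index 26, the injected pulse
```

A flagged point can then be absorbed into the model as a 0/1 pulse regressor, exactly as the I~P00026 and I~P00027 inputs do above.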

2. By pre-whitening, one can use correlative tools to identify

possible relationships between time series. In order to identify a

possible Transfer Function (a regression model with time series data),

we can intelligently pre-select any necessary lags in any of the input

series that might be useful. This led to a model with significant

structure, as follows:

MODEL COMPONENT               LAG      COEFF     STANDARD   P        T
                              # (BOP)            ERROR      VALUE    VALUE

1 CONSTANT                             -111.     12.3       .0000    -9.05
  INPUT SERIES X1  GNP
2 Omega (input) -Factor # 1   0        .149E-01  .282E-02   .0000     5.28
  INPUT SERIES X2  C_PROF
3 Omega (input) -Factor # 2   0        -.141     .473E-01   .0062    -2.98
  INPUT SERIES X3  C_DIVD
4 Omega (input) -Factor # 3   0        .428      .673E-01   .0000     6.35
  INPUT SERIES X4  I~P00026  26 PULSE
5 Omega (input) -Factor # 4   0        145.      32.3       .0001     4.49
  INPUT SERIES X5  I~P00027  27 PULSE
6 Omega (input) -Factor # 5   0        54.8      22.7       .0232     2.41

Number of Residuals (R)       = n           32
Number of Degrees of Freedom  = n-m         26
Sum of Squares                = Sum R**2    15429.9
Variance                      var=SOS/(n)   482.183
R Square                      =             .960124

The R Square includes the effect of the two Pulses and is thus vastly

overstated.

So far ... nothing out of the ordinary.

3. One has to verify that the error variance from the proposed model

is invariant over time. Examination of the need for a power transform

(logs) and a variance stabilization transform (GARCH), and of the

possibility of a structural change in variance, yielded nothing

statistically significant. If one of these tests had come out

positive, then either a Box-Cox approach or a Generalized Least

Squares approach would be used, where the H transformation weights

would be gleaned from the GARCH model, or from the significant break

points detected by the Tsay Variance Change Test.

4. One now must test for transience or structural break points in

parameters over time. Just because 32 values exist does not mean that

all 32 should be used to form the model and/or to estimate appropriate

parameters. By employing an efficient grid search, one can find that

there is a statistically significant difference between 1935-1944 and

1945-1966, suggesting that one should either model the transience in

the parameters or (as we did) simply use the most recent 22 years.

The P value for the Chow F test of the regression coefficients for

1935-1944 vs 1945-1966 was .0000256, so we now continue with an

analysis of the last 22 years.
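The Chow F test used in step 4 can be sketched in a few lines. The data below are synthetic with an obvious break, and the split index and regression form are illustrative only, not the actual AUTOBOX computation.

```python
import numpy as np
from scipy import stats

def chow_test(y, X, split):
    """Chow F test of equal regression coefficients before/after `split`."""
    def ssr(yy, XX):
        beta, *_ = np.linalg.lstsq(XX, yy, rcond=None)
        r = yy - XX @ beta
        return r @ r
    n, k = X.shape
    ssr_pooled = ssr(y, X)                                  # one regime
    ssr_split = ssr(y[:split], X[:split]) + ssr(y[split:], X[split:])
    F = ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))
    p = stats.f.sf(F, k, n - 2 * k)
    return F, p

# Synthetic 32-point series whose slope changes at index 10.
rng = np.random.default_rng(2)
t = np.arange(32, dtype=float)
y = np.where(t < 10, 1.0 * t, 10 + 3.0 * (t - 10)) + rng.normal(0, 1, 32)
X = np.column_stack([np.ones_like(t), t])

F, p = chow_test(y, X, split=10)
print(p < 0.001)  # a break this large is easily detected
```

An "efficient grid search" as described above would simply run this test over candidate split points and pick the most significant one.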

5. The model for the last 22 years is:

MODEL COMPONENT               LAG      COEFF     STANDARD   P        T
                              # (BOP)            ERROR      VALUE    VALUE

1 CONSTANT                             -179.     18.5       .0000    -9.70
  INPUT SERIES X1  C_PROF
2 Omega (input) -Factor # 1   0        .252      .378E-01   .0000     6.66
  INPUT SERIES X2  C_DIVD
3 Omega (input) -Factor # 2   0        .244      .840E-01   .0110     2.90
  INPUT SERIES X3  I~P00017  17 PULSE
4 Omega (input) -Factor # 3   0        68.1      12.0       .0000     5.69
  INPUT SERIES X4  I~P00006   6 PULSE
5 Omega (input) -Factor # 4   0        -56.7     13.0       .0006    -4.36
  INPUT SERIES X5  I~P00003   3 PULSE
6 Omega (input) -Factor # 5   0        -47.6     11.9       .0012    -3.98
  INPUT SERIES X6  I~P00010  10 PULSE
7 Omega (input) -Factor # 6   0        36.1      12.3       .0105     2.92

where time period 17 of the 22 periods is 1961; period 6 reflects 1950;

period 3, 1947; and period 10, 1954; with

an R-Squared (based on 22 data points) of .9885:

Number of Residuals (R)       = n           22
Number of Degrees of Freedom  = n-m         15
Sum of Squares                = Sum R**2    2830.47
Variance                      var=SOS/(n)   128.658
Adjusted Variance             = SOS/(n-m)   188.698
R Square                      =             .988548

Summary:

The modeler needs to be aware that he must safeguard and up-armor the

model in order to meet the Gaussian requirements. I hope this helps,

and I submit it as a show of thanks to ReefFish for attempting to

raise our collective standards for analysis. In the end, the modeler

must reasonably "prove" that the model/parameters are invariant over

time; that the variance of the errors is homogeneous over time; and

that the mean of the errors is zero everywhere -- or at least accept

the hypothesis that these violations are not statistically

significant.

Regards

Dave Reilly

AUTOMATIC FORECASTING SYSTEMS

http://www.autobox.com

P.S. I have posted all the details of this analysis on our web site:

http://www.autobox.com/reeffish

Jun 28, 2005, 2:17:10 PM


da...@autobox.com wrote:

> All ,

Dave, thanks for your submission of your autobox analysis for the

discussion of the model-building aspects of the data set.

I was beginning to get the feeling of giving a party and no one

attended. :-)

>

> entitled " Exploratory Time Series Analysis " with kudos to John Tukey

> and G.E.P. Box

>

> When conducting a regression analysis of time series data one has to

> consider a number of relevant issues.

While I had intended the analysis to be of the cross-sectional

regression type (mostly for comparison to the SPSS Manual example),

the data are more appropriately analyzed as multiple time series, so

I do welcome your TIME SERIES analysis: it'll be a part of

the final discussion on the difference between the two approaches

for this particular data set.

While we are still waiting for some non-time-series analyses to be

submitted, I want to ask you to clarify HOW MUCH of what you've

shown was "automatic" (as suggested by 'autobox', a package with

which I am not familiar), as opposed to "manual" -- choices made by

YOU (or the user) during the process.

E.g., the detection of the two gross outliers in GNP, the choice

of the break point or what segment of the series to analyze, etc.

Some of this might be apparent in the detailed information given

in your webpage -- which I'll examine in greater detail if and

when time permits.

Thanks again for your contribution. I'll address your comments as

well as the results probably at the end of LESSON 3 (or whatever

number), at the conclusion of the SPSS-like (non-time-series)

analysis, before comparing those results to your time-series ones.

-- Bob.

Jun 28, 2005, 3:02:39 PM


Bob et al ..

Totally automatic .... from beginning to end ...

While you are not familiar with AUTOBOX or FreeFore, you were familiar

with its grandfather .. you purchased and used THE PACK PROGRAM when

you were in your former life at that UnNamed University. You and others

might be interested in a 50 year perspective on econometric programs.

http://www.autobox.com/pdfs/50YEARS.PDF and

http://www.autobox.com/pdfs/econometrics.pdf

regards

Dave Reilly

AFS

Jun 28, 2005, 3:16:37 PM


da...@autobox.com wrote:

> Bob et al ..

>

> Totally automatic .... from beginning to end ...

Thanks for the clarification. Very interesting and quite impressive

in certain parts of the automata.

>

> While you are not familiar with AUTOBOX or FreeFore, you were familiar

> with its grandfather .. you purchased and used THE PACK PROGRAM when

> you were in your former life at that UnNamed University. You and others

> might be interested in a 50 year perspective on econometric programs.

I might have known its great-great-great-great grandfather, through

something Harry Roberts might have done, but I have never USED or

PURCHASED any "PACK PROGRAM" at any of my former universities. :-)

Someone at those universities may have purchased (and paid for)

something in my name without my knowledge. As long as someone else

is PAYING, I usually don't mind. :-) Now I am curious as to WHEN

that happened. Send me a private note to my posting address.

No discredit on AUTOBOX or any of its ancestors. Just want to clarify

your comment, for the record. :)

>

> http://www.autobox.com/pdfs/50YEARS.PDF and

> http://www.autobox.com/pdfs/econometrics.pdf

>

> regards

>

> Dave Reilly

> AFS

-- Bob.

Jun 29, 2005, 2:06:15 AM


I started this thread less than 24 hours ago, but thanks to the

participation by Jerry Dallal, Robin Edwards, and Dave Reilly,

I can finish this LESSON and move on to the next one later today!


This LESSON pertains to what I called the "golden goose", in my

follow-up to (and analysis of) Robin Edward's post, found during

the iterative loop of the model-building process.

It's interesting in a way that all FOUR of us found the same

"hockey stick" in the scatterplot of the INVDEX variable vs the

GNP variable. All four of us took DIFFERENT actions!!

In that respect, if we were working as a TEAM, we would put our

heads together on the four TENTATIVE models and decide what model

to sponsor next (if any).

Robin, understandably, let the golden goose slip away, and ended

up with the Daffy Duck -- nevertheless better than the SPSS Manual

result (with the two huge errors in the data unnoticed):

INVDEX = -119 + 0.87621*C.DIVD.

This was based on all 32 observations, and would have yielded an

MSE of 1582, which is almost 5 times the MSE (or RMS) of 335 Jerry

got <in his previous model> with GNP alone as the predictor,

which was about HALF the RMS of 622 of the SPSS-like multiple

regression model <data blunders corrected> with the kitchen

sink thrown in.

Jerry ended with his knotty model, estimating the breakpoint between

1941 and 1942 (between the 7th and 8th row of the data matrix):

invdex = 181.9 - 0.016 gnp
               + 0.028 I(gnp>=13770)*(gnp-13770)
               + 0.114 cprof + 56.7 I(year=1961)

where I(x) = 1 if x is true and 0 otherwise.

AUTOBOX found a breakpoint automatically, as a multiple time-series

model using only the last 22 periods, i.e., starting at 1945 (the 11th

row of the data matrix), using 7 degrees of freedom in the fit:

an R-Squared (based on 22 data points) of .9885

Number of Residuals (R)       = n          22
Number of Degrees of Freedom  = n-m        15
Sum of Squares                = Sum R**2   2830.47

It's time to show my "antique model" of 30 years ago and how I

arrived at it. The "hockey stick" was noticed almost immediately,

in the scatterplot of INVDEX vs GNP. But what to DO with it was

another matter.

Then it occurred to me that it made sense for the relation to be

nearly horizontal during the WWII era, with both variables then

growing strongly in a linear fashion in the post-war years.

Since the object of the fit is PREDICTION, I took

the "simplicity" approach of fitting the data ONLY between the

years 1943-1965, holding out 1966 for validation purposes.

That was IT! The "golden egg" from the golden goose. The egg

I found 30 years ago was about half way between Jerry's and

the AUTOBOX one. The elbow of the hockey stick was 1941.5

for Jerry, 1943 by me, and 1945 by AUTOBOX.

The SIMPLE regression of INVDEX vs GNP for those years produced

well-behaved residuals in normality, independence, and homoscedasticity,

and a fit of

INVDEX = -197.51  + 0.018234 * GNP
         (15.031)   (.000667)
        T = -13.14  T = 27.33
        p-value ~ 10^(-12)

Multiple R-sq = 0.9726, MSE = 317.96.

All very impressive and highly statistically significant.

It would have been Utopia for the social scientists who are

accustomed to getting excited over results with R^2 in the 0.4

range. :-)

We now move on to LESSON 3 (later today, after some sleep) on the

Reality Check of the PRACTICAL significance of this result vs its

STATISTICAL significance, which is so high it borders on the obscene. ;)

Spurious? Tune in later for the Reality Check in LESSON 3.

-- Bob.
