Fitting Data to a Multiple Regression Model: A Challenge


Reef Fish

Jun 27, 2005, 1:57:44 AM
On June 19, Robin Edwards wrote, regarding a data set in SPSS
I discussed, involving "model building",

RE> Totally lacking access to anything to with SPSS, I would
RE> be very pleased to be told how I might get hold of that
RE> data set. Sounds like an interesting and instructional
RE> exercise to try analysing it from a standing start.

On June 21, I wrote

RF> I don't see many eager volunteers to help on providing the
RF> data. Let me try to see if that data is STILL used in the
RF> current SPSS Manual.

Not getting any response about the data set, I wrote on June 22,

"I just remembered now, that I review the book by Cox and Snell,
"Applied Statistics, Principles and Practice", in JASA (1984,
229-231) in which I re-analyzed their most detailed analysis in
the book, a multiple regression model building example, and
showed that as carefully as they showed many things THEY
considered that most other analysts would have missed, that
THEIR analysis was still too cut-and-dry. I produced several
alternative models (based on their data) in my review that
proved superior to their final model by significant PRACTICAL
significance margins as well as statistical significance. "

"JASA is widely accessible. Perhaps you or other readers may be
interested in taking a look at THAT example while I give it
another couple of days for anyone to help out with the SPSS data
before I type it MYSELF, and challenge everyone who thinks they
are good at "model building" to try their hands on it. "


I don't think any help is forthcoming, so here's the data, from
the 1975 SPSS Manual:

INVDEX Investors Index 1940 = 100
GNP Gross National Product (scaled)
C.PROF Corporate Profits before taxes
C.DIVD Corporate Dividends paid

INVDEX GNP C.PROF C.DIVD

76.4 7678 269 216 (1935)
99.5 8022 351 251 (1936)
105.9 8820 403 250
86.7 8871 362 290
83.7 9536 541 304
70.7 10911 619 317
61.7 12486 801 273
58.7 14816 917 243
76.3 15357 882 233
76.6 15927 858 211
91 15552 852 195
105.8 15251 966 230
96.8 15446 1008 286
102.8 15735 908 240
100 16343 851 278
120.3 17471 1065 361
153.8 18547 1034 300
158.2 20027 1081 296
146.5 20794 1089 287
165.6 20186 953 282
212.7 21920 1206 321
245.9 23811 1313 340
236 24117 1202 364
218.8 24397 1242 371
242.6 25242 1378 388
256.9 15849 1295 397
326.1 25615 1314 436
314.4 28287 1422 470
336 29740 1525 511
394 31650 1718 583
433.1 33814 1836 629 (1965)
408.5 35822 1762 655 (1966)
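
For anyone who wants to play along in R, here is a minimal
loading sketch (it assumes you save the four data columns above,
header row included but without the parenthetical year labels, to
a file named "invdex.txt"; the file and data frame names are only
illustrative):

d <- read.table("invdex.txt", header = TRUE)  # INVDEX GNP C.PROF C.DIVD
d$YEAR <- 1935:1966                           # rows are consecutive years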


The example was used to illustrate how to use the Multiple
Regression Procedure to fit INVDEX to three predictor variables
(GNP, C.PROF, and C.DIVD) and what outputs are given. The output
showed EVERYTHING to be highly statistically significant (anyone
who runs this data set will notice that immediately).

It turned out that the highly statistically significant results
are highly DECEPTIVE!

There are many hidden and important LESSONS behind the ROUTINE
fitting of this data set (let's say for prediction purposes).

This would be a TYPICAL problem in "model building" using multiple
regression methods -- that's all the hint I am going to give now.

If you think you fitting the data as given would yield a "good
result" (as was given in the SPSS Manual), because the R^2 is
well over 0.9, or whatever else that impressed you about the
fit, you deserve to FLUNK any course in "model building" OR
"data analysis"! Not even a low D!

So, the challenge is to do SOMETHING right! There is no "best"
answer or model for the data, but some are clearly far superior
to others -- that's the usual result in almost all problems of
fitting models to data. For this EXERCISE, there is NO need to
attempt to explain anything in terms of cause or influence.
Just a matter of FITTING, and using the model for prediction
intervals at various values of the predictors.


So, there's something to be done besides stuffing the data into
the SPSS package and getting the standard multiple regression results.

Let's see what discoveries anyone makes, what models they find,
or what points they raise in this exercise.

-- Bob.

Anon.

Jun 27, 2005, 3:59:54 AM
<snip data>

>
> The example was used to illustrate how to use the Multiple
> Regression Procedure to fit INVDEX to three predictor variables
> (GNP, C.PROF, and C.DIVD) and what outputs are given. The output
> showed EVERYTHING to be highly statistically significant (anyone
> who runs this data set will notice that immediately).
>
Actually, this is not necessarily true. The third model I tried was
this (output from R):

> reg3=lm(NVDEX~GNP+C.DIVD+C.PROF)
> anova(reg3)
Analysis of Variance Table

Response: NVDEX
Df Sum Sq Mean Sq F value Pr(>F)
GNP 1 323110 323110 318.7243 < 2.2e-16 ***
C.DIVD 1 35447 35447 34.9658 2.312e-06 ***
C.PROF 1 3 3 0.0028 0.9582
Residuals 28 28385 1014
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova() adds terms to the ANOVA table sequentially, so the order of the
terms is important. I don't know how the SS's are calculated in SPSS.
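
A quick way to see that order dependence -- a sketch, assuming the
variables sit in a data frame d as in the original post:

fit1 <- lm(INVDEX ~ GNP + C.DIVD + C.PROF, data = d)
fit2 <- lm(INVDEX ~ C.PROF + GNP + C.DIVD, data = d)
anova(fit1)  # C.PROF enters last: its sequential SS is tiny
anova(fit2)  # C.PROF enters first: its sequential SS is large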

Actually, I think that both Reef Fish and I are right, but finding out
why we get different results is educational in itself, so I'm not going
to give it away. It's obvious once you spot it.

Bob

Jerry Dallal

Jun 27, 2005, 5:09:15 AM

While I'm no economist, I have to ask whether the 1960 value is a typo.

--Jerry

Russell...@wdn.com

Jun 27, 2005, 9:19:26 AM

I'm no economist either (and would be highly insulted to be
called one ;-) ), but I think the GNP figure should be 25849,
not 15849.

Cheers,
Russell

Reef Fish

Jun 27, 2005, 10:06:51 AM

LESSON #1. ALWAYS examine the data for gross (and not so gross)
anomalies.

Jerry and Russell are making a good start toward their Fish
University "A".

Anon Bob O'Hara is maintaining the solid "F" he had earned
in this ng.


The 15849 is of course an obvious typo, not by ME (it took me
about 10 minutes to type the data, about 30 minutes to write
a multiple regression program in SPEAKEASY because I have NO
access to any statistical package; and at least an hour to
find and correct the half a dozen or so typos of MINE <G> by
checking against the results I had done 30 years ago). The
typos were in the 1975 SPSS Manual!

The TYPO was what contributed to all THREE variables being
statistically significant in the SPSS Manual -- without it ...
that's the next chapter/Lesson. :-)

In this case, the typo was so obvious that Jerry and Russell
spotted it without doing any analysis. But many gross
outliers and anomalies are found only during graphical
displays or analyses of the data.


The early spotting of the gross typo took us to first base --
there were TWO typos in the 1975 SPSS Manual. For subsequent
analysis, change the 1960 and 1961 values of GNP as follows:

15849 ----> 25849 (good guess by Russell)
25615 ----> 26515 (common transposition error, "56" for "65")

The corrected values were used by me in my Data Analysis
Lecture Notes since 1975. The corrections were taken from
the 1972 SPSS Manual.

NOW that we have the corrected data set, we can start the model
building exercise.
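
In R terms, the repair is a one-liner per typo (a sketch, assuming
the data frame d with the YEAR column added earlier):

d$GNP[d$YEAR == 1960] <- 25849  # was 15849 in the 1975 Manual
d$GNP[d$YEAR == 1961] <- 26515  # was 25615 ("56" for "65")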

-- Bob.

Jerry Dallal

Jun 27, 2005, 10:20:55 AM
Reef Fish wrote:

> In this case, the typo was so obvious that Jerry and Russell
> spotted it without doing any analysis. But many gross
> outliers and anomalies are found only during graphical
> displays or analyses of the data.

Actually, I spotted it by graphing the data, which I consider analysis.
You can rouse any of my students from a sound sleep, shine a bright
light in his/her eyes, and shout at the top of your lungs, "What's the
first thing you do with a set of data?" Without missing a beat, s/he'll
mumble, "Display it..." and go back to sleep.

--Jerry

Jerry Dallal

Jun 27, 2005, 10:55:23 AM
Reef Fish wrote:

> NOW that we have the corrected data set, we can start the model
> building exercise.

Bob, would you double check CDIVD for 1950? It seems out of whack with
the GNP. The GNP looks smooth over time. The CDIVD spikes in 1950.

Thanks!

Reef Fish

Jun 27, 2005, 11:01:48 AM

Good for you, Jerry!

That's the Corollary to Lesson 1. :-)

You must not get much sleep, because I didn't post the data until
1:57 am! That was why I didn't think you'd had time to look at any
display of the data by 5 am.

Here's another SPSS anecdote. When I was at the GSB, U. Chicago, in
1970, I had a basement office across the hall from a Ph.D. grad
student in Behavioral Sciences, completing his thesis: Bill Ouchi.

http://www.williamouchi.com/
http://www.absoluteastronomy.com/encyclopedia/W/Wi/William_Ouchi.htm

Bill was struggling for over 2 MONTHS over some SPSS results he
couldn't understand, and came to me for help. The FIRST thing I
asked him was to show me his DATA -- NOT any of his SPSS regression
results.

It turned out that 2 MONTHS of his life went down the drain because
he had mispunched (or mis-specified the input format of) some
numbers in the programs he ran. :-)

Bill is smart enough that he soon had chaired Professorships, first
at Stanford, and later at UCLA. If anyone ever runs across Bill,
I am sure he'll remember what happened, and won't mind my telling
this anecdote about his SPSS runs. :-)

-- Bob.

Reef Fish

Jun 27, 2005, 11:18:02 AM

It could well be, but that's unimportant in our model building EXERCISE.

The data I gave are without typos (of mine) from the 1975 SPSS Manual.
I did not look at the C.DIVD data nearly as carefully, because that
variable turned out to be "not needed" (I am giving part of the
solution/future Lessons away now <G> by that remark).

I haven't looked at any SPSS Manual (or had any copy) since about the
mid 1980s. But if someone can check and correct it, that would be fine
with me.

It will not alter any of my discussions to come, on the problem based
on the data ACTUALLY given, or some further minor corrections to it.

-- Bob.

Russell...@wdn.com

Jun 27, 2005, 11:19:17 AM
Yes, but that could be real. I'm no more CEO than economist, but
there are reasons beyond pure macro economics for dividends
to vary. What I want to know is why the heading says the Investors
Index is defined as 100 in 1940, but in fact it is 1949 when it
equals 100.

Cheers,
Russell
--
All too often the study of data requires care.

Jerry Dallal

Jun 27, 2005, 11:50:05 AM
Russell...@wdn.com wrote:
> Yes, but that could be real. I'm no more CEO than economist, but
> there are reasons beyond pure macro economics for dividends
> to vary. What I want to know is why the heading says the Investors
> Index is defined as 100 in 1940, but in fact it is 1949 when it
> equals 100.
>
> Cheers,
> Russell


Yes, but plot CDIVD, GNP, and YEAR against each other and you'll see why
I asked. I can't recall anything that would make '50 special. It
wasn't a presidential election year. The Korean War started in June,
but it lasted 3 years. It might be tied to something to do with it
being 5 years after the end of WWII, but that's stretching. When in
doubt, I always ask.

Unfortunately, not yet having retired, I've got a couple of projects to
work on today. Whatever else these data are, they are short time
series. I'm not sure how Bob is proposing we deal with them. Are we to
"pretend" they are 32 iidrvs as though it were a typical multiple
regression problem? The pre-1950 data make me especially anxious about
that approach. Take a look at what's going on in the scatterplots pre-1950.
http://www.tufts.edu/~gdallal/invdex.jpg
http://www.tufts.edu/~gdallal/3d.jpg

I'll try to get back to this tonight, although I suspect others will
pick the data apart long before then.

Anon.

Jun 27, 2005, 11:55:15 AM
Jerry Dallal wrote:
> Reef Fish wrote:
>
>> In this case, the typo was so obvious that Jerry and Russell
>> spotted it without doing any analysis. But many gross
>> outliers and anomalies are found only during graphical
>> displays or analyses of the data.
>
>
> Actually, I spotted it by graphing the data, which I consider analysis.

Me too. That's also the technique that led me to the typo, and
to what the basic problem is with the data. I decided that it's not
worth progressing further without seeking expert advice from an
economist: I don't even know how the response variable is arrived at
(oh, and I assume that there is a typo in the explanation as well).

Bob

Reef Fish

Jun 27, 2005, 12:12:25 PM

Russell...@wdn.com wrote:
> Yes, but that could be real. I'm no more CEO than economist, but
> there are reasons beyond pure macro economics for dividends
> to vary. What I want to know is why the heading says the Investors
> Index is defined as 100 in 1940, but in fact it is 1949 when it
> equals 100.

You'll do well as a "copy editor" for journals and books. They are
the ones who pick out MY typos, grammatic errors, and other faux
pas on the English language. :)

*I* claim that TYPO error! :-)

SPSS Manuals did give the correct definition, and it was in my notes,
that 1949 = 100. The key "9" was too close to "0" for my fat finger.

In any event, that typo is inconsequential in treating this as a
mere EXERCISE with given data. I discount all ECONOMIC and
CORPORATE substance in the data, because there are many valid
concerns on those subjects that cannot be usefully discussed
relative to the data SPSS used for Multiple Regression.


Given the above, are you ready to do some STATISTICAL analysis and
data-fitting/model-building?

-- Bob.

Jerry Dallal

Jun 27, 2005, 12:19:21 PM
Here are a couple of loess smoothers. One for all of the data, another
for pre-1950 to provide additional detail.
http://www.tufts.edu/~gdallal/loess_all.jpg
http://www.tufts.edu/~gdallal/loess_pre1950.jpg
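
A hedged R sketch of the same kind of smooth (it assumes the smooths
are of INVDEX against YEAR in the corrected data frame d):

plot(INVDEX ~ YEAR, data = d)
lines(loess.smooth(d$YEAR, d$INVDEX), col = "red")       # all years
pre <- subset(d, YEAR < 1950)
lines(loess.smooth(pre$YEAR, pre$INVDEX), col = "blue")  # pre-1950 detail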

Reef Fish

Jun 27, 2005, 12:36:20 PM

Anon Bob O'Hara. Your grade of "F" was earned in Fish University,
spotting the error notwithstanding, precise because of your paragraph
above AS WELL AS showing the regression results from "R" even after
you've discovered what was an OBVIOUS typo/blunder in the data.

If you had done any TENTATIVE analysis (such as removing the one
row with the obvious data blunder), it would have been appropriate
-- all you had to do was ASK, as Jerry and Russell did, instead
of plunging right into the GARBAGE pool.

I'll temporarily withdraw your "F", and would welcome your further
analysis based on the CORRECTED data.


That's actually my greatest criticism of BATCH statistical packages
such as SPSS and SAS which may spew out 5 to 10 pages of output of
one PROC after another, all of which may have been invalidated by
the result of the FIRST graphical display!

That's the advantage of an "interactive" software which does one
small task at a time, to accomplish what George Box (JASA article
on Science and Statistics) and my Statistical Encyclopedia
article on "Innteractive Data Analysis" talk about, in terms of
the ITERATIVE process of "model building".

As soon as SOMETHING is detected to require a change of course
or further examination, NO FURTHER RESULT should be printed that
would prove to be inappropriate given the finding during the
iterative process.


-- Bob.

Reef Fish

Jun 27, 2005, 12:48:29 PM

Jerry Dallal wrote:
> Russell...@wdn.com wrote:
> > Yes, but that could be real. I'm no more CEO than economist, but
> > there are reasons beyond pure macro economics for dividends
> > to vary. What I want to know is why the heading says the Investors
> > Index is defined as 100 in 1940, but in fact it is 1949 when it
> > equals 100.
> >
> > Cheers,
> > Russell
>
>
> Yes, but plot CDIVD, GNP, and YEAR against each other and you'll see why
> I asked. I can't recall anything that would make '50 special. It
> wasn't a presidential election year. The Korean War started in June,
> but it lasted 3 years. It might be tied to something to do with it
> being 5 years after the end of WWII, but that's stretching. When in
> doubt, I always ask.

So far, so good.


>
> Unfortunately, not yet having retired, I've got a couple of projects to
> work on today.

You ASSUMED incorrectly that someone who has "retired" from academia
(out of DISGUST of those who sold their souls to the Devil) do not
have (more important than yours) projects to work on today. ;-)


> Whatever else these data are, they are short time
> series. I'm not sure how Bob is proposing we deal with them. Are we to
> "pretend" they are 32 iidrvs as though it were a typical multiple
> regression problem?

Yes, for the EXERCISE, as I had indicated in my reply to Russell,
in the same post you are following up on, because it was done in
SPSS as a Multiple Regression example.


> The pre-1950 data make me especially anxious about
> that approach. Take a look at what's going on in the scatterplots pre-1950.
> http://www.tufts.edu/~gdallal/invdex.jpg
> http://www.tufts.edu/~gdallal/3d.jpg

It's all deja vu. :-) But you are GIVEN a dataset to FIT a
regression model, to predict INVDEX. The question is "what CAN
you do" and not "how much can I complain" about the given DATA,
which are (reasonably) assumed to be CORRECT.

>
> I'll try to get back to this tonight, although I suspect others will
> pick the data apart long before then.

Don't bother to pick the DATA apart. Do some DATA fitting, using
MODEL BUILDING methods, with the data as given.

-- Bob.

Russell...@wdn.com

Jun 27, 2005, 12:55:51 PM
If you're grading me I'd actually want to think about what I'm
doing :-) and while it may look like I have nothing else to do,
I'm trying to stomp out bugs in a computer program by day
and renovate my house by night, so I'll just have to kibitz
about typos.

Cheers,
Russell

Jerry Dallal

Jun 27, 2005, 12:58:36 PM
Reef Fish wrote:

>
> Jerry Dallal wrote:
>
>>Whatever else these data are, they are short time
>>series. I'm not sure how Bob is proposing we deal with them. Are we to
>>"pretend" they are 32 iidrvs as though it were a typical multiple
>>regression problem?
>
>
> Yes, for the EXERCISE, as I had indicated in my reply to Russell,
> in the same post you are following up on, because it was done in
> SPSS as a Multiple Regression example.
>

In that case, better to label the variables A,B,C,D,... especially when
the requested analysis might not be appropriate given the labels. Just
because SPSS analyzed the data by using multiple regression doesn't mean
that such an analysis is the right way to go. Is YEAR one of the
predictors? Then, I'll be ready to do some model building.

Reef Fish

Jun 27, 2005, 1:18:36 PM

Jerry Dallal wrote:
> Reef Fish wrote:
> >
> > Jerry Dallal wrote:
> >
> >>Whatever else these data are, they are short time
> >>series. I'm not sure how Bob is proposing we deal with them. Are we to
> >>"pretend" they are 32 iidrvs as though it were a typical multiple
> >>regression problem?
> >
> >
> > Yes, for the EXERCISE, as I had indicated in my reply to Russell,
> > in the same post you are following up on, because it was done in
> > SPSS as a Multiple Regression example.
> >
>
> In that case, better to label the variables A,B,C,D,... especially when
> the requested analysis might not be appropriate given the labels.

You may, if you wish. But the labels were in SPSS's example, and the
data do come with those labels.


> Just
> because SPSS analyzed the data by using multiple regression doesn't mean
> that such an analysis is the right way to go.

That's certainly correct. After the model-fitting EXERCISE, I'll
be more than happy to discuss WHY the SPSS type of regression
analysis is NOT the right way to go -- no ifs or buts about it.

But it certainly furnishes a nice example of DATA for a regression
EXERCISE, to contrast with what SPSS didn't do right! We have already
found ONE thing -- whoever did the example for the SPSS Manual
certainly did not EXAMINE the data.

There are many more generic LESSONS to come, in model-building
relative to regression analyses.


> Is YEAR one of the
> predictors? Then, I'll be ready to do some model building.

That's part of the INFO (auxiliary) given in SPSS. I am giving
everyone the FULL DISCLOSURE. You can use it (or not use it) in
WHATEVER way you deem appropriate. :-) It was NOT used in
the SPSS example either as a predictor or auxiliary information.

That's all PART of the consideration in ANY Data Analysis project!

-- Bob.

Jerry Dallal

Jun 27, 2005, 1:15:22 PM
FWIW, blindly throwing everything, including year, into a multiple
linear regression equation (no interactions, no transformations) gives
an RMS of 622. Using GNP alone, a linear-linear spline with a knot at
13770 gives an RMS of 335.
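
A sketch of both fits in R (assuming the corrected data frame d;
whether the numbers match exactly depends on the precise
parameterization):

kitchen <- lm(INVDEX ~ YEAR + GNP + C.PROF + C.DIVD, data = d)
spline1 <- lm(INVDEX ~ GNP + pmax(GNP - 13770, 0), data = d)
summary(kitchen)$sigma^2  # residual mean square, compare with 622
summary(spline1)$sigma^2  # compare with 335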

Reef Fish

Jun 27, 2005, 1:32:16 PM

Russell...@wdn.com wrote:
> If you're grading me I'd actually want to think about what I'm
> doing :-)

That's tacitly assumed to be true of ANY data analyst worth
even a little bit of salt, whether he is graded by the Fish
University or not. :-)

> and while it may look like I have nothing else to do,
> I'm trying to stomp out bugs in a computer program by day
> and renovate my house by night, so I'll just have to kibitz
> about typos.

Forget about the typos. See my further remarks (latest, right
before this post) to Jerry Dallal about how to view the data
as GIVEN, and use any or all INFO given in the SPSS Manual in
the model-building EXERCISE, just as a discussion of what
SPSS (or anyone looking at that data from SPSS in a multiple
regression) SHOULD have done.

BTW, except for the "F"s, I wouldn't be so insensitive or
presumptious to give letter grades to others.

So, nearly everyone is SAFE in that regard. I hope EVERYONE
who seriously tried to do some model building of this data
as an EXERCISE will learn some valuable LESSONS that may not
have occured to them before.

Some of what *I* have to say can only be found in *MY* Data
Analysis Lecture Notes -- that anyone will be welcome to
criticize or challenge. So, I am EXPOSING myself to GRADES
or attacks given by all the Quacks and Malpractioners (they
wouldn't know enough to contribute anything), AS WELL AS
(and more likely) challenges by those who are competent
on the subject of model-fitting in regression analysis, with
real DATA (what SOME in this ng may have never seen. LOL)

-- Bob.

bdmccu...@drexel.edu

Jun 27, 2005, 2:10:54 PM
Without even graphing the data, I can see that they are time
trending, and probably have unit roots. That suggests
"spurious regression". To analyze these data, therefore,
we'll have to see whether any of the variables are cointegrated
(what Granger won his Nobel for in 2003). If so, they
will have to be analyzed via cointegration or error-correction
methods and, if not, will have to be differenced to avoid
spurious regression.
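
A hedged sketch of that check in R (it assumes the tseries package,
whose adf.test() is an augmented Dickey-Fuller test, and the
corrected data frame d):

library(tseries)
adf.test(d$INVDEX)        # large p-value is consistent with a unit root
adf.test(diff(d$INVDEX))  # differencing usually removes it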

Art Kendall

Jun 27, 2005, 3:11:13 PM
This is also an example of why it is critical either to (double-enter
and verify) or to (enter and proofread) data whenever possible.

Art
A...@DrKendall.org
Social Research Consultants
University Park, MD USA
(301) 864-5570

Reef Fish

Jun 27, 2005, 3:57:44 PM

bdmccu...@drexel.edu wrote:
> Without even graphing the data, I can see that they are time
> trending, and probably have unit roots. That suggests
> "spurious regression". To analyze these data, therefore,
> we'll have to see whether any of the variables are cointegrated
> (what Granger won his Nobel for in 2003).

Granger won his Nobel in 2003 for THAT?

I know the Economics Nobel has really been scratching the bottom
of the barrel for winners, but if that's what Granger won his
Nobel prize for, they could have given me the Nobel prize 30 years
ago. :-)

I take it back!! They (the Nobel Committee) COULDN'T because
Nobel (for his wife's "affair" with a mathematician) had made
sure that no mathematician or statistician could win a Nobel
Prize (for the lack of such a category).


> If so, they
> will have to be analyzed via cointegration or error-correction
> methods and, if not, will have to be differenced to avoid
> spurious regression.

Your comment about spurious correlation on time series is a
WELL KNOWN fact, but a good one to point out. It'll come into
play in subsequent LESSONS that go with this EXERCISE.

But a Nobel prize? You jest! That's KID STUFF for ANY
applied statistician who knows anything about data analysis.

Thanks for your comment all the same. The Nobel mention was
definitely grossly exaggerated. :-)

-- Bob.

Russell...@wdn.com

Jun 27, 2005, 4:16:25 PM
http://nobelprize.org/economics/laureates/2003/index.html
"for methods of analyzing economic time series with common
trends (cointegration)"

Don't get me started on the Nobel Prize for Economics...

Cheers,
Russell

Torkel Franzen

Jun 27, 2005, 4:27:31 PM
"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> I take it back!! They (the Nobel Committee) COULDN'T because

There is no "Nobel Committee"

> Nobel (for his wife's "affair" with a mathematician) had made

Nobel never married.

Russell...@wdn.com

Jun 27, 2005, 4:31:34 PM
This is what I get for even looking that up. I just saw that
the website lists the prize as "The Bank of Sweden Prize
in Economic Sciences in Memory of Alfred Nobel".
Economic *Sciences*! That's an oxymoron given the
way economics is presently practiced in far too many
cases. OTOH the Voodoo Economics of the Reagan
administration was tautological. Usually I have 12 months
to get over my aggravation over the previous prize, but
now I have this intervening aggravation. The Bank of
Sweden couldn't think of anything better to do with its
money?!

Cheers,
Russell

Reef Fish

Jun 27, 2005, 4:58:22 PM

Torkel Franzen wrote:
> "Reef Fish" <Large_Nass...@Yahoo.com> writes:
>
> > I take it back!! They (the Nobel Committee) COULDN'T because
>
> There is no "Nobel Committee"

There are certainly Nobel award committees. You think they came
from random drawing as in the Reader's Digest sweepstakes? :-)

>
> > Nobel (for his wife's "affair" with a mathematician) had made
>
> Nobel never married.

That may explain it. It must've been Nobel's "wife to be" who
ran away with a mathematician instead of marrying him.

-- Bob.

Reef Fish

Jun 27, 2005, 5:05:47 PM

Come on, Russell, start on it. :-)

Always interested in hearing others' versions about it. Nothing
personally against any of the Nobel Prize winners in Economics.
Several of them are even people I know personally. These
include about half a dozen of my former colleagues at the
University of Chicago.

But the Prize itself and its recent selections were a joke!

-- Bob.

Torkel Franzen

Jun 27, 2005, 5:12:54 PM
"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> There are certainly Nobel award committees. You think they came
> from random drawing as in the Reader's Digest sweepstakes? :-)

The prizes are awarded by different bodies.

> That may explain it. It must've been Nobel's "wife to be" who
> ran away with a mathematician instead of marrying him.

The whole story is pure invention.

Reef Fish

Jun 27, 2005, 7:49:27 PM

If I haven't overlooked any posts of substance in this thread, over
6 hours have passed since your post and no one has offered any
tentative or definitive fitting model.


Let's just say you're on the right GENERAL track.


You can't eat RMS and you can't drink it.

Here's a suggestion for you (or anyone else) to try -- to get a
taste of PRACTICAL significance vs STATISTICAL significance, and
get a Reality Check of how your or any other fitted model fare,
in prediction.

HOLD OUT the data for 1966.


Use the data prior to 1966 to build the fitting/predicting model.

Now get a Prediction Interval for the INVDEX for 1966 with the
value(s) of the predictor variables for that same year to
assess how well (or how badly) it did.

You may also devise some kind of non-standard and non-textbook
measure for predictive performance, such as the average width
of prediction intervals for several rows of data close to the
row to be predicted.


I am looking forward to SOME actual models, their statistical
results and significance measures, as well as some comments
and discussion about how the model was arrived at, and how
they perform.

-- Bob.

Anon.

Jun 28, 2005, 1:09:07 AM
Reef Fish wrote:
>
> bdmccu...@drexel.edu wrote:
>
>>Without even graphing the data, I can see that they are time
>>trending, and probably have unit roots. That suggests
>>"spurious regression". To analyze these data, therefore,
>>we'll have to see whether any of the variables are cointegrated
>>(what Granger won his Nobel for in 2003).
>
>
> Granger won his Nobel in 2003 for THAT?
>
> I know the Economics Nobel has really been scratching the bottom
> of the barrel for winners, but if that's what Granger won his
> Nobel prize for, they could have given me the Nobel prize 30 years
> ago. :-)
>
> I take it back!! They (the Nobel Committee) COULDN'T because
> Nobel (for his wife's "affair" with a mathematician) had made
> sure that no mathematician or statistician could win a Nobel
> Prize (for the lack of such a category).
>
The "Nobel" prize for economics isn't an actual Nobel prize: it was
first awarded in 1969. It's actually called "The Bank of Sweden Prize
in Economic Sciences in Memory of Alfred Nobel".

Source: http://nobelprize.org/economics/

Bob

Russell...@wdn.com

Jun 28, 2005, 9:23:20 AM

True, as I pointed out earlier, but it is popularly known
as "the Nobel Prize in Economics" (at least in the U.S.), as
when the news anchor on TV says, "The Nobel Prize in Economics
was awarded today to an economist for a piece of work that
bears no relation to reality." (OK, they don't really say the
last part, but IMO in most cases they should. True, all models
are wrong, some are useful. But in economics more are wrong in
more ways and less useful than in just about any "science" with
which I am familiar.) Another example of the bias of the
liberal media distorting the truth, I guess. ;-)

Cheers,
Russell

Torkel Franzen

Jun 28, 2005, 9:29:07 AM
Russell...@wdn.com writes:

> (OK, they don't really say the
> last part, but IMO in most cases they should. True, all models
> are wrong, some are useful. But in economics more are wrong in
> more ways and less useful than in just about any "science" with
> which I am familiar.)

In Sweden, it has been argued that the association of the economics
prize with the name of Nobel is unfortunate, and can be expected to
devalue the proper Nobel prizes.

Russell...@wdn.com

Jun 28, 2005, 9:35:48 AM
Swedes are an intelligent, thoughtful people.

Cheers,
Russell

G Robin Edwards

Jun 28, 2005, 3:44:48 PM
In article <1119851864.7...@g14g2000cwa.googlegroups.com>,
Reef Fish <Large_Nass...@Yahoo.com> wrote:
> On June 19, Robin Edwards wrote, regarding a data set in SPSS
> I discussed, involving "model building",

Many thanks for providing this data set, Bob. Seems like I started
quite a hare with my simple request!

I have snipped all the comments, the data and everything. It has been
much repeated in the other postings.

As I wrote a few days ago I have had a first look at the data, and I
kept a log of my operations. I should point out that I deliberately
avoided reading any replies to Bob's post before doing anything with
the data. My journal, below, thus knows nothing of all the words
posted after RF's data arrived. I've now read the ones I downloaded
yesterday evening (27 June). As I write it is Tuesday evening, 28
June, and I have not downloaded anything today.

Here's my journal:-

*********************************

Data provided by Bob on 27 June 05

I shall look at this before reading others' posts.

1. Scan (eyeball) the data. No missing values. Good! Clearly time
series, so could mean trouble.

2 Notice that it is reminiscent of the famous Longley data set.

3 Import into 1st (my stats software).

4 Run naive multiple regression. Note that the software produces a
warning message: "Very possibly correlated independent variables.
Check regression diagnostics." Thus I viewed the initial run as of
doubtful value.

5 So, computed regression diagnostics. Warnings about very high
multiple correlation coeffs and the equivalent variance inflation
factors. GNP and C.PROF have VIFs of 14.3 and 12.2, so one of them is
effectively redundant as a potential predictor. C.DIVD has a VIF of
3.64 in this company (see the VIF sketch after this journal). Note
that Row 26 is a highly influential point ("HAT" value 2.14, with
next highest 1.199). Looks like an "outlier".

6 Look at Row 26. Ha ha! There it is. 15849. Clearly a typo.
Should be 25849.

7 Repair data set.

8 Repeat operations with new data. Similar (but of course different)
diagnostics on VIFs. Row 26 is no longer influential. The most
influential points are 31 and 32.

9 Look at INVDEX as a time series by Cusum plot. Noted possible steps
at 1950, 1954, 1960 and possibly 1963.

10 Generate multi-plot of all four variables plus Year (10 plots on
one diagram). This shows clearly that the predictor most likely to be
useful as a model for INVDEX is C.DIVD. The other two (closely
similar, as noted in the regression diagnostics) have a "hook" or,
dare I say it, a hockey-stick shape when INVDEX is plotted against
them. Year gives a similar
but less angular plot.

11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq 0.8736,
t value for C.DIVD 14.65. Forecast at 337.75 (the mean of C.DIVD)
is 176.9; the L and U 95% interval for a further single point is
94.46 to 259.4. For C.DIVD = 700 (a reasonable extrapolation) the
values are 400.8, 494.3 and 587.9. The regression plot looks fine.

12 Try a regression with C.DIVD and GNP. Adj R-Sq 0.93585 - looks
good! But is it? The forecast at the means of GNP and C.DIVD gives
118.2, 176.9 and 235.7, noticeably better than the simple regrn.
Now try C.DIVD 700 with GNP at its mean of 19314. Values are 262.86
(lower 95%), forecast 348.7 and upper 95% 434.6. These are nonsense!
No doubt the reason is the very high multiple correlations,
of which I've had warnings. Haven't tried C.PROF, but the result will
be almost exactly like the model with GNP in it. Good results very
close to the mean values and meaningless forecasts elsewhere.

13 My current choice for the best model is just

INVDEX = -119 + 0.87621*C.DIVD.

I'll do a bit more thinking about this, but can't hold out much hope of
an improvement. Maybe inspiration or advice will come from someone.

I'll post this and then have a look at all the other contributions, to
see where I've gone wrong.

**************************
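
(A sketch of the VIF check from step 5 of the journal, for anyone
replaying it in R -- it assumes the car package, whose vif() works
on an lm fit, and the corrected data frame d:)

library(car)
fit <- lm(INVDEX ~ GNP + C.PROF + C.DIVD, data = d)
vif(fit)  # values much above ~10 flag near-redundant predictors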

That's what I wrote as I went along with the analyses earlier this
evening.

So, you can start shooting.

I should point out that I'm not a statistician - a mere long retired
industrial chemist, who came across stats via the experiment design
route, in 1956, from a book by Brownlee "Industrial Experimentation"
which was written to help industry during WW2. Looked dry as dust -
especially to someone who is no natural mathematician, but I liked the
notions of ANOVA and fractional factorials. Thought they might save me
some work!

I've looked at the original postings - very interesting!

Now to send this and download all the newer postings.

Cheers, Robin


Reef Fish

Jun 28, 2005, 4:29:40 PM

G Robin Edwards wrote:
> In article <1119851864.7...@g14g2000cwa.googlegroups.com>,
> Reef Fish <Large_Nass...@Yahoo.com> wrote:
> > On June 19, Robin Edwards wrote, regarding a data set in SPSS
> > I discussed, involving "model building",
>
> Many thanks for providing this data set, Bob. Seems like I started
> quite a hare with my simple request!
>
> I have snipped all the comments, the data and everything. It has been
> much repeated in the other postings.
>
> As I wrote a few days ago I have had a first look at the data, and I
> kept a log of my operations. I should point out that I deliberately
> avoided reading any replies to Bob's post before doing anything with
> the data.

Excellent!!

I was somewhat worried about the unfair influence (good or bad) of
others. I tried to refrain from saying much, but what others have
said about what they found is hard to ignore unless you don't read them.


> My journal, below, thus knows nothing of all the words
> posted after RF's data arrived. I've now read the ones I downloaded
> yesterday evening (27 June). As I write it is Tuesday evening, 28
> June, and I have not downloaded anything today.

Nothing happened today (in non-time-series terms) except your post
now. I'll withhold comments until I get something more from Jerry
Dallal (I hope he'll find time to add more to what he has already
done) and anyone else.

>
> Here's my journal:-
>
> *********************************
>
> Data provided by Bob on 27 June 05
>
> I shall look at this before reading others' posts.
>
> 1. Scan (eyeball) the data. No missing values. Good! Clearly time
> series, so could mean trouble.

Both good observations.


>
> 2 Notice that it is reminiscent of the famous Longley data set.

Not in the multicollinearity sense. BTW, here's a "side lesson":

Highly correlated independent variables (such as r > .9) does
not NECESSARILY imply collinearity problems. On the other
hand you MAY have a singular correlation matrix even if ALL
of the pairwise correlations are < .2 say.

>
> 3 Import into 1st (my stats software).
>
> 4 Run naive multiple regression. Note that the software produces a
> warning message. "Very possibly correlated independent variables.
> Check regression diagnostics." Thus viewed the inital run as of
> doubtful value.

See my "side lesson" above. You PROGRAM may be issuing FALSE to
MISLEADING warnings. ANOTHER abuse of the "correlation coeff". :-)

The software must examine the EIGENVALUES of X'X to correctly
detect multicollinearity conditions/problems!!
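
The computation isn't spelled out here; one standard version is a
sketch like this in R (eigenvalues of the crossproduct of the
standardized predictors, i.e. of the correlation structure):

X <- scale(d[, c("GNP", "C.PROF", "C.DIVD")])
ev <- eigen(crossprod(X))$values
sqrt(max(ev) / min(ev))  # condition index; large values signal trouble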

I'll stop my comments here. Will resume with the rest of your
analysis/results at the conclusion of the Million $ Challenge. :-)

Thanks for your effort and interest. I think almost everyone
will learn SOMETHING from it. The misunderstanding of the
detection and effects of multicollinearity ranks among the
highest "regression abuses" I know (next to the "expected
sign" fallacy).

Stay tuned under the LESSON 2 thread.

-- Bob.

Jerry Dallal

Jun 28, 2005, 10:10:07 PM
Given that this is a time series, it's hard to get worked up about an
inappropriate analysis, so I'll stop with

invdex = 181.9 - 0.016 gnp
         + 0.028 I(gnp>=13770)*(gnp-13770)
         + 0.114 cprof + 56.7 I(year=1961)
where I(x)=1 if x is true and 0 otherwise

ResMS = 203.5

Unless I've made a typo transcribing. The ResMS is correct, though.

The residual plot gives the impression that the variability is larger
for larger predicted values.
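
A hedged reconstruction of that fit in R (assuming the corrected
data frame d; the coefficients should come out near the ones quoted
above):

fit <- lm(INVDEX ~ GNP + pmax(GNP - 13770, 0) + C.PROF
          + I(YEAR == 1961), data = d)
summary(fit)$sigma^2  # residual mean square, compare with 203.5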

Reef Fish

Jun 28, 2005, 11:39:41 PM

It doesn't appear there'll be any more entries. I suspect Jerry
is either too busy, or his knot-and-spline software lacks a feature
for getting prediction intervals or PRACTICAL significance
assessments.

So, I'll go ahead and finish commenting on your analysis here,
then continue my "lessons" without Jerry. He can always do more
after I show what I did (30 years ago), which was a better fit than
his.


>
> > 5 So, computed regression diagnostics. Warnings about very high
> > multiple correlation coeffs and the equivalent variance inflation
> > factors. GNP and C.PROF have VIFs of 14.3 and 12.2, so one of them is
> > effectively redundant as a potential predictor. C.DIVD has VIF of 3.64
> > in this company. Note that Row 26 is a highly influential point ("HAT"
> > value 2.14, with next highest 1.199). Looks like an "outlier".
> >
> > 6 Look at Row 26. Ha ha! There it is. 15849. Clearly a typo.
> > Should be 25849.
> >
> > 7 Repair data set.

You were a bit late here. I had corrected that value at about 10 am
the same morning I posted the original data at 1:59 am. So, you must
have stopped looking as soon as you saw my original data.

In any event, data examination and graphical displays should have
been done first, before doing any computations such as correlations.

> >
> > 8 Repeat operations with new data. Similar (but of course different)
> > diagnostics on VIFs. Row 26 is no longer influential. The most
> > influential points are 31 and 32.

If you had done some scatter plots of INVDEX vs the other variables,
you would have notice the obvious "elbow" Jerry found, and the same
elbow AUTOBOX found using time-series methods. See LESSON 2 for
details.

> >
> > 9 Look at INVDEX as a time series by Cusum plot. Noted possible steps
> > at 1950, 1954, 1960 and possibly 1963.
> >
> > 10 Generate multi-plot of all four variables plus Year (10 plots on
> > one diagram). This shows clearly that the predictor most likely to be
> > useful as a model for INVDEX is C.DIVD. The other two (closely similar
> > as noted in
> > the regression diagnostics) have a "hook", or, dare I say it, a hockey
> > stick, shape when INVDEX is plotted against them. Year gives a similar
> > but less angular plot.

Ah, you DID notice the "hockey stick" which I called the "elbow", but
you let the golden goose walk by and grabbed the quacking duck
instead! :-) Again, see my continuation of LESSON 2.

> >
> > 11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq 0.8736, t
> > value for C.DIV 14.65
> > Forecast for 337.75 (Mean of C.DIV) of 176.9, L and U 95% interval for
> > a further single point is
> > 94.46 to 259.4. For C.DIV = 700 (a reasonable extrapolation) values
> > are 400.8, 494.3 and 587.9. The regression plot looks fine.
> >
> > 12 Try a regression with C.DIV and GNP. Adj R-Sq 0.93585 - looks
> > good! But is it? Forecast value for mean of GNP and C.DIVD gives
> > 118.2, 176.9 and 235.7, noticeably better than the simple regrn.
> > Now try C.DIVD 700 with GNP at its mean of 19314. Values are 262.86
> > (lower 95%), forecast 348.7 and upper 95% 434.6. These are nonsense!
> > No doubt the reason is the very high multiple correlations,
> > of which I've had warnings. Haven't tried C.Prof, but the result will
> > be almost exactly like the model with GNP in it. Good results very
> > close to the mean values and meaningless forecasts elsewhere.

These are nice exploratory steps. Unfortunately, you did not take
advantage of the "golden goose" and ended with this model:

> >
> > 13 My current choice for the best model is just
> >
> > INVDEX = -119 + 0.87621*C.DIVD.

This was based on all 32 observations, and would have yielded an
MSE of 1582, which is almost 5 times the MSE (or RMS) of 335 Jerry
got with GNP alone as the predictor, which was also about HALF the
RMS of 622 of the SPSS-like multiple regression model with the
kitchen sink thrown in.

> >
> > I'll do a bit more thinking about this, but can't hold out much hope of
> > an improvement. Maybe inspiration or advice will come from someone.
> >
> > I'll post this and then have a look at all the other contributions, to
> > see where I've gone wrong.
> >
> > **************************
> >
> > That's what I wrote as I went along with the analyses earlier this
> > evening.

Very nicely documented. It helps others see your thought process, as
well as where and how you missed the boat, so to speak, when you read
my LESSON 2.

> >
> > So, you can start shooting.

Sorry, the golden goose already walked away. :-)

> >
> > I should point out that I'm not a statistician - a mere long retired
> > industrial chemist, who came across stats via the experiment design

You certainly showed much better insight and thoughtfulness in your
exploratory step than MOST "applied statisticians" would have done,
sort of like the SPSS Manual example -- "Garbage IN, Garbage Out".
They might even get busy discussing whether the SIGN of one of the
coefficients is right or not. :-)


> > route, in 1956, from a book by Brownlee "Industrial Experimentation"
> > which was written to help industry during WW2. Looked dry as dust -
> > especially to someone who is no natural mathematician, but I liked the
> > notions of ANOVA and fractional factorials. Thought they might save me
> > some work!
> >
> > I've looked at the original postings - very interesting!
> >
> > Now to send this and download all the newer postings.
> >
> > Cheers, Robin

Data analysis and model-building are things that always have UNIQUE
features in every data set, and only those trained to look out for
them and take advantage of any "golden goose" they see during the
iterative process can consistently do well.

Thanks to your voluntary participation (at the risk of being shot),
I believe you've contributed more than you realized, to help OTHERS
think more about what THEY might do the next time they get hold
of ANY multiple regression data set: life is much more
interesting and fruitful than just throwing all the variables into
a large-scale model and looking only at correlations and coefficient
signs.

Now you, or any reader who is following this EXERCISE, may continue
reading my continuation of the LESSON 2 thread, on this same data set.

-- Bob.

Reef Fish

Jun 28, 2005, 11:46:44 PM

Jerry Dallal wrote:
> Given that this is a time series, it's hard to get worked up about an
> inappropriate analysis, so I'll stop with
>
> invdex = 181.9 - 0.016 gnp +
> + 0.028 I(gnp>=13770)*(gnp-13770)
> + 0.114 cprof + 56.7 I(year=1961)
> where I(x)=1 if x is true and 0 otherwise
>
> ResMS = 203.5

I was about to give up on you (when I was typing my follow-up to
Robin's post in this thread).


>
> Unless I've made a typo transcribing. The ResMS is correct, though.
>
> The residual plot gives the impression that the variability is larger
> for larger predicted values.

Now I have to take a look at your model before continuing with the
LESSON 2 thread. Meanwhile, if it's not time consuming, you may
want to give a PREDICTION interval for the held-out last row of
data (1966) pretending that you're trying to predict the INDVEX
fot that year using your model and the GNP value of 35822 for that
year.

-- Bob.

Reef Fish

Jun 29, 2005, 12:12:40 AM

Jerry Dallal wrote:
> Given that this is a time series, it's hard to get worked up about an
> inappropriate analysis, so I'll stop with
>
> invdex = 181.9 - 0.016 gnp +
> + 0.028 I(gnp>=13770)*(gnp-13770)
> + 0.114 cprof + 56.7 I(year=1961)
> where I(x)=1 if x is true and 0 otherwise
>
> ResMS = 203.5

It didn't take long for me to have absorbed all your wisdom above.
So, I'll interpret it for the readers, comment on it, and will be
ready to finish LESSON 2, and later lessons right after this.

I can't resist my mock horror, "But your SIGN of GNP is WRONG,
Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will
drop when the GNP value RISES?" :-)

Multiple Regression "expected sign" abusers take note!


Your model is basically taking advantage of the "hockey plug" in
GNP vs year (which you took the "elbow" to be between 1941 and
1942, estimating the "knot" point to be 13770 for GNP. You used
CPROF as the 2nd indep. variable, and then you saved a bundle of
SSE by adjusting for ONE value of the fitted function at 1961,
to drop the residual from 56.7 to 0 (I presume), thus lowering
the SSE by 56.7^2 or a rather substantial 3,214.89. :-)

That's why your RMS is so much smaller than your previous 335.
I consider that "cheating" (in a non-criminal way <G>) because if
you play that game by setting residuals to 0 at will, you can
drop the RMS even further, but the model will hardly be a good
or valid one for prediction purposes. It's more like OVER-FITTING.

Let's move over to the continuation of the LESSON 2 thread.

-- Bob.

Jerry Dallal

Jun 29, 2005, 10:20:52 AM
Reef Fish wrote:
>
> Jerry Dallal wrote:
>
>>Given that this is a time series, it's hard to get worked up about an
>>inappropriate analysis, so I'll stop with
>>
>>invdex = 181.9 - 0.016 gnp +
>> + 0.028 I(gnp>=13770)*(gnp-13770)
>> + 0.114 cprof + 56.7 I(year=1961)
>> where I(x)=1 if x is true and 0 otherwise
>>
>>ResMS = 203.5
>
>
> It didn't take long for me to have absorbed all your wisdom above.
> So, I'll interpret it for the readers, comment on it, and will be
> ready to finish LESSON 2, and later lessons right after this.
>
> I can't resist my mock horror, "But your SIGN of GNP is WRONG,
> Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will
> drop when the GNP value RISES?" :-)

That's easy. It's a corollary to "Never interpret main effects in the
presence of an interaction!"

Even more shocking is the decision to constrain the first part of the
spline to be horizontal!

[copied from another post in the same thread. Quotation from Reef Fish:
"Then it occurred to me that it made sense for the relation to be
nearly horizontal during the WWII era and then both variables
were growing strong in a linear fit in the post-war years."]

*Forcing* the sign?!?! The HORROR! :-)

> Multiple Regression "expected sign" abusers take note!

Indeed!

>
> Your model is basically taking advantage of the "hockey plug" in
> GNP vs year (which you took the "elbow" to be between 1941 and
> 1942, estimating the "knot" point to be 13770 for GNP. You used
> CPROF as the 2nd indep. variable, and then you saved a bundle of
> SSE by adjusting for ONE value of the fitted function at 1961,
> to drop the residual from 56.7 to 0 (I presume), thus lowering
> the SSE by 56.7^2 or a rather substantial 3,214.89. :-)
>
> That's why your RMS is so much smaller than your previous 335.
> I consider that "cheating" (in a non-criminal way <G>) because if
> you play that game by setting residuals to 0 at will, you can
> drop the RMS even further, but the model will hardly be a good
> or valid one for prediction purposes. It's more like OVER-FITTING.

RSS, yes, RMS, not necessarily. In any analysis I do, 1961 is an
outlier. If year 'x' isn't squirrelly, the contribution of I(x) will be
negligible. Fitting I(x) is equivalent to setting an observation aside
based on its externally Studentized residual. Granted, probability
theory guarantees that there will be some large externally Studentized
residual, but the one for 1961 is huge. I(1961) has a t statistic of
3.74 and an observed significance level of 0.00091, which survives even
a Bonferroni adjustment (32*0.00091 = 0.02912). While this might be
overfitting for data from biological units, these are US national level
economic data. 1961 sticks out like a sore thumb.
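
A sketch of that outlier check in R (assuming the spline fit
without the 1961 indicator, as above):

base <- lm(INVDEX ~ GNP + pmax(GNP - 13770, 0) + C.PROF, data = d)
r <- rstudent(base)  # externally Studentized residuals
p <- 2 * pt(-abs(r), df = df.residual(base) - 1)
sort(pmin(1, 32 * p))[1:3]  # Bonferroni-adjusted, smallest first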

G Robin Edwards

Jun 29, 2005, 6:43:09 PM
In article <1120016381.1...@g14g2000cwa.googlegroups.com>,
Reef Fish <Large_Nass...@Yahoo.com> wrote:

Does this apply to multiple correlations? All I'm trying to do is to
make reasonably sure that the data do not suffer from the effects of
multi-collinearity. Thus a "warning" is issued if the software thinks
there's a possibility of strong multiple correlation, something that I
think might not be noticed in pairwise plots in some cases.


> >
> > The software must examine the EIGENVALUES of X'X to correctly
> > detect multicollinearity conditions/problems!!

Can you explain to me how this approach might relate (if at all) to
computing Cholesky inverse roots? I wrote this software in about 1977
to run on a Commodore PET (32K memory for programs and data) and I
can't remember the details of the technique, or even why I thought it
might be useful :-(

You must remember that I do not have broadband. I switch on my machine
once each evening, and download at about 4k characters/second. I'm on
line for perhaps 2 - 4 minutes, so it is far from real time!

> In any event data examination and graphical displays should have been
> done first, before doing any computation such as correlations.

> > >
> > > 8 Repeat operations with new data. Similar (but of course
> > > different) diagnostics on VIFs. Row 26 is no longer influential.
> > > The most influential points are 31 and 32.

> If you had done some scatter plots of INVDEX vs the other variables,
> you would have notice the obvious "elbow" Jerry found, and the same
> elbow AUTOBOX found using time-series methods. See LESSON 2 for
> details.

> > >
> > > 9 Look at INVDEX as a time series by Cusum plot. Noted possible
> > > steps at 1950, 1954, 1960 and possibly 1963.
> > >
> > > 10 Generate multi-plot of all four variables plus Year (10 plots
> > > on one diagram). This shows clearly that the predictor most
> > > likely to be useful as a model for INVDEX is C.DIVD. The other
> > > two (closely similar as noted in the regression diagnostics) have
> > > a "hook", or, dare I say it, a hockey stick, shape when INVDEX is
> > > plotted against them. Year gives a similar but less angular plot.

> Ah, you DID notice the "hockey stick" which I called the "elbow", but
> you let the golden goose walk by and grabbed the quacking duck
> instead! :-) Again, see my continuation of LESSON 2.

I always plot data in various ways :-) But running linear regressions
is such a simple step (specifying a model takes moments and the
calculations a second or so for ordinary-size data sets) that it might
as well be done. One never knows what might turn up.

> > >
> > > 11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq
> > > 0.8736, t value for C.DIV 14.65 Forecast for 337.75 (Mean of
> > > C.DIV) of 176.9, L and U 95% interval for a further single point
> > > is 94.46 to 259.4. For C.DIV = 700 (a reasonable extrapolation)
> > > values are 400.8, 494.3 and 587.9. The regression plot looks
> > > fine.
> > >
> > > 12 Try a regression with C.DIV and GNP. Adj R-Sq 0.93585 -
> > > looks good! But is it? Forecast value for mean of GNP and
> > > C.DIVD gives 118.2, 176.9 and 235.7, noticeably better than the
> > > simple regrn. Now try C.DIVD 700 with GNP at its mean of 19314.
> > > Values are 262.86 (lower 95%), forecast 348.7 and upper 95%
> > > 434.6. These are nonsense! No doubt the reason is the very high
> > > multiple correlations, of which I've had warnings. Haven't tried
> > > C.Prof, but the result will be almost exactly like the model with
> > > GNP in it. Good results very close to the mean values and
> > > meaningless forecasts elsewhere.

> These are nice exploratory steps. Unfortunately, you did not take
> advantage of the "golden goose" and ended with this model:

> > >
> > > 13 My current choice for the best model is just
> > >
> > > INVDEX = -119 + 0.87621*C.DIVD.

> This was based on all 32 observations, and would have yield a
> MSE of 1582,

Agreed


> which is almost 5 times the MSE (or RMS) of 335 Jerry
> got with GNP alone as the predictor, which was also about HALF the
> RMS of 622 of the SPSS-like multiple regression model with the
> kitchen sink thrown in.

Also agreed

I look forward to Lesson 2.

However, I worry a bit about inferential statistics when the fitted
model has been arrived at by a stepwise programme driven by the outcome
of preliminary analyses which lead to modification of the originally
proposed model. I had always thought that technical specification of
the model should precede any analytical work. Is this correct?


Reef Fish

Jun 29, 2005, 2:40:43 PM

Jerry Dallal wrote:
> Reef Fish wrote:
> >
> > Jerry Dallal wrote:
> >
> >>Given that this is a time series, it's hard to get worked up about an
> >>inappropriate analysis, so I'll stop with
> >>
> >>invdex = 181.9 - 0.016 gnp +
> >> + 0.028 I(gnp>=13770)*(gnp-13770)
> >> + 0.114 cprof + 56.7 I(year=1961)
> >> where I(x)=1 if x is true and 0 otherwise
> >>
> >>ResMS = 203.5
> >
> >
> > It didn't take long for me to have absorbed all your wisdom above.
> > So, I'll interpret it for the readers, comment on it, and will be
> > ready to finish LESSON 2, and later lessons right after this.
> >
> > I can't resist my mock horror, "But your SIGN of GNP is WRONG,
> > Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will
> > drop when the GNP value RISES?" :-)
>
> That's easy. It's a corollary to "Never interpret main effects in the
> presence of an interaction!"

For the sake of not getting off the present topic, I'll leave your
statement alone, though it didn't quite technically or precisely
apply to the example in question. :)


> Even more shocking is the decision to constrain the first part of the
> spline to be horizontal!
>
> [copied from another post in the same thread. Quotation from Reef Fish:
> "Then it occurred to me that it made sense for the relation to be
> nearly horizontal during the WWII era and then both variables
> were growing strong in a linear fit in the post-war years."]
>
> *Forcing* the sign?!?! The HORROR! :-)

Actually that's not quite so in MY case; perhaps more so in YOURS,
for forcing a horizontal spline.

In MY case, I was merely THROWING away historical data that are
justifiably thrown away (BTW, deleting ANY observation, let
alone more than one, is serious business that MUST be justified
by something other than "it didn't fit") because it made economic
as well as common sense that there are two different relations in
the pre-war/war and post-war eras. And even if it DIDN'T make any
strong sense, it made sense (in such a time series) to discard
data from the DISTANT past, when the object is to predict the
most recent present.

>
> > Multiple Regression "expected sign" abusers take note!
>
> Indeed!
>
> >
> > Your model is basically taking advantage of the "hockey stick" in
> > GNP vs year (you took the "elbow" to be between 1941 and 1942,
> > estimating the "knot" point to be 13770 for GNP). You used
> > CPROF as the 2nd indep. variable, and then you saved a bundle of
> > SSE by adjusting for ONE value of the fitted function at 1961,
> > to drop the residual from 56.7 to 0 (I presume), thus lowering
> > the SSE by 56.7^2 or a rather substantial 3,214.89. :-)
> >
> > That's why your RMS is so much smaller than your previous 335.
> > I consider that "cheating" (in a non-criminal way <G>) because if
> > you play that game by setting residuals to 0 at will, you can
> > drop the RMS even further, but the model will hardly be a good
> > or valid one for prediction purposes. It's more like OVER-FITTING.
>
> RSS, yes, RMS, not necessarily.

NOW we can get down to some concrete discussion of NUMBERS. Not
necessarily, yes. But VERY easily so.

For my argument and illustration, I'll have to do a bit of detective
work since I don't know any of your residuals except the 56.7 one.

But I can infer, from your previous result of MSE = 335 with GNP
alone, and the location of the spline and knot value (on the entire
32 observations, hence 28 df?) that your SSE was 9380.

The setting of ONE residual of 56.7 to zero (and nothing else)
would have reduced the SSE to 6166 and the MSE to 228 on 27 df.

Your actual model (for which I assume 25 df, though it doesn't
really matter if it's one or two more, or fewer) would imply
SSE = 5088, given your MSE of 203.5.

If you "fix" another residual of 30 by setting it to zero, you
would have reducedd the SSE df to 24, but reduced the MSE to 182.

"Fix" a third residual of size 30, you would have reduced the MSE
to 157! And so on.
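If you want to check that arithmetic, a few lines of Python suffice
(the starting SSE and df are the values inferred above; charging one
df per fixed residual is my accounting assumption):

def fix_residuals(sse, df, fixed):
    # Each residual "fixed" by an indicator removes its square from the
    # SSE and costs one degree of freedom.
    for r in fixed:
        sse -= r ** 2
        df -= 1
        print(f"SSE {sse:7.1f}  df {df:2d}  MSE {sse / df:6.1f}")

fix_residuals(sse=5088.0, df=25, fixed=[30.0, 30.0])  # MSE ~174, then ~143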

I believe in general, the fixing of ONE unusually large residual
must be based NOT on the size of the residual, but on REASONS
that explain WHY it's excessively large.

There are many examples of this in the analysis of time series.
For example, the daily national total of fireworks use would
most likely peak on the 4th of July; likewise alcohol
consumption on special events/days of the year.

In this EXERCISE, there was nothing you attached to 1961 other
than having observed a large residual. In the grand scheme of
things, that's at least a faux pas or peccadillo. Those are
the grades James J. Kilpatrick assigned to various misuses of
the English language in his nationally syndicated columns on
"crimes, misdemeanors, faux pas, and peccadillos."

I commit many peccadillos every day! :-)

> In any analysis I do, 1961 is an outlier.

But that's not good enough reason for "deleting" or neutralizing
the negative residual effect for the year 1961.

> If year 'x' isn't squirrelly, the contribution of I(x) will be
> negligible. Fitting I(x) is equivalent to setting an observation aside
> based on its externally Studentized residual. Granted, probability
> theory guarantees that there be some large externally Studentized
> residual, but the one for 1961 is huge. I(1961) has a t statistic of
> 3.74 and an observed significance level of 0.00091, which survives even
> a Bonferroni adjustment, (32*0.00091=0.02912). While this might be
> overfitting for data from biological units, these are US national level
> economic data. 1961 sticks out like a sore thumb.
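That equivalence is easy to verify numerically. Here is a
self-contained sketch on synthetic data (the helper functions are
mine; the identity between the indicator's t statistic and the
externally Studentized residual is the standard one):

import numpy as np

def t_for_indicator(X, y, i):
    # t statistic of a 0/1 indicator for observation i added to the model.
    n = len(y)
    d = np.zeros(n)
    d[i] = 1.0
    Z = np.column_stack([X, d])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    mse = e @ e / (n - Z.shape[1])
    cov = mse * np.linalg.inv(Z.T @ Z)
    return b[-1] / np.sqrt(cov[-1, -1])

def ext_studentized(X, y, i):
    # Externally Studentized residual for observation i.
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y
    s2_i = (e @ e - e[i] ** 2 / (1 - H[i, i])) / (n - p - 1)
    return e[i] / np.sqrt(s2_i * (1 - H[i, i]))

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)
y[7] += 5.0                      # plant one outlier
print(t_for_indicator(X, y, 7))  # the two numbers agree
print(ext_studentized(X, y, 7))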

Now you're talking almost like a social scientist who is happy to
throw away anything that doesn't fit until what's left fits TOO
well. :-)

I prefer your previous model, with a knot and GNP alone, which
resulted in an MSE of 335. I also prefer mine over yours for
reasons of Box's principle of "parsimony", or Occam's razor.

-- Bob.

Jerry Dallal

unread,
Jun 29, 2005, 3:23:20 PM6/29/05
to
I wrote:

> When Hinkley's tests for two-phase regression are performed, the
> observed significance levels for equality of the slopes and for the
> slope of the later part of the data being 0 are <0.0001. The osl for
> the slope of the earlier data being different from 0 is 0.1552.

strike "different from" in the last line. I was stating null
hypotheses, or trying to, anyway.

Jerry Dallal

unread,
Jun 29, 2005, 3:21:49 PM6/29/05
to
Reef Fish wrote:
>
> I prefer your previous model, with a knot and GNP alone, which
> resulted in an MSE of 335. I also prefer mine over yours for
> reasons of Box's principle of "parsimony", or Occam's razor.
>
> -- Bob.
>

I agree, especially about the vicious cycle of setting data aside. I
was just passing the time of day. The labels matter/they don't matter.
It's a time series/treat them as independent observations. It's the
wrong analysis/let's do it anyway. Hard to get worked up over the data
in a situation like this.

When Hinkley's tests for two-phase regression are performed, the
observed significance levels for equality of the slopes and for the
slope of the later part of the data being 0 are <0.0001. The osl for
the slope of the earlier data being different from 0 is 0.1552.
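For readers without Hinkley's paper, a crude grid-search version of
the two-phase fit conveys the idea (an illustrative sketch only, NOT
Hinkley's likelihood procedure; the invdex.dat file name and loading
step are my assumptions):

import numpy as np

# Assumed layout of "invdex.dat": YEAR plus the four columns posted at
# the top of the thread.
year, invdex, gnp, cprof, cdivd = np.loadtxt("invdex.dat", unpack=True)

# Broken-stick regression of INVDEX on GNP: grid-search the knot, fit by OLS.
best = (np.inf, None, None)
for knot in np.linspace(np.quantile(gnp, 0.1), np.quantile(gnp, 0.9), 81):
    X = np.column_stack([np.ones_like(gnp), gnp, np.maximum(gnp - knot, 0.0)])
    b, *_ = np.linalg.lstsq(X, invdex, rcond=None)
    sse = np.sum((invdex - X @ b) ** 2)
    if sse < best[0]:
        best = (sse, knot, b)
sse, knot, b = best
print(f"knot {knot:.0f}; slope before {b[1]:.4f}, after {b[1] + b[2]:.4f}")
# Equal slopes means the hinge coefficient b[2] is 0; a flat early phase
# means b[1] is 0. Their t tests give the flavor of the osl's quoted above.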

1961 was odd enough that I'd go back and check it. It's INVDEX that
appears to be odd, even when plotted against year. It's the
scatterplots more than any statistical quantity that led me to question
it. If the number is what was reported, then that's what it is.

Reef Fish

unread,
Jun 29, 2005, 9:59:01 PM6/29/05
to

G Robin Edwards wrote:
> In article <1120016381.1...@g14g2000cwa.googlegroups.com>,
> Reef Fish <Large_Nass...@Yahoo.com> wrote:

< GIGANTIC snip to get to the technical points in question >
>
Robin>> 4 Run naive multiple regression. Note that the software
> > > > produces a warning message. "Very possibly correlated
> > > > independent variables. Check regression diagnostics." Thus
> > > > viewed the initial run as of doubtful value.
> > >
> > > See my "side lesson" above. Your PROGRAM may be issuing FALSE or
> > > MISLEADING warnings. ANOTHER abuse of the "correlation coeff". :-)
>
> Does this apply to multiple correlations? All I'm trying to do is to
> make reasonably sure that the data do not suffer from the effects of
> multi-collinearity. Thus a "warning" is issued if the software thinks
> there's a possibility of strong multiple correlation, something that I
> think might not be noticed in pairwise plots in some cases.

There are two separate and unrelated points here.

1. Pairwise correlations among independent variables tell NOTHING
   about multicollinearity, unless one of them is ridiculously
   high, say .99999. Even if ALL of the pairwise correlations are
   less than .2, say, you can not only have multicollinearity
   problems, but the X'X matrix MAY even be "singular" and the
   regression coefficients indeterminable.

2. The MULTIPLE R (or R^2) is completely different. A high R is
   a GOOD thing, because it is the simple correlation between
   the observed Y and the fitted Y! The higher the R, the better
   the fit.

That's why I said your SOFTWARE may have been written by someone
who doesn't know the statistical theory and issued false and/or
erroneous warnings, as indicated by your descriptions.
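Point 1 above is easy to demonstrate with a contrived example
(entirely synthetic, built just for illustration): thirty independent
predictors plus their average show no pairwise correlation beyond
about 0.2, yet X'X is exactly singular.

import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 30))                     # 30 independent predictors
X = np.hstack([Z, Z.mean(axis=1, keepdims=True)])  # plus an exact linear combo
C = np.corrcoef(X, rowvar=False)
print(np.abs(C - np.eye(31)).max())    # largest pairwise |r| is only about 0.2
print(np.linalg.eigvalsh(X.T @ X).min())  # smallest eigenvalue: essentially 0

The eigenvalues (or the condition number built from them) catch what
the pairwise correlations cannot.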

>
> > > The software must examine the EIGENVALUES of X'X to correctly
> > > detect multicollinearity conditions/problems!!
>
> Can you explain to me how this approach might relate (if at all) to
> computing Cholesky inverse roots? I wrote this software in about 1977
> to run on a Commodore PET (32K memory for programs and data) and I
> can't remember the details of the technique, or even why I thought it
> might be useful :-(

The Cholesky decomposition is ONE of MANY matrix factorizations;
it is used for inverting X'X and solving the normal equations,
though it is not itself an eigen-decomposition, and the
eigenvalues are what you want for detecting near-singularity.

It is tangential to the statistical interpretation of the
regression results. A numerical analysis book or a book on
Statistical Computing would likely address the question much more
adequately than I can or want to do it here.
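One connection worth noting (my gloss, not from any reference): a
Cholesky factorization of X'X exists only when X'X is positive
definite, so its failure is itself a collinearity alarm.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])            # a rank-deficient "X'X"
try:
    np.linalg.cholesky(A)
except np.linalg.LinAlgError as err:
    print("Cholesky failed:", err)    # not positive definite: collinearity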


> > Now you, or any reader who is following this EXERCISE, may continue
> > reading my continuation of the LESSON 2 thread, on this same data set.
>
> I look forward to Lesson 2.
>
> However, I worry a bit about inferential statistics when the fitted
> model has been arrived at by a stepwise programme driven by the outcome
> of preliminary analyses which lead to modification of the originally
> proposed model. I had always thought that technical specification of
> the model should precede any analytical work. Is this correct?

That is essentially correct, and your worry is appropriate. ANY kind
of "automatic" selection method ignores the probability model
assumptions (or their violations) and only searches for the best "fit".

Stepwise-type regression is a cheap way to get some PLAUSIBLE
candidate models on the basis of "fit" only. Once found, the
analyst must examine the residuals as carefully as they would
in a "manual" iterative process.

Often, the "best" fitting models have to be abandoned in favor of
worse fitting but more appropriate models from a residual analysis
point of view.
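A toy version of that workflow in plain numpy (everything here, from
the greedy criterion to the helper name forward_select, is my own
illustration, not any package's routine):

import numpy as np

def forward_select(X, y, names):
    # Greedy forward selection on residual mean square -- the "fit only"
    # step. The residual checks afterwards are NOT automated here.
    chosen, pool, best_mse = [], list(range(X.shape[1])), np.inf
    while pool:
        scores = []
        for j in pool:
            cols = chosen + [j]
            Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
            b, *_ = np.linalg.lstsq(Z, y, rcond=None)
            sse = np.sum((y - Z @ b) ** 2)
            scores.append((sse / (len(y) - Z.shape[1]), j))
        mse, j = min(scores)
        if mse >= best_mse:
            break
        best_mse = mse
        pool.remove(j)
        chosen.append(j)
    return [names[c] for c in chosen]

# After fitting the chosen model, check the residuals, e.g. Durbin-Watson:
#   dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)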

Note that in LESSON 3, I indicated, before I exhibited the final
model, that I had gone through all of the normality, independence
and homoscedasticity tests on the residuals.

-- Bob.

Abe Kohen

unread,
Jun 30, 2005, 9:38:53 PM6/30/05
to
"Torkel Franzen" <tor...@sm.luth.se> wrote in message
news:vcb7jge...@beta19.sm.ltu.se...

You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman
and Samuelson are not Nobel quality?

IMHO, it is the Nobel Peace Prize which devalues all others.

Abe


Reef Fish

unread,
Jul 1, 2005, 3:25:12 AM7/1/05
to

Nash, if he should win anything at all, should have won it in
Mathematics. But we all know that Math and Stat were SCREWED out
of the Nobel categories. ;^)

Nash won it for being schizo AND being crazy at the right time,
when the ECON category had already run out of worthy candidates
YEARS ago and had started the tradition of finding excuses for
giving the prize to NON-economists. Simon is another example.
There are other notable examples in recent years.

There were MANY more qualified Mathematicians for a Nobel Prize
(the Fields Medal winners, e.g., a coveted Mathematics prize
which Nash never won), had they not been SCREWED by Nobel himself. :-)

So there!


And why didn't you mention Miller, another NON-economist, who
won it on the strength of having been Modigliani's TEACHER,
and the co-author of the Miller-Modigliani theorem? :)

Friedman? He should have won it LONG before he actually did.
His theory was directly opposed to Samuelson's, and since the
"banking committee" was obviously pro-Samuelson and anti-Friedman,
they kept him OFF the Nobel prize as long as they could, though
it was already obvious to EVERYONE who knew anything about
ECONOMICS that Uncle Milty :-) should have won it, perhaps even
before Samuelson did.

Besides, Milton Friedman could have won a Nobel Prize for his
STATISTICAL work, for the years he collaborated with and
SUPERVISED the likes of Fred Mosteller, Jimmie Savage, and other
Nobel-prize-deserving statisticians, in STATISTICAL analyses, had
Statisticians not been SCREWED out of the Nobel category, because
Nobel mis-associated the field with Mathematics.

>
> IMHO, it is the Nobel Peace Prize which devalues all others.
>
> Abe

Are you throwing your hat into the ring for the Nobel Peace
Prize, (Honest) Abe? :-) If it accelerates its devaluation
rate, and you live to be a thousand-year-old man (curse to
medical science <g>), even YOU may have a SHOT at it, (Honest)
Abe.

-- Bob (NOT Hogg; NOT Anon-O'Hara) the Reef Fish.


Torkel Franzen

unread,
Jul 1, 2005, 3:35:14 AM7/1/05
to
"Abe Kohen" <ako...@xenon.stanford.edu> writes:

> You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman
> and Samuelson are not Nobel quality?

I know nothing about these people. The argument, as I have seen it
presented, turned not on the qualities of individual recipients, but on
the nature of the subject.

Torkel Franzen

unread,
Jul 1, 2005, 3:38:21 AM7/1/05
to
"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> Statisticians not be SCREWED out of the Nobel category, because
> Nobel mis-associated the field as Mathematics.

I see that you stick with grit and determination to your tradition
of promoting your favorite fantasies!

Herman Rubin

unread,
Jul 1, 2005, 10:32:08 AM7/1/05
to
In article <1120202705....@g14g2000cwa.googlegroups.com>,
Reef Fish <Large_Nass...@Yahoo.com> wrote:


>Abe Kohen wrote:
>> "Torkel Franzen" <tor...@sm.luth.se> wrote in message
>> news:vcb7jge...@beta19.sm.ltu.se...
>> > Russell...@wdn.com writes:

>> > > (OK, they don't really say the
>> > > last part, but IMO in most cases they should. True, all models
>> > > are wrong, some are useful. But in economics more are wrong in
>> > > more ways and less useful than in just about any "science" with
>> > > which I am familiar.)

>> > In Sweden, it has been argued that the association of the economics
>> > prize with the name of Nobel is unfortunate, and can be expected to
>> > devalue the proper Nobel prizes.

>> You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman
>> and Samuelson are not Nobel quality?

>Nash, if he should win anything at all, should have won it in
>Mathematics. But we all know that Math and Stat were SCREWED out
>of the Nobel categories. ;^)

Stat was not even a topic at the time, and the story about
Math being screwed out has long been discredited.

[Much deleted; I only agree with part of it.]

>> IMHO, it is the Nobel Peace Prize which devalues all others.

>> Abe

>Are you throwing your hat into the ring for the Nobel Peace
>Prize, (Honest) Abe? :-) If it accelerates its devaluation
>rate, and you live to be a thousand-year-old man (curse to
>medical science <g>), even YOU may have a SHOT at it, (Honest)
>Abe.

Considering the people who have been awarded the Peace
Prize, and how little most of them did for peace, and in
fact how much most of them did to hinder the cause of
peace, I have to agree completely with Abe.

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Reef Fish

unread,
Jul 1, 2005, 11:01:01 AM7/1/05
to

What fantasy? About Nobel SCREWING the Mathematicians and
Statisticians?

Then how do you explain why there is NO category of Nobel Prize
for Mathematics OR Statistics? Even if you discount the recent
history of Statistics, Mathematics as a science has far exceeded
all the other sciences in its history of contributions to the
sciences.

Andrew Wiles and Taniyama and Shimura should have shared a Nobel
Prize, had they NOT been screwed by Nobel himself, for slaying
the Grandest Dragon of Mathematics, Fermat's Last Theorem,
which stood unproved for more than 350 years until they came
along!

Wiles was credited with the actual proof, but Wiles would not
have been able to prove it without the Taniyama-Shimura
conjecture, which itself stood unproved for decades.


It would take some 3rd-rate economist to find some trivial or
contrived use of Fermat's Last Theorem, and THEN Wiles would be
awarded the Nobel Prize in ECONOMICS (given by the Swedish
central bank <G>) for some economic nonsense.

Mathematicians are SCREWED by Nobel!

But for YOUR consolation, even if Nobel gave 1000 Prizes to
mathematicians EVERY year, Torkel Franzen would not surface to
the top 100,000 within the next millennium. Trust me! :-)

-- Bob.

Russell...@wdn.com

unread,
Jul 1, 2005, 11:03:51 AM7/1/05
to

As I wrote in another post, don't get me started...

Nash, Harsanyi, Kahneman, yes, and some others (Arrow comes to
my mind), but IMO the good work for which this prize has been
given has been mostly the highly mathematical results which have
broader interest (at least to mathematicians) and applications
beyond economics. Also good work for which the prize has been
given is the work that shows that most of economics, as it is
presently practiced by the dominant school in the subject, is
built on a foundation of sand (again Arrow comes immediately
to mind, along with Kahneman). Often the closer to empirical
the work is, the worse it is, IMO, in terms of pure scientific
value and actual scien