
Tips on Model Building via Multiple Regression. Part I (The Elbow Rule)


Reef Fish

Feb 10, 2006, 5:08:34 PM
Preface.

I really want the subject to read "Useful Tips on Model Building via
Multiple Regression that you rarely (or sometimes never) find in
textbooks on the subject". But that's much too long to use as the
title in the subject field, especially because I intend to write a
series of these tips, labeled as Part I, II, etc. (with keywords
about the specific Tip itself).

While some (or many) of these Useful Tips may be foreign to you in
your Multiple Regression experience in learning and practice, most
of them are NOT my original ideas, but ideas from other eminent
statisticians I've known, and from whom I have learned; and these
Useful Tips are in turn taught by me in various versions of my "Data
Analysis Lecture Notes" (unpublished) which I used as graduate level
textbooks (supplemented by conventional textbooks) in the teaching
of my "Data Analysis" courses since about 1972 until my early
retirement in 1999.

ALL of my graduate students in such courses are familiar with ALL
of these Tips (from non-stat majors; to MBA students; to advanced
Ph.D. students in Statistics), in the course of doing their ACTUAL
analysis of Real Data of THEIR choice (not the boring "textbook
examples" that are usually "pre-whitened" and seldom face any of
the real problems one encounters in practice).

What motivated me to take the time to present some of these
Useful Tips is that after MONTHS of my presentations of the
standard, BASIC, undergrad level material relating to Multiple
Regression, that had been misunderstood or abused by some
posters in these sci.stat.* newsgroups, (and about 10 times the
amount of time replying to invalid "arguments" from a few of the
vocal posters who excel in the malpractice of statistics), I think
we (especially the sci.stat.math group where much of the
preliminary material can be found in the archives) are at a point
in which these Useful Tips can be presented, to fill the gap of
some questions that had not yet been asked -- but clearly
lurking and wanting in the background.

The immediate trigger event was my latest post, prompted by
Bruce Weaver, to clarify and summarize the definitions and
essential points behind the terms "linear dependence",
"collinearity", and "multicollinearity" that are frequently
encountered in Multiple Regression but seldom understood
(fully) even by those who have been doing regression for years.

http://tinyurl.com/co4mp

I made these statements in that post:

"The multicollinearity condition and all its associated ill effects
are caused by the same thing -- nearly REDUNDANT information
in the X space of predictor variables. If one of the X's can
be expressed "almost exactly" as the linear combination of one
or more of the OTHER X's in model to be fitted, just GET RID
of that redundant variable! Or the redundant variables if there
are more than one. "

"That was precisely the "problem" in the Longley data. Keeping
too many redundant predictor variables. By "judiciously" <tm>
dropping the redundancy that caused the multicollinearity
(near singularity) conditions in a regression problem, a stable
solution can always be found."
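
One way to look for the "almost exact" linear combinations described in
the quoted passage is to regress each X on the other X's and look at
the R-squared. The following is only a minimal sketch under stated
assumptions: the function name, the synthetic data and the use of
NumPy are illustrative choices, not anything taken from the original
post.

```python
import numpy as np


def r_squared_on_others(X):
    """R^2 from regressing each column of X on all the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Intercept plus every predictor except column j.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        total = np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 - float(resid @ resid) / total
    return out


# Hypothetical data: x3 is almost exactly x1 + x2, x4 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + x2 + 0.01 * rng.normal(size=50)
x4 = rng.normal(size=50)

r2 = r_squared_on_others(np.column_stack([x1, x2, x3, x4]))
print(np.round(r2, 4))
```

In this sketch all three variables that take part in the near-exact
relation (x1, x2 and x3) show an R-squared close to 1, while x4 does
not. Dropping any one of the three removes the redundancy; which one
to drop is the "judicious" part.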

Sooner or later, some perceptive reader(s), such as Bruce, will
ask one or more questions that were neither answered nor hinted
at: "how do you do it", as in "judiciously"? The use of <tm>
was my joke to indicate that there's much in that trademark that
had not been covered yet.

These unasked questions might include:

1. How do we identify the redundant variables in a regression
that cause the ill-effects of multicollinearity? (to drop)

2. What are some indicators of the severity of any multicollinearity
condition that clearly point to redundancy, and WHY
some variables should be dropped (by reason of "over-
fitting")?

3. How do we decide how many independent variables (predictors)
we should use when we have a large set of candidates?

4. How do we know when it is "judicious" to drop some of the
candidates in a tentative model, and how do we SELECT
which ones to drop and which ones to keep?

All of these and many other related questions can be answered
by using the Useful Tip of "The Elbow Rule".

I first learned of the term "Elbow Rule" from Joe Kruskal, in
1967, when he was Visiting Professor at Yale, giving a one-
semester course in Multidimensional Scaling, in his usage
in deciding how many dimensions one should use in applying
his nonmetric scaling methods in

"Nonmetric Multidimensional Scaling: A Numerical Method"
Joseph B. Kruskal, Psychometrika, 29:2 (June 1964), pp. 115-129.

The idea is simply this: He has a "Badness of Fit" measure
called the "Stress" associated with each Euclidean dimension
in which MDS is attempted to recover the configuration of
points whose interpoint Euclidean distances fit the matrix
of similarities or dissimilarities among a set of objects.

He would do a plot of the "Stress" vs the dimension n, for
n = 1, 2, ... up to some k. It would typically look something
like this (the vertical units are unimportant):

-    X
-
-
-
-
-
-
-          X
-
-                X
-
-                      X     X     X     X
-
-----1 --- 2 --- 3 --- 4 --- 5 --- 6 --- 7 --- ...

What the plot shows is that there is a large drop in the
Stress from 1 to 2 dimensions; a smaller drop from 2 to 3,
and the Stress will eventually be 0 (a perfect fit) if a high
enough dimension is fitted to the data.

If you imagine connecting the points Xs in the graph,
then you'll see a clear "Elbow" at n = 4, the 4th Euclidean
dimension, and the Rule says that's the dimension one
should use. Sometimes such a plot does not show any
clear "elbow", but if it does, what it means is that the
marginal gain from any added dimension beyond the Elbow
is negligible, and hence one should STOP at some dimension
at or before the point where the Elbow occurs.
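
The visual judgment described above can be imitated with a crude
numerical rule: stop at the first dimension whose next step improves
the fit by less than a small fraction of the total drop. This is only
an illustrative sketch. The function name, the 5% tolerance and the
stress values (read roughly off a plot of the kind sketched above) are
all assumptions, and no such rule replaces looking at the plot.

```python
def elbow(values, tol=0.05):
    """Return the 1-based dimension at which the curve flattens out.

    values[i] is the badness of fit (e.g. Stress) in dimension i + 1.
    The elbow is taken as the first dimension whose successor lowers
    the value by less than tol times the total observed drop.
    """
    total_drop = values[0] - min(values)
    if total_drop <= 0:
        return 1
    for i in range(len(values) - 1):
        if values[i] - values[i + 1] < tol * total_drop:
            return i + 1
    return len(values)  # no flattening found within the range tried


# Hypothetical stress values for dimensions 1 to 7.
stress = [12, 5, 3, 1, 1, 1, 1]
print(elbow(stress))
```

With these made-up values the rule returns dimension 4, the same place
the eye puts the Elbow. When the curve has no clear elbow the answer
depends heavily on the tolerance, which is exactly where judgment is
needed.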

The Elbow Rule, applied to a Multiple Regression problem
with "standard deviation of residuals" in the vertical axis
versus the number of predictor variables in the horizontal
axis turns out to be an immensely useful tool to help
answer any of the questions I posed earlier.

In Multiple Regression, the plot can actually INCREASE
after it has reached some low point. What that means is
that by using an additional parameter, the sum of
squares of the residuals may not decrease enough to
compensate for the reduction of the denominator (the
residual degrees of freedom) from n to (n-1), so that the
standard deviation of residuals increases. The region
beyond that point is what I call the "forbidden region" for
fitting -- because it is an unmistakable sign of
"over-fitting".
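
The arithmetic behind that turn-up is easy to check. The sketch below
assumes a model with an intercept and k predictors fitted to n
observations, so that s = sqrt(RSS / (n - k - 1)); the residual sums of
squares used are made-up numbers chosen only to illustrate the effect.

```python
import math


def residual_sd(rss, n, k):
    """Standard deviation of residuals for an intercept plus k predictors."""
    return math.sqrt(rss / (n - k - 1))


n = 20
s3 = residual_sd(16.0, n, 3)  # 3 predictors, 16 residual d.f.
s4 = residual_sd(15.5, n, 4)  # 4 predictors, 15 residual d.f.
print(s3, s4)
```

Here the fourth predictor lowers the residual sum of squares from 16.0
to 15.5, yet s rises, because 15.5/16.0 is larger than 15/16. In
general s goes up whenever the ratio of the new RSS to the old one
exceeds the ratio of the new residual degrees of freedom to the old.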

Typically, in a multiple regression problem, the number
of predictors to use is much smaller than the number at the
turn-up point of s. It is at the point of the Elbow or, in the
case of no clear elbow, at some point where the slope of the
tangent of the line/curve becomes close to zero. That's
where the ART of model-building comes in. Experience
will guide the user in making the judgment of how many
variables to keep (or drop) in the fitted model.

When the number of variables to use in fitting the model is
not clear, most users would try some variable selection
procedures such as Forward Selection, Backward
Elimination, Stepwise (Forward and Backward) procedure,
or All Possible Subsets.

FORW, BACK, or STEP generally result in different
combinations of variables for the best subset for each
dimension, but even if they yield the same ones, the
combinations may still NOT be the best (in terms of
goodness of fit).

In the light of our present understanding about
"multicollinearity", while we want to AVOID having
predictor variables that are multicollinear with each
other, we are seeking multicollinearity between the
X's and the DEPENDENT variable Y, because that
would mean that Y is nearly a linear combination
of the X's -- which is what we WANT.

That's the overwhelming reason WHY the Forward
and Stepwise selection procedures should be avoided,
because each starts with the X that has the highest
SIMPLE correlation with the dependent variable Y.

In FORW selection, that chosen X, say X3, will stay
in the model no matter what -- so that if Y fits
X1 and X2 perfectly (Y and X1, X2 are linearly
dependent) you'll NEVER find it in Forward Selection.
Stepwise selection which allows an entered variable
to be dropped usually doesn't do much better.

For the reason of seeking LINEAR DEPENDENCE of
Y on some subset of the X's, BACKWARD
elimination is much preferred, because it starts with
a full model (usually overfitting or suffering from the
ill effects of multicollinearity among the X's), drops
the LEAST useful X, in the presence of all other X's,
and continues that process. Thus if Y is nearly
linearly dependent on the COMBINATION of several
X's, while the correlation with each individual X may
be low, the BACKWARD selection procedure will
often, and nearly ALWAYS, find that combination.
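
The contrast between the two procedures can be reproduced on a small
synthetic example. The sketch below is an assumption-laden
illustration, not the procedure of any particular package: the data
are made up so that Y equals X1 - X2 exactly while X3 has the highest
simple correlation with Y, both searches are run to a fixed size of
two variables, and "most/least useful" is judged by the residual sum
of squares rather than by F-to-enter or F-to-remove thresholds.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward(X, y, size):
    """Add, one at a time, the variable that lowers the RSS the most."""
    chosen = []
    while len(chosen) < size:
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(min(rest, key=lambda j: rss(X, y, chosen + [j])))
    return sorted(chosen)


def backward(X, y, size):
    """Drop, one at a time, the variable whose removal raises the RSS least."""
    kept = list(range(X.shape[1]))
    while len(kept) > size:
        drop = min(kept, key=lambda j: rss(X, y, [c for c in kept if c != j]))
        kept.remove(drop)
    return sorted(kept)


# Hypothetical data: columns 0, 1, 2 hold X1, X2, X3.
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
d = 0.5 * rng.normal(size=n)
x1, x2 = z + d, z - d
y = x1 - x2                        # exact linear dependence on X1 and X2
x3 = y + 0.5 * rng.normal(size=n)  # highest simple correlation with y
X = np.column_stack([x1, x2, x3])

fwd, bwd = forward(X, y, 2), backward(X, y, 2)
print("forward :", fwd, rss(X, y, fwd))
print("backward:", bwd, rss(X, y, bwd))
```

Forward selection takes X3 first and can never shed it, so its
two-variable model misses the exact fit; backward elimination drops X3
first and keeps X1 and X2, with residuals that are zero up to rounding.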

If the number of predictor candidates is small to moderate,
such as 10, doing ALL possible subsets and then plotting
the s (SE) of the best one or two of each dimension on
the s versus k (number of indep vars) plot will enable one
to use the Elbow Rule, knowing the actual "best fit" in
each dimension, while allowing a SUBJECTIVE override
of choosing some combination which doesn't fit best, but
has much better behavior of the residuals, or satisfies some
other external criterion, to choose as the "best" fitted
model to use.
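
For a handful of candidates the all-subsets search is short enough to
write out directly. The following is a brute-force sketch only: it
uses plain enumeration rather than any of the efficient algorithms, it
keeps just the single best subset of each size, and the data are
synthetic with only the first two of five candidates actually
mattering. All names are hypothetical.

```python
import itertools
import math

import numpy as np


def best_s_by_size(X, y):
    """Map k -> (s, columns) for the best-fitting k-predictor subset."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, list(cols)]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            s = math.sqrt(float(resid @ resid) / (n - k - 1))
            if k not in best or s < best[k][0]:
                best[k] = (s, cols)
    return best


# Hypothetical data: y depends on columns 0 and 1 only.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

best = best_s_by_size(X, y)
for k in sorted(best):
    s, cols = best[k]
    print(k, round(s, 3), cols)
```

Plotting the printed s against k gives the s versus k plot to which
the Elbow Rule is applied; with data built this way the curve drops
sharply up to k = 2 and is essentially flat afterwards.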

Regarding the use of ALL possible subsets to select one
combination to use, there are many analytic criteria
proposed, such as Mallows' Cp (1973), Hocking's Jp
(1976), and a host of others (see e.g.

http://www.itc.virginia.edu/research/talks/sa01_05.pdf

or the textbook by Cook and Weisberg

http://www.stat.umn.edu/arc/

or other Applied Regression textbooks).

In my opinion, while those analytic measures have their
theoretical justification and merits, I believe they are not
nearly as effective and intuitively appealing as applying the
Elbow Rule to some fitting criterion, in a fitting criterion
vs dimension (number of fitted parameters) plot.
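
For readers comparing the two approaches, the fitting criterion
plotted under the Elbow Rule and Mallows' Cp can be written side by
side. The notation is an assumption, since the post defines none of
it: n observations, a subset model with an intercept and k predictors,
RSS_k its residual sum of squares, and sigma-hat squared the residual
mean square of the full model. This is the usual textbook form of Cp.

```latex
s_k = \sqrt{\frac{\mathrm{RSS}_k}{n - k - 1}},
\qquad
C_p = \frac{\mathrm{RSS}_k}{\hat{\sigma}^2} - n + 2(k + 1)
```

Both are functions of the same RSS_k; they differ only in how the
number of fitted parameters is penalized.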

-- Reef Fish Bob.

P.S. If any of you have seen a similar discussion of the
"Elbow Rule" in any Applied Regression Analysis textbook,
please give the reference so that I can use it for future
reference on the subject. As I said, I have read many books
on Regression Analysis, but have not seen any explicit
mention of Kruskal's "Elbow Rule" applied in an entirely
different context from the variable selection topic in Regression,

Jerry Dallal

Feb 10, 2006, 7:08:22 PM
Reef Fish wrote:
>
> P.S. If any of you have seen a similar discussion of the
> "Elbow Rule" in any Applied Regression Analysis textbook,
> please give the reference so that I can use it for future
> reference on the subject. As I said, I have read many books
> on Regression Analysis, but have not seen any explicit
> mention of Kruskal's "Elbow Rule" applied in an entirely
> different context from the variable selection topic in Regression,
>

Bob,

I'm eagerly looking forward to the rest of the series. While there are
many ways to come up with a good predictive model, I'm eager to see how
you address the question of determining which variables are driving the
system (as opposed to being surrogates). My stance is that one *can't*
without designing a better study.

Kruskal's "Elbow Rule Plots" are called Scree Plots in some quarters,
"scree" being "A slope of loose rock debris at the base of a steep
incline or cliff." The plots are not uncommon. See, for example, p477
of Neter, Wasserman, and Kutner "Applied Linear Statistical Models, 3rd"
for the plot, but not the name. (I don't have the latest edition at
home.) The discussion is not extensive, but it is there.

Texts are now starting to take notice of "all possible regressions".
Even if a text doesn't use plots, they've got to describe a procedure
that accomplishes much the same thing in order to make sense of the
numbers that are generated.

Frank Harrell calls them scree plots on page 161 of his "Regression
Modeling Strategies" (section 8.6: Data Reduction Using Principal
Components).

To be fair to stepwise and forward selection regression--NOT THAT I
WOULD EVER RECOMMEND THEM--it's worth noting that in the early days they
were offered as a way to deal with a situation where there were more
predictors than observations. In that case, (straightforward) backwards
elimination is not an option.

--Jerry

Reef Fish

Feb 11, 2006, 12:14:11 AM
Jerry Dallal wrote:
> Reef Fish wrote:
> >
> > P.S. If any of you have seen a similar discussion of the
> > "Elbow Rule" in any Applied Regression Analysis textbook,
> > please give the reference so that I can use it for future
> > reference on the subject. As I said, I have read many books
> > on Regression Analysis, but have not seen any explicit
> > mention of Kruskal's "Elbow Rule" applied in an entirely
> > different context from the variable selection topic in Regression,
> >
>
> Bob,
>
> I'm eagerly looking forward to the rest of the series. While there are
> many ways to come up with a good predictive model, I'm eager to see how
> you address the question of determining which variables are driving the
> system (as opposed to being surrogates). My stance is that one *can't*
> without designing a better study.

Your comment reminded me that I should have stated early in my post
that I was speaking exclusively about the "model building" aspects of
using multiple regression to find a good predictive model, and nothing
more than that. These are for the most part not "designed" studies
to address the specific causal or driving force behind certain
systems or mechanisms.

Rather, they are observational studies based on available data and
what are actually used by the professionals in the trade or profession.

A good example of this may be the market value of a house. Just think
of your city or a section within your city. If you go to any real
estate agency to look at the houses listed, you'll invariably find
many easily AVAILABLE data about each house: the number of sq. feet,
the number of bedrooms, bathrooms, the age of the house, etc.,
together with the asking price. The question may be this: can you
find a better model to fit the selling price (which often differs from
the asking price, but the information is available) of a house than
the crude estimate of $ per sq. foot, which is also often used as a
rough estimate to build a new house?

That is an example of using regression to find a good predictive
model. The model doesn't "explain" anything, but only finds certain
readily available "surrogates" (or what others and I call "proxy"
variables) to predict something which is KNOWN to be driven by
something ELSE, such as the quality of the building material, and
other less tangible or even unmeasurable characteristics. All you
care about is to sort out from a large number of EASILY available
predictors to arrive at some usable prediction model.

An even better example of this is the prediction of students' college
GPA a year or two after their admission, based on the information
on their applications for admission -- in order to find a criterion
(model) to admit qualified students who will have a reasonable chance
NOT to drop out after a year or two. Almost all major colleges and
universities have some kind of model for their admission based on
such variables as: SAT scores, high school GPA, rank in the
graduating class, and other aptitude or performance data. But
none of those are the REAL information that will predict a student's
success in college, some of which are: How motivated is the student
in learning; how hard working is the student; how intelligent is
the student (and I don't mean IQ score either, which is a proxy
for intelligence), and so on. But those "explanatory" or "causal"
variables are NOT available or measurable. So, using what's
available is the best one can do, knowing that they don't explain
anything, but try to do a good prediction job.

The literally thousands of real data sets on real problems that
have been analyzed by my students in the Data Analysis course
are of this predictive (or associative) type. Not the type
based on designed studies to ascertain the real mechanism or
cause -- and as we know those studies have their own problems
of inability to do any REAL controlled and designed studies using
the REAL human subjects (for obvious reasons).

So, I am not touching any of those more controversial subjects
in my "model building" discussion -- ONLY those whose sole
objective is to be able to find a combination of available data to
predict some variable well, if it's possible at all.

>
> Kruskal's "Elbow Rule Plots" are called Scree Plots in some quarters,
> "scree" being "A slope of loose rock debris at the base of a steep
> incline or cliff."

That's the first time I've heard of that term!

> The plots are not uncommon. See, for example, p477
> of Neter, Wasserman, and Kutner "Applied Linear Statistical Models, 3rd"
> for the plot, but not the name. (I don't have the latest edition at
> home.) The discussion is not extensive, but it is there.

I've taught a senior-level undergrad course from several editions
of the Neter (U of Ga), Wasserman, later joined by Kutner (Emory)
and Nachtsheim, from the 1st edition through the 3rd, but I can
only recall their discussion of the model building and variable
selection parts of regression as being quite superficial. Of course
I don't have ANY of those textbooks or any other textbooks now
to look at any of the pages because I've given all my books away. :-)

> Texts are now starting to take notice of "all possible regressions".

Nah. Textbooks that talk about all possible regressions have been
around for a LONG time -- how else would I have known about
the Hamiltonian path to compute all possible regressions before
1970? Even the oldies like the first edition of Draper and Smith
showed all possible regressions on the Hald data (I remember
that example well).

The fact that you know the term "scree" and I don't suggests that
you may be more familiar with textbooks outside of the "statistics
proper" discipline, or more recent textbooks than those I've seen.

> Frank Harrell calls them scree plots on page 161 of his "Regression
> Modeling Strategies" (section 8.6: Data Reduction Using Principal
> Components).

Regression Using Principal Components is not a topic I would
consider to be within the "standard areas of coverage" of a
Regression Analysis course.

> To be fair to stepwise and forward selection regression--NOT THAT I
> WOULD EVER RECOMMEND THEM--it's worth noting that in the early days they
> were offered as a way to deal with a situation where there were more
> predictors than observations.

I've NEVER seen an example of THAT kind of nonsense in overfitting,
nor have I seen it offered anywhere as an excuse for the Forward
Selection method. Jerry, I infer we must have lived in very different
worlds of statistics. :-)

It's bad enough to start with data whose sample size n is no more
than double or triple the size of p (number of predictors), if one is
to have sufficient degrees of freedom left for a reasonable treatment
of the inference part. I would characterize regression problems in
which p is greater than n as "overfitting with a vengeance", in the
same sense as Tukey said of the use of regression coefficients as
"sweeping dirt under the rug with a vengeance".

> In that case, (straightforward) backwards elimination is not an option.

There are some large data sets where p may be sufficiently large
to make backward elimination a problem because of the
storage size constraints by a software program or a computer.
But those are very special situations outside of my intended
scope of discussion.
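
Why a full model cannot even serve as a starting point when p exceeds
n can be seen in a few lines. In the sketch below, which uses made-up
data and NumPy's least-squares routine purely for illustration, the
response is pure noise with no relation to the predictors, yet the fit
is "perfect".

```python
import numpy as np

# Hypothetical data: 10 observations, 15 predictors, y unrelated to X.
rng = np.random.default_rng(2)
n, p = 10, 15
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# With more columns than rows, least squares reproduces y exactly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
max_abs_resid = float(np.max(np.abs(y - X @ coef)))
print(max_abs_resid)
```

The residuals vanish up to rounding, leaving no residual degrees of
freedom and hence no s from which a backward elimination could judge
which variable is least useful.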

Thanks for your comments, which allowed me to better specify and
restrict my intended discussion to the "predictive" aspects of
model building, in the sense used by Box, Tukey, and Schatzoff
(from whom I learned much through his COSMO software at
MIT and Harvard, one of the earliest "Console Oriented Model
Building" software packages in the 1960s).

See: http://www.stat.harvard.edu/People/Department_History.html
"Dr. Martin Schatzoff, a department alumnus and IBM staff member,
was hired to teach graduate students how to use the computer
for data analysis and to explore the use of computers for teaching"

That was long before there was any
widely available "time-shared" computing for interactive usage,
when IBM was struggling with its TSO (time-shared Operating
System) while Yale was trying to build its own OS from scratch
-- CYTOS (Conversational Yale Terminal Operating System),
which terminated Carl Roessler's career as the Director of the
Yale Computer Center (before 1970) and turned him into a much
more successful professional Underwater Photographer, after he
was invited by Yale to leave. :-)

-- Bob.

Jerry Dallal

Feb 11, 2006, 6:04:33 AM
Reef Fish wrote:
> Jerry Dallal wrote:

>> To be fair to stepwise and forward selection regression--NOT THAT I
>> WOULD EVER RECOMMEND THEM--it's worth noting that in the early days they
>> were offered as a way to deal with a situation where there were more
>> predictors than observations.
>
> I've NEVER seen an example of THAT kind of nonsense in overfitting,
> nor have I seen it offered anywhere as an excuse for the Forward
> Selection method. Jerry, I infer we must have lived in very different
>
> worlds of statistics. :-)
>

It's hard to remember back that far, but my memory says that while that
may not have been *the* reason given to justify forward selection
regression, it was certainly one of them. Since you're so much older
than I :-), you saw the technique closer to the time when it was first
introduced, whereas I came along at a point where practitioners had
thought of a few cases where it might actually be useful.

> It's bad enough to start with data whose sample size n is no more
> than double or triple the size of p (number of predictors), to have
> sufficient degress of freedom left for a reasonable treatment of the
> inference part. I would characterize regression problems in which
> p is greater than n "overfitting with a vengeance", in the same
> sense as Tukey said of the use of regression coefficients as
> "sweeping dirt under the rug with a vengeance".

However, today it's not an uncommon situation, especially in biomedical
research (and in some quarters it's always been the case!), even when
the goal is only to find a predictive model. Many tests and measurement
procedures are automated. Ask for X1, get X1-X97. Draw a tube of blood
for a white blood count and it's only a push of a button to get a
complete profile from nutrients to lipids. Does one throw away the
data? How can the additional variables be used sensibly? (I ask the
questions rhetorically, for the moment.)

Move over to the search for causal mechanisms, and you've got the
sciences of genomics and bioinformatics trying to tackle this question
with a vengeance...and not doing very well, mainly because there's
nothing much that can be done for reasons you give in your last
(unsnipped) paragraph.

Your going on your tear has prompted me to start one of my own (which
I'll take up over on my web pages), tackling the problem from a
different angle. My concern is that even when the techniques are done
textbook-perfectly in the manner you (will) describe, the results may be
garbage because the analytic technique is inconsistent with the research
question. It's easy to see how this happens. Statisticians typically
present a technique in a mathematical context, leaving it up to the
practitioner to see that it is applied properly. However, the ready
availability of computers and statistical program packages means that
anyone with the money (or the ability to get someone else to pay) can
apply the methods without any regard of how they were intended to be used.

Greg Heath

Feb 11, 2006, 12:51:37 PM

Reef Fish wrote:
> Jerry Dallal wrote:
> > Reef Fish wrote:

-----SNIP


> Regression Using Principal Components is not a topic I would
> consider to be within the "standard areas of coverage" of a
> Regression Analysis course.
>
> > To be fair to stepwise and forward selection regression--NOT THAT I
> > WOULD EVER RECOMMEND THEM--it's worth noting that in the early days they
> > were offered as a way to deal with a situation where there were more
> > predictors than observations.
>
> I've NEVER seen an example of THAT kind of nonsense in overfitting,
> nor have I seen it offered anywhere as an excuse for the Forward
> Selection method. Jerry, I infer we must have lived in very different
> worlds of statistics. :-)
>
> It's bad enough to start with data whose sample size n is no more
> than double or triple the size of p (number of predictors), to have
> sufficient degress of freedom left for a reasonable treatment of the
> inference part. I would characterize regression problems in which
> p is greater than n "overfitting with a vengeance", in the same
> sense as Tukey said of the use of regression coefficients as
> "sweeping dirt under the rug with a vengeance".
>
> > In that case, (straightforward) backwards elimination is not an option.
>
> There are some large data sets where p may be sufficiently large
> to make backward elimination a problem because of the
> storage size constraints by a software program or a computer.
> But those are very special situations ourside of my intended
> scope of discussion.

Nevertheless, how would you advise someone with n =158 cases of
stellar spectra with measurements at p = 2001 wavelengths? The
objective is to design a logistic model to predict one of 5 luminosity
classes.

Hope this helps.

Greg

Paige Miller

Feb 11, 2006, 1:11:37 PM
On 2/11/2006 12:51 PM, Greg Heath wrote:
>
> Nevertheless, how would you advise someone with n =158 cases of
> stellar spectra with measurements at p = 2001 wavelengths? The
> objective is to design a logistic model to predict one of 5 luminosity
> classes.

Partial Least Squares with dummy variables as the Y values, or
variants thereof, fits the application perfectly. You will probably
find examples of similar applications in the literature. I did one
myself a while back.

--
Paige Miller
pmil...@rochester.rr.com

It's nothing until I call it -- Bill Klem, NL Umpire
If you get the choice to sit it out or dance,
I hope you dance -- Lee Ann Womack

Reef Fish

Feb 11, 2006, 1:13:05 PM

I would advise that someone to look into some special techniques in
spectral analysis; Multiple Regression is NOT the proper
tool to use.
>
> Hope this helps.

It does, Greg. It helps people realize that the multiple regression
model building methods are not the panacea for all modelling problems,
no more so than a steamroller is suitable to iron a dress shirt.

-- Bob.

Reef Fish

Feb 11, 2006, 2:58:24 PM
Your two consecutive posts were 1 minute apart in my google newsreader.
So, I assume that was either an unintended duplicate or whatever you
modified is in this version.

Jerry Dallal wrote:
> Reef Fish wrote:
> > Jerry Dallal wrote:
>
> >> To be fair to stepwise and forward selection regression--NOT THAT I
> >> WOULD EVER RECOMMEND THEM--it's worth noting that in the early days they
> >> were offered as a way to deal with a situation where there were more
> >> predictors than observations.
> >
> > I've NEVER seen an example of THAT kind of nonsense in overfitting,
> > nor have I seen it offered anywhere as an excuse for the Forward
> > Selection method. Jerry, I infer we must have lived in very different
> >
> > worlds of statistics. :-)
> >
>
> It's hard to remember back that far, but my memory saya that while that
> may not have been *the* reason given to justify forward selection
> regression, it was certainly one of them. Since you're so much older
> than I :-),

No, I am only a tad older than you (our attendance at the same Stat
Department even overlapped, in the years it took you 7 years to
my 4 to complete the Ph.D. degree :-),

But on your "older than I" statement, did you and G. W. Bush attend
the same English class? :-))


> you saw the technique closer to the time when it was first
> introduced, whereas I came along at a point where practitioners had
> thought of a few cases where it might actually be useful.
>
> > It's bad enough to start with data whose sample size n is no more
> > than double or triple the size of p (number of predictors), to have
> > sufficient degress of freedom left for a reasonable treatment of the
> > inference part. I would characterize regression problems in which
> > p is greater than n "overfitting with a vengeance", in the same
> > sense as Tukey said of the use of regression coefficients as
> > "sweeping dirt under the rug with a vengeance".
>
> However, today it's not an uncommon situation, especially in biomedical
> research (and in some quarters it's always been the case!), even when
> the goal is only to find a predictive model. Many tests and measurement
> procedures are automated. Ask for X1, get X1-X97. Draw a tube of blood
> for a white blood count and it's only a push of a button to get a
> complete profile from nutrients to lipids. Does one throw away the
> data? How can the additional variables be used sensibly? (I ask the
> questions rhetorically, for the moment.)

My answer would be -- if you ask for X1, and get X1-X97, then throw
the X2-X97 away.

Statistical data analysis is NOT like Mount Everest climbing -- where
"because it's there" is a good enough reason for the adventurous
explorer, but not good enough for something that requires common
sense and good judgment besides technical skills.

>
> Move over to the search for causal mechanisms, and you've got the
> sciences of genomics and bioinfomatics trying to tackle this question
> with a vengeance...and not doing very well, mainly because there's
> nothing much that can be done for reasons you give in your last
> (unsnipped) paragraph.

I do believe in the "you can't squeeze blood out of a turnip" Law.


> Your going on your tear has prompted me to start one of my own (which
> I'll take up over on my web pages), tackling the problem from a
> different angle.

I certainly look forward to reading that.

> My concern is that even when the techniques are done
> textbook-perfectly in the manner you (will) describe, the results may be
> garbage because the analytic technique is inconsistent with the research
> question.

A minor clarification for your use of "you (will)" above. While that
generic "you" of yours does apply to many of our readers here, and
even applies to some/many who publish in certain medical/biomedical
journals, that "you" CANNOT possibly be this ME.

That is one of the reasons WHY a competent USER of statistical
methods requires as much specialized training as a medical doctor
or a neurosurgeon. That is also the reason WHY there is a
joint program at Harvard for MEDICAL researchers who make
serious use of statistics to earn BOTH a Harvard M.D. degree
AND a Harvard Ph.D. degree in Statistics. I had one Harvard
M.D. in that program in my Data Analysis course for the
Statistics Ph.D. program when I taught there. The M.D. prepares
the researcher for the medical knowledge, while the Ph.D. prepares
the same researcher for doing the statistical analysis properly,
leaving little room for the "Garbage in; Garbage out" syndrome.

> It's easy to see how this happens. Statisticians typically
> present a technique in a mathematical context, leaving it up to the
> practitioner to see that it is applied properly.

See the above paragraph. A competent practitioner who makes
use of statistics as the research tool needs to be competent in
BOTH the subject matter (medicine, biology, or whatever), AND
the field of Statistics.

In the Real World, it is unfortunate that MOST of the
malpractitioners of statistics may be quite competent in their SUBJECT
matter of sociology, economics, or epidemiology, but grossly
underestimate their need to acquire competence in the use
of statistical methods -- which takes FAR more (of which the
Harvard Ph.D. in Stat or equivalent may be the MINIMUM
requirement) than having an SPSS Manual and reading a
chapter or two from some statistics books.

> However, the ready
> availability of computers and statistical program packages means that
> anyone with the money (or the ability to get someone else to pay) can
> apply the methods without any regard of how they were intended to be used.

I agree with you 100% on your assessment of the REALITY. It
doesn't take a "rocket scientist" (I don't know how/where that
expression originated) or even a competent undergraduate
trained in Statistics to recognize THAT reality.

One of the most commonly recognized quotes is that by Benjamin
Disraeli, "There are Lies, Damned Lies, and Statistics." What
was missing in that line is the qualification of "Statistics" by
"Statistics produced or interpreted by those untrained in the
proper use of statistics".

For a modern day application of Disraeli's quote, see :-)

http://tinyurl.com/aj7ro

*> Benjamin Disraeli, British Prime Minister in the 19th century would
*> remark.
*> "There are lies, damned lies, and then there are statistics"
*> For Bush guilty on all three counts.

If Disraeli had included the missing part I inserted above, his quote
would have been as universally ignored or overlooked as this remark
by Karl Pearson:

Pearson> there is QUACKERY in science as there is quackery in
Pearson> medicine And EVEN where there is no quackery there is
Pearson> IGNORANCE and DOGMA parading before the public
Pearson> as knowledge

DZ> (no, it's not Reef Fish 2005 on sci.stat.math, and yes I took
DZ> liberty to capitalize a few random words :-)


More fully (cited from my post on June 23, 2005, quoted from Pearson
by DZ, from Biometrika, 29:161-248):

"As I grow older I feel more and more need not only for the censores
morum, but for censores scientiarum, a species of watch dogs of
science, whose duty it shall be not only to insist upon HONESTY and
LOGIC in scientific procedure, but who shall warn the public against
appearances of knowledge where we are as yet in a state of
ignorance. In this age of self-advertisement, when an individual may
become famous in twenty four hours by aid of the illustrated daily
press, there is QUACKERY in science as there is quackery in
medicine. And even where there is no quackery there is IGNORANCE and
DOGMA parading before the public as knowledge, and taking its TOLL
from the community by a multiplicity of devices. In many ways the
trained scientific man can WARN the public, even when it lacks
acquaintance with specialized detail... Unfortunately at the present
time no theory of what we may term scientific logic is taught to
students of science in our universities, and the result is only too
patent in 50% and more of so-called scientific publications."


In my follow-up post to DZ's, I wrote,

RF> I wish I could have said it half as eloquently as this author did.


RF> I didn't realize I was in such good company of Karl and Egon
RF> Pearson,
RF> and that I have been preaching in sci.stat.*, and playing the same
RF> Watch Dog role as they did. I pointed out the "scientific
RF> quackery" committed REGULARLY by some of the prolific posters in
RF> the sci.stat.* newsgroups


Recapitulating what Jerry had said earlier in his post:

> Your going on your tear has prompted me to start one of my own (which
> I'll take up over on my web pages), tackling the problem from a
> different angle.

Jerry, I thought you meant you would start on the angle of Tips on
the proper application of designed experiments to ascertain "cause"
and other "effects" in medical/biomedical studies.

Perhaps you did mean that.

But in the rest of your post, you seem to have taken a detour onto
a road I've been touring for months in these groups, about
what contributes to the malpractice and quackery in Statistics, as
was eloquently said by Karl Pearson long before even
*I* was born! :-)

Perhaps you should give a brief PREVIEW of your coming
attraction, by clarifying your statement of

"tackling the problem from a different angle".

In any event, I am glad to see you mention and recognize
some of the ongoing ills and their causes in the (mal)practice of
Statistics, and sing a tune which I had more or less thought I
was singing solo to a disgruntled audience, with the
clear exception of the poster DZ, cited above.

-- Reef Fish Bob.

Jerry Dallal

Feb 11, 2006, 4:31:42 PM
Reef Fish wrote:

> Perhaps you should give a brief PREVIEW of your coming
> attraction, by clarifying your statement of
>
> "tackling the problem from a different angle".

I believe that for any method to be understood properly, it has to be
discussed in the context of research questions. As you noted earlier,

"I should have stated early in my post that I was speaking exclusively
about the 'model building' aspects of using multiple regression to find
a good predictive model, and nothing more than that. These are for the
most part not 'designed' studies to address specific causal or driving
force behind certain system or mechanisms."

Had someone come to you with a data from a designed study and certain
research questions about causality to address, you would not be
instructing them on "Tips on Model Building via Multiple Regression.
Part I (The Elbow Rule)". My concern is that even when investigators
know the mechanics of methods well, they are often shaky about when to
apply them.

The "different angle" comes from being unable to imagine myself starting
a discussion about modeling without beginning with something like your
comment that I quoted in the last paragraph. Before I start talking
about any specific techniques, I tell my students that "the analysis
must be consistent with the research question". I even force them to
say it as a group in a loud voice many times, to impress its importance
upon them. As they are learning about techniques, I want them thinking
constantly about when it would be appropriate and inappropriate to apply
them.

I doubt I'll say anything different from you, but the perspective and
emphasis may be different.

Greg Heath

Feb 11, 2006, 7:04:11 PM

Paige Miller wrote:
> On 2/11/2006 12:51 PM, Greg Heath wrote:
> >
> > Nevertheless, how would you advise someone with n =158 cases of
> > stellar spectra with measurements at p = 2001 wavelengths? The
> > objective is to design a logistic model to predict one of 5 luminosity
> > classes.
>
> Partial Least Squares with dummy variables as the Y values, or
> variants thereof, fits the application perfectly. You will probably
> find examples of similar applications in the literature. I did one
> myself a while back.

Hi Paige,

I knew that would be your answer (I've asked you about this before).
I just wanted to see if Bob and/or Jerry had any feasible
alternatives.

Thanks.

Greg

Greg Heath

Feb 11, 2006, 7:21:45 PM

Curiously, principal coordinate analysis for dimensionality reduction
and neural networks (with multiple logistic regression a special case
of no hidden layers) for classification seems to be the most common
approach.

Hope this helps.

Greg

Jerry Dallal

Feb 11, 2006, 8:07:55 PM

Greg,

I haven't a clue about what your research question is. I've no sense of
what it means to use wavelength measurements of stellar spectra to
predict one of 5 luminosity classes. I infer that there's some way to
determine luminosity class other than through this model or you wouldn't
know what you were predicting. If you had been a potential client
approaching me to take on a consulting contract, I'd suggest trying
someone else because of my lack of any familiarity with your field. If
you insisted and somehow got me to agree despite my better judgment,
you'd be spending a *lot* of time (on the clock) bringing me up to speed.

In general, dimensionality reduction techniques--principal components,
factor analysis, canonical correlation coefficients, and PLS--are
ill-advised because there's no guarantee that the results will even be
related to the research question, let alone the answer to it. For
example, with principal components regression, the response could turn
out to be essentially the last principal component! All PC regression
would give you is a set of predictors orthogonal to what you're looking
for!
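The principal components point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the three spreads in `scales` are arbitrary, and the response is deliberately built from the last (smallest-variance) component, which is the worst case described above rather than a typical one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Three simulated predictors with very different spreads (assumed values).
scales = np.array([10.0, 3.0, 0.1])
X = rng.normal(size=(n, 3)) * scales
X -= X.mean(axis=0)

# Principal component scores from the SVD of the centered X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s  # columns are PC1, PC2, PC3 in decreasing variance

# Hypothetical response driven entirely by the LAST principal component.
y = scores[:, 2] + rng.normal(scale=0.01, size=n)
y -= y.mean()

def r_squared(Z, y):
    """R^2 of a least-squares fit of centered y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1.0 - (resid @ resid) / (y @ y)

r2_first_two = r_squared(scores[:, :2], y)  # PCR keeping the two "big" PCs
r2_all_three = r_squared(scores, y)         # equivalent to OLS on X

print(f"R^2 using PC1-PC2: {r2_first_two:.4f}")
print(f"R^2 using PC1-PC3: {r2_all_three:.4f}")
```

In this construction the two high-variance components explain almost nothing of the response, while adding the discarded third component recovers nearly all of it.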

However, the last paragraph is a generalization. It may well be that
there's a problem somewhere for which these techniques will provide the
answer. But, since I don't understand your problem, there's no way I
could suggest a solution.

That said, you might get "lucky", that is, what you're looking for might
be so dramatic that it jumps out no matter what you do. Unlikely, but
then it wouldn't be called "luck" if it happened.

Reef Fish

Feb 11, 2006, 10:26:34 PM

Jerry Dallal wrote:
> Reef Fish wrote:
>
> > Perhaps you should give a brief PREVIEW of your coming
> > attraction, by clarifying your statement of
> >
> > "tackling the problem from a different angle".
>
> I believe that for any method to be understood properly, it has to be
> discussed in the context of research questions.

I don't think you are using the term "research questions" quite
properly, from what you said in your NEXT paragraph, and the rest of
your post, having cited my paragraph below.

> As you noted earlier,
> "I should have stated early in my post that I was speaking exclusively
> about the 'model building' aspects of using multiple regression to find
> a good predictive model, and nothing more than that. These are for the
> most part not 'designed' studies to address specific causal or driving
> force behind certain system or mechanisms."

I made it clear that I was addressing "predictive" questions. However
that made it no less "research questions" than "research questions"
that involves a designed study involving causality.

As a matter of fact, the examples I cited are certainly "research
questions" -- doing the data analytic research in search of usable
or useful predictive models for "Price of Housing" in the first
example, and "GPA of students after one year of admission at a
particular university" in the second. I would go so far as to say
that ALL the model building projects my students did in the Data
Analysis course were "research questions" -- on how to find models to
predict various things that are used in the Real World. Not
textbook exercises on contrived numbers.

>
> Had someone come to you with a data from a designed study and certain
> research questions about causality to address, you would not be
> instructing them on "Tips on Model Building via Multiple Regression.
> Part I (The Elbow Rule)".

You have now clarified your mis-spoken use of "research questions".

What you really meant was "research questions about causality based
on a designed study".

Then, that's an entirely different setting for a discussion. It's no
longer even in the realm of a discussion in REGRESSION -- because that
may not even be appropriate or applicable!

So, I am glad that you clarified what you meant, at my goading, and
while I'll be glad to discuss my OPINION on the scarcity of VALID studies
of this kind and the abundance of INVALID approaches that have
often been used, let's make it clear that we are NO LONGER in
the realm of "model building" in the sense I had defined the usage
(a la Box, Tukey, Schatzoff, and including myself), and the Tips
about the use of Multiple Regression within the context of RESEARCH
in seeking predictive models (without worrying about explanatory
cause).

I think that's all the clarification we need at this time, and I am
glad I asked.


> The "different angle" comes from being unable to imagine myself starting
> a discussion about modeling without beginning with something like your
> comment that I quoted in the last paragraph. Before I start talking
> about any specific techniques, I tell my students that "the analysis
> must be consistent with the research question".

This merely confirmed your BIAS and MIS-USE of the term "research
question". Read the comment you quoted again, more carefully this
time. I don't think you'll argue that the paragraph you cited is
inconsistent with anyone's definition of a "research question",
unless you choose to consider only the narrow meaning of
"certain research" that is NOT of the purely predictive kind.


You can go back to the beginning of science and you'll see
"causal research" is only a very small part of all research, and the
ones that had been attempted in "causal research" had been done
VERY badly, in my opinion.


Aristotle and Ptolemy researched planetary motion in
terms of cycles and epicycles;

http://csep10.phys.utk.edu/astr161/lect/retrograde/aristotle.html

Copernicus, Kepler, Galileo, and many other early scientists
researched planetary motion with simpler models (ultimately
Kepler's ellipses) that proved to be more correct apart from simplicity.

http://tinyurl.com/ahhte

This was all PREDICTIVE research. Nobody was trying to
explain the CAUSE of the motion, because some had already
assumed God was the cause of it all.

Actually that's the ANSWER to all YOUR "research questions" --
at least according to some religious zealots. Why would you
need to design any study? ;^) God and Prayers are the CAUSE
of everything on earth.


> I even force them to
> say it as a group in a loud voice many times, to impress its importance
> upon them. As they are learning about techniques, I want them thinking
> constantly about when it would be appropriate and inappropriate to apply
> them.

Nothing wrong with THIS part. To impress upon researchers the
PURPOSE and OBJECTIVE of the research -- and that was what
MY paragraph was about -- except Jerry Dallal erred in his implication
that predictive research questions are NOT research questions.


> I doubt I'll say anything different from you, but the perspective and
> emphasis may be different.

You've already said a mouthful DIFFERENT from what I said
and meant, as I explained above. :-)

I am sure you'll also have ideas very different from mine about YOUR
kind of "research questions about cause, based on designed
experiments".

We just need to make clear what it is that we are discussing, from
a clear, logical, and definitional point of view.

-- Bob.

Jerry Dallal

Feb 11, 2006, 11:33:27 PM
Reef Fish wrote:
> Jerry Dallal wrote:
>> Reef Fish wrote:
>>
>>> Perhaps you should give a brief PREVIEW of your coming
>>> attraction, by clarifying your statement of
>>>
>>> "tackling the problem from a different angle".
>> I believe that for any method to be understood properly, it has to be
>> discussed in the context of research questions.
>
> I don't think you are using the term "research questions" quite
> properly,
> from what you said in your NEXT paragraph, and the rest of your
> post, having cited my paragraph below.
>
>> As you noted earlier,
>> "I should have stated early in my post that I was speaking exclusively
>> about the 'model building' aspects of using multiple regression to find
>> a good predictive model, and nothing more than that. These are for the
>> most part not 'designed' studies to address specific causal or driving
>> force behind certain system or mechanisms."
>
> I made it clear that I was addressing "predictive" questions. However
> that made it no less "research questions" than "research questions"
> that involves a designed study involving causality.
>

Bob,

It's not worth beating to death, but I never said that predictive
modeling can't be a research question. I want the research
question--whatever it is--stated explicitly and understood, from the
top, before anything else. My point was intended to be that the comment
you added after the initial tutorial is one that would have started
mine. I don't think we disagree, only that we might have different ways
of telling essentially the same story.

--Jerry

Reef Fish

Feb 12, 2006, 2:22:32 AM

It's the implication of your statements that carried your UNINTENDED
meaning (as you say now) -- but I have to stand on WHY I read it the
way I did, as I had explained in my post. While English is not my
native language, I am quite certain in this case your written language
betrayed your intention.

I'll have to let that stand.


Look at your statement purely from the point of view of an English
sentence, stating your point about "research questions" <I merely
masked some qualifying words>

JD> Had someone come to you with a data from <.> certain
JD> research questions about <.>, you would not be
JD> instructing them on "Tips on Model Building via Multiple
JD> Regression. Part I (The Elbow Rule)".

That's why I said something to THIS effect,

"Why not? Those examples I cited and other projects in my Data
Analysis courses were ALL real data from <student-selected>
research questions about <various predictive projects>, why
would I NOT be instructing them on "Tips on ..." as I did?"

In any event, given what you wrote, whether or not other readers
read it the way I did, I wanted to emphasize
this point:

RF> I made it clear that I was addressing "predictive" questions.
RF> However that made it no less "research questions" than "research
RF> questions" that involves a designed study involving causality


> My point was intended to be that the comment
> you added after the initial tutorial is one that would have started
> mine. I don't think we disagree, only that we might have different ways
> of telling essentially the same story.

Time will tell.

-- Bob.

Jos Jansen

Feb 12, 2006, 3:22:00 AM

"Reef Fish" <Large_Nass...@Yahoo.com> wrote in message
news:1139609314....@f14g2000cwb.googlegroups.com...
> Preface.
>

<snip>

>
> Regarding the use of ALL possible subsets to select one
> combination to use, there are many analytic criteria
> proposed, such as Mallows' Cp (1973), Hocking's Jp
> (1976), and a host of others (see e.g.
>
> http://www.itc.virginia.edu/research/talks/sa01_05.pdf
>
> or in the textbook by Cook and Weisberg
>
> http://www.stat.umn.edu/arc/
>
> or other Applied Regression textbooks.
>
> In my opinion, while those analytic measures have their
> theoretical justification and merits, I believe they are not
> nearly as effective and intuitively appealing as applying the
> Elbow Rule to some fitting criterion, in a fitting criterion
> vs dimension (number of fitted parameters) plot.
>

A plot of RSS (residual sum of squares) versus p (the number of fitted
parameters) supports the Elbow Rule in a natural way, and doesn't have the
turn-up behaviour of the plot of s or s^2 versus p. I have never seen a
reference to this way of presentation; probably this is due to the
bewitching attention given to Mallows' Cp in the literature.

Jos Jansen
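The contrast between the two plots can be sketched numerically. The example below assumes simulated data in which only the first 3 of 12 predictors matter, and it adds predictors in a fixed order rather than searching for the best subset at each size, which is a simplification. RSS can never increase when a predictor is added to a nested least-squares fit, whereas s^2 = RSS / (n - number of fitted parameters) rises whenever the added predictor lowers RSS by less than one residual mean square.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_max = 30, 12
X = rng.normal(size=(n, p_max))

# Assumed truth for illustration: only the first 3 predictors matter.
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(size=n)

rss, s2 = [], []
for p in range(1, p_max + 1):
    # Intercept plus the first p predictors (p + 1 fitted parameters).
    Z = np.column_stack([np.ones(n), X[:, :p]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    rss.append(float(resid @ resid))
    s2.append(rss[-1] / (n - p - 1))  # residual mean square

for p, (r, v) in enumerate(zip(rss, s2), start=1):
    print(f"predictors={p:2d}  RSS={r:9.3f}  s^2={v:7.3f}")
```

Plotting the `rss` column against the number of predictors gives the monotone curve described above; plotting `s2` gives the curve that is free to turn up once only noise is being added.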

Greg Heath

Feb 12, 2006, 7:41:14 AM

Reef Fish wrote:

-----SNIP

Bob,

My "how to" interpretation of your post is summarized below:

1. Select an appropriate fitting criterion, FC, to minimize
2. Obtain the best model for each value of q (q = 1,2...p)
3. If p <= pthresh (say 10) use all possible regressions
4. If p > pthresh, use backward elimination
5. Plot FC vs q and apply the elbow rule.
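The backward-elimination branch of the steps above (steps 2, 4 and 5) can be sketched as follows. This is a minimal illustration, not the procedure from the original post: the fitting criterion FC is assumed to be RSS, the data are simulated, and the helper names `rss_of` and `backward_path` are hypothetical. The elbow itself is still meant to be judged by eye from the printed FC-versus-q table.

```python
import numpy as np

def rss_of(X, y, cols):
    """Residual sum of squares for an intercept plus the given columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def backward_path(X, y):
    """Return {q: (columns kept, RSS)}; at each step drop the predictor
    whose removal raises RSS the least."""
    cols = list(range(X.shape[1]))
    path = {len(cols): (tuple(cols), rss_of(X, y, cols))}
    while len(cols) > 1:
        trials = [(rss_of(X, y, [c for c in cols if c != d]), d) for d in cols]
        best_rss, drop = min(trials)
        cols.remove(drop)
        path[len(cols)] = (tuple(cols), best_rss)
    return path

# Simulated data (assumption): 8 candidate predictors, 3 of them real.
rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(size=n)

path = backward_path(X, y)
for q in sorted(path):
    kept, fc = path[q]
    print(f"q={q}  FC(RSS)={fc:9.2f}  kept={kept}")
```

With data like these the table shows FC falling steeply up to q = 3 and flattening afterwards, which is the kind of bend the elbow rule looks for.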

Now, a few questions:

1. What FC do you recommend?
2. Is it worthwhile to use backward stepwise (allowing
a rejected variable to reappear)?
3. What do you recommend if p is too large to consider
obtaining p models via a backward search?
4. What do you recommend if p > n and additional
observations are not available?

Hope this helps.

Greg

Jerry Dallal

Feb 12, 2006, 8:22:42 AM
Reef Fish wrote:
> Jerry Dallal wrote:
>> Reef Fish wrote:
>>> Jerry Dallal wrote:
>>>> Reef Fish wrote:

> Look at your statement purely from the point of view of an English
> sentence, stating your point about "research quetions" <I merely
> masked some qualifying words>
>
> JD> Had someone come to you with a data from <.> certain
> JD> research questions about <.>, you would not be instructing
> JD> instructing them on "Tips on Model Building via Multiple
> Regression.
> JD> Part I (The Elbow Rule)".
>
> That's why I said something to THIS effect,
>
> "Why not? Those examples I cited and other projects in my Data
> Analysis courses were ALL real data from <student-selected>
> research questions about <various predictive projects>, why
> would I NOT be instructing them on "Tips on ..." as I did?"
>

Bob,

You surprise me! In many of your posts you claim that others distort
your meaning and therefore choose to print things intact. Yet, in my
post, the omitted "certain qualifiers" change the sense entirely.

I wrote, "Had someone come to you with a data *from a designed study
and* certain research questions about *causality* to address,..."
(*=restored qualifier)

If you are telling me that you would give the "Elbow Rule" lecture to
someone coming to you with "data from a designed study and certain
research questions about causality to address", then I ACCEPT YOUR
CRITICISM UNCONDITIONALLY. (And I will have food for thought, because it
is something I do not see myself doing.) All you have to do is state
unconditionally that you would, in fact, offer the "Elbow Rule" lecture
to someone coming to you with "data from a designed study and certain
research questions about causality to address" and this discussion is
over.

(It should be over anyway. As long as you let my comments stand
unedited, as you request of others, I have nothing to add. Beyond the
first qualifier you added early on to emphasize the focus on predictive
modeling, this exchange must be a distraction from completing your other
notes, which I am eager to read.)

--Jerry

Jerry Dallal

Feb 12, 2006, 8:43:21 AM

While I'll leave the definitive response to Bob, the turn-up behavior is
A Good Thing (tm) because it emphasizes that one is looking at mere noise.

Paige Miller

Feb 12, 2006, 8:48:46 AM
On 2/11/2006 8:07 PM, Jerry Dallal wrote:

> In general, dimensionality reduction techniques--principal components,
> factor analysis, canonical correlation coefficients, and PLS--are
> ill-advised because there's no guarantee that the results will even be
> related to the research question, let alone the answer to it. For
> example, with principal components regression, the response could turn
> out to be essentially the last principal component! All PC regression
> would give you is a set of predictors orthogonal to what you're looking
> for!

Jerry, the advantage of PLS is that it performs dimension reduction by
finding dimensions of X that are predictive of Y. It is impossible in
PLS to have the analogous situation to what you phrase as "the
response could turn out to be essentially the last principal component".

Having said that, there are literally zillions (ok, I'm exaggerating
but I bet there are a thousand) examples of published papers in which
PLS is applied to spectroscopic data, apparently successfully applied
I might add.

Jerry Dallal

Feb 12, 2006, 9:48:20 AM
Paige Miller wrote:
> On 2/11/2006 8:07 PM, Jerry Dallal wrote:
>
>> In general, dimensionality reduction techniques--principal components,
>> factor analysis, canonical correlation coefficients, and PLS--are
>> ill-advised because there's no guarantee that the results will even be
>> related to the research question, let alone the answer to it. For
>> example, with principal components regression, the response could turn
>> out to be essentially the last principal component! All PC regression
>> would give you is a set of predictors orthogonal to what you're
>> looking for!
>
> Jerry, the advantage of PLS is that it performs dimension reduction by
> finding dimensions of X that are predictive of Y. It is impossible in
> PLS to have the analogous situation to what you phrase as "the response
> could turn out to be essentially the last principal component".
>
> Having said that, there are literally zillions (ok, I'm exaggerating but
> I bet there are a thousand) examples of published papers in which PLS is
> applied to spectroscopic data, apparently successfully applied I might add.
>

I'm eager to learn more. If there is one response, it would seem that
PLS finds the linear combination of Xs most highly correlated with the
response. Does it differ from multiple linear regression?

If there are many responses, I get the impression that PLS is canonical
correlation like, which adds the complexity of figuring out what,
exactly, is being predicted, that is, how the linear combination of
outcomes might bear on the problem.

There may well be a thousand examples of published papers in which PLS
is applied successfully to spectroscopic data. I don't know. As I began
my response to Greg, I know nothing about the analysis of spectra so am
not in a position to say how they should be analyzed or judge whether
the applications were successful.

At the risk of making another generalization (risky because no
generalization can be applied uniformly), in the fields where I work, if
a researcher has to resort to purely statistical techniques to come up
with response or prediction scales, s/he probably hasn't thought hard
enough about what s/he is trying to accomplish.

Spectra could be a completely different thing. I have to rely on those
who are familiar with their analysis to decide.

Thanks for the input, though. While my current impression is that PLS
is a kind of "voodoo statistics", I'm willing to be convinced otherwise.

Paige Miller

Feb 12, 2006, 10:56:18 AM
On 2/12/2006 9:48 AM, Jerry Dallal wrote:
> Paige Miller wrote:
>> On 2/11/2006 8:07 PM, Jerry Dallal wrote:
>>
>>> In general, dimensionality reduction techniques--principal
>>> components, factor analysis, canonical correlation coefficients, and
>>> PLS--are ill-advised because there's no guarantee that the results
>>> will even be related to the research question, let alone the answer
>>> to it. For example, with principal components regression, the
>>> response could turn out to be essentially the last principal
>>> component! All PC regression would give you is a set of predictors
>>> orthogonal to what you're looking for!
>>
>> Jerry, the advantage of PLS is that it performs dimension reduction by
>> finding dimensions of X that are predictive of Y. It is impossible in
>> PLS to have the analogous situation to what you phrase as "the
>> response could turn out to be essentially the last principal component".
>>
>> Having said that, there are literally zillions (ok, I'm exaggerating
>> but I bet there are a thousand) examples of published papers in which
>> PLS is applied to spectroscopic data, apparently successfully applied
>> I might add.
>>
>
> I'm eager to learn more. If there is one response, it would seem that
> PLS finds the linear combination of Xs most highly correlated with the
> response. Does it differ from multiple linear regression?

First let me add one additional piece of wording in the PLS
explanation that I gave earlier. Like Principal Components, PLS tries
to find dimensions in X that have highest variability in the X space,
but PLS also requires these directions to be highly predictive of Y.
Thus, there is a tradeoff between high variability in X and predictive
of Y in the PLS algorithm. You can achieve highest variability in X
(that's PCA/PCR), highest predictive ability on Y (that's OLS) or you
can be in the middle -- PLS -- which is a 50-50 tradeoff between X and
Y. (And yes, there are algorithms that let you choose any other
tradeoff between X and Y.)

With one response, a linear combination of the X's is found that is
most highly correlated (technical note, highest squared covariance,
not highest correlation) with the Y values. This usually is *not* the
linear regression solution; therefore the PLS regression is biased,
but one of the advantages is that the predicted values and the PLS
regression coefficients usually have *much* lower MSE than comparable
regression equations. The lower MSE comes from much lower variances of
the estimators, offsetting the increased bias. (Frank, I. and Friedman,
J.H. (1993). A statistical view of some chemometrics regression tools
(with discussion), Technometrics 35(2), 109-148.)
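The covariance-versus-correlation distinction in the technical note can be checked on a toy case. The sketch below assumes simulated, centered data with two correlated predictors of unequal spread. It compares the first PLS weight vector, taken here as the unit vector proportional to X'y, with the unit-length OLS coefficient direction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Two correlated predictors with unequal spread (illustrative data).
z = rng.normal(size=n)
X = np.column_stack([5.0 * z + rng.normal(size=n),
                     z + 0.5 * rng.normal(size=n)])
y = X[:, 1] + rng.normal(scale=0.5, size=n)
X -= X.mean(axis=0)
y -= y.mean()

def unit(v):
    return v / np.linalg.norm(v)

w_pls = unit(X.T @ y)                               # first PLS weight vector
w_ols = unit(np.linalg.lstsq(X, y, rcond=None)[0])  # OLS coefficient direction

def cov_sq(w):
    """Squared sample covariance between the score Xw and y."""
    return float((X @ w) @ y / (n - 1)) ** 2

def corr(w):
    """Sample correlation between the score Xw and y."""
    return float(np.corrcoef(X @ w, y)[0, 1])

print(f"PLS direction: cov^2={cov_sq(w_pls):8.3f}  corr={corr(w_pls):.3f}")
print(f"OLS direction: cov^2={cov_sq(w_ols):8.3f}  corr={corr(w_ols):.3f}")
```

Among unit-length directions, the PLS weight has the larger squared covariance with y and the OLS direction has the larger correlation, which is why the two solutions generally differ.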

The OLS regression answer is related to PLS solutions (and Principal
Components Regression solutions) as follows: if there are a maximum of
N possible dimensions that can be fit via PLS or PCR, then fitting a
PLS (PCR) model using all N dimensions should be mathematically equal
to the OLS regression solution (except for roundoff errors). Since PLS
typically uses fewer than N dimensions, let's say it uses k < N
dimensions, what does that mean? Where does the reduction of variance
come from? The unused N-k dimensions in either PLS or PCR represent
dimensions of X that have low variability. The k dimensions used in
either PLS or PCR have high variability in the X space, or widely
spread out in X space. As we all know regressions are more stable
(lower variability) when the X values are far apart. The unused N-k
dimensions are lower variability, or not widely spread out in the X
space, and as we all know, regressions are less stable (higher
variability) when the X values are close together. So, PLS uses those
dimensions to predict Y that are widely spread in X, thus low
variability. OLS uses all N dimensions of X, so its predictions
use the dimensions that are widely spread in the X space *and* those
that are not widely spread in the X space -- OLS has higher
variability because it uses those dimensions of X that are not widely
spread out in the X space.
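The claim that a PLS fit using every available dimension reproduces the OLS solution can be checked numerically. The sketch below assumes a single response, centered simulated data with full column rank, and a textbook NIPALS-style PLS1 loop. The function name `pls1_coefficients` is hypothetical, and the number of predictors p plays the role of N in the paragraph above.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """NIPALS-style PLS1 for centered X (n x p) and centered y (n,).
    Returns regression coefficients on the original X scale."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)     # weight: max-covariance direction
        t = Xk @ w                 # scores
        tt = t @ t
        p_load = Xk.T @ t / tt     # X loadings
        q_a = yk @ t / tt          # y loading
        Xk = Xk - np.outer(t, p_load)  # deflate X
        yk = yk - q_a * t              # deflate y
        W.append(w)
        P.append(p_load)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Simulated data (assumption): 4 predictors with unequal spread.
rng = np.random.default_rng(4)
n, p = 50, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 2.0, 3.0, 4.0])
y = X @ np.array([1.0, -0.5, 0.25, 2.0]) + rng.normal(size=n)
X -= X.mean(axis=0)
y -= y.mean()

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_pls_1 = pls1_coefficients(X, y, 1)     # k = 1 < p: biased, differs from OLS
b_pls_full = pls1_coefficients(X, y, p)  # all p dimensions: matches OLS

print("OLS        :", np.round(b_ols, 4))
print("PLS, k = 1 :", np.round(b_pls_1, 4))
print("PLS, k = p :", np.round(b_pls_full, 4))
```

Using fewer components than p is where the bias-for-variance trade described above comes in; this example only demonstrates the k = p endpoint and that k = 1 gives a different answer.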

> If there are many responses, I get the impression that PLS is canonical
> correlation like, which adds the complexity of figuring out what,
> exactly, is being predicted, that is, how the linear combination of
> outcomes might bear on the problem.

Yes, when there are 2 or more responses, you get vectors in the X
space that are highly predictive (technical note: highest squared
covariance) of vectors in the Y space. Similar to canonical
correlation. Consider univariate correlation -- no concept of
predictor (independent) or response (dependent) variables is required
for univariate correlation to be meaningful. A multivariate analog in
this situation is canonical correlation. In the univariate case, when
one variable is a predictor and another variable is the response, you
have OLS regression. When there are multiple predictors and multiple
responses, a multivariate analog is PLS. (There are obviously many
multivariate analogs.)

> There may well be a thousand examples of published papers in which PLS
> is applied successfully to spectroscopic data. I don't know. As I began
> my response to Greg, I know nothing about the analysis of spectra so am
> not in a position to say how they should be analyzed or judge whether
> the applications were successful.
>
> At the risk of making another generalization (risky because no
> generalization can be applied uniformly), in the fields where I work, if
> a researcher has to resort to purely statistical techniques to come up
> with response or prediction scales, s/he probably hasn't thought hard
> enough about what s/he is trying to accomplish.
>
> Spectra could be a completely different thing. I have to rely on those
> who are familiar with their analysis to decide.
>
> Thanks for the input, though. While my current impression is that PLS
> is a kind of "voodoo statistics", I'm willing to be convinced otherwise.

Okay, think about the situation where there are many highly correlated
X variables (such as spectroscopy, where the correlations are often
0.99 between variables). Another example is manufacturing data, where
hundreds or thousands of process variables are collected while
something is being manufactured. The correlations may not be so high
as spectroscopy, but you still have hundreds of highly correlated
variables. (See Kresta, J. V., MacGregor, J. F. and Marlin, T. E.
(1991). “Multivariate Statistical Monitoring of Process Operating
Performance”, Canadian Journal of Chemical Engineering, 69, 35–47 for
a good intro to the manufacturing case)

Let's say you have 1000 highly correlated X spectroscopic variables.
PLS is well designed to handle the case where there are highly
correlated X variables. Why? Because it figures out that there really
are not 1000 phenomena that are present. It figures out that there are
really, for example, 4 dimensions that are predictive, not 1000
individual variables. So PLS says that we only have these 4 things
that are predictive, 4 dimensions in the X space. And what are these 4
dimensions? Can they be interpreted? Very often they can be. This is a
huge advantage to the PLS method. In spectroscopy, the dimensions
often correspond to different parts of the spectrum. Scientists
working on this problem usually need to know which parts of the
spectrum are "active", and that information falls right out of PLS, as
easy as pie. PLS also considers the remaining 996 dimensions to be
"noise", or perhaps a better way to say it is variability in X
unrelated to Y. This is also useful to know. And I contend (as do many
people) that valuable answers obtained this way via PLS from these
1000 variables cannot be obtained from OLS methods.
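
To make the "few real dimensions behind many correlated variables" idea concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: 200 variables generated from 4 hypothetical latent factors plus small noise. It uses a plain SVD of X (as PCR would), not PLS itself, which would also use Y to choose the directions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, n_latent = 100, 200, 4

T = rng.normal(size=(n, n_latent))           # hypothetical latent factors
W = rng.normal(size=(n_latent, p))           # hypothetical loadings
X = T @ W + 0.05 * rng.normal(size=(n, p))   # 200 highly correlated variables

Xc = X - X.mean(axis=0)
d = np.linalg.svd(Xc, compute_uv=False)
explained = np.cumsum(d**2) / np.sum(d**2)   # cumulative share of X variance

print("share of X variance in the first 4 dimensions:", explained[3])
```

The remaining dimensions carry almost none of the variability in X, which is the sense in which they can be treated as noise.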

Hope this helps. Ask more if it's not clear.

Graham Jones

unread,
Feb 12, 2006, 11:26:43 AM2/12/06
to
In article <1139680297....@g43g2000cwa.googlegroups.com>, Greg
Heath <he...@alumni.brown.edu> writes

>Nevertheless, how would you advise someone with n =158 cases of
>stellar spectra with measurements at p = 2001 wavelengths? The
>objective is to design a logistic model to predict one of 5 luminosity
>classes.

Hope this helps:

Bor-Chen Kuo and David Landgrebe, Improved Statistics Estimation And
Feature Extraction
For Hyperspectral Data Classification, PhD Thesis and School of
Electrical & Computer
Engineering Technical Report TR-ECE 01-6, December 2001 (88 pages)

Available for download in pdf format from
http://dynamo.ecn.purdue.edu/~landgreb/publications.html.


--
Graham Jones
http://www.visiv.co.uk
Emails to gra...@visiv.co.uk may be deleted as spam
Please add a j just before the @ to ensure delivery

Data Matter

unread,
Feb 12, 2006, 12:10:38 PM2/12/06
to

On that note, would Bob care to elucidate the relationship between the
elbow rule and the various "information criteria" (AIC, BIC, etc.)
which also attempt to find a balance between p and some measure of fit
(deviance)?

Reef Fish

unread,
Feb 12, 2006, 2:38:44 PM2/12/06
to

Jerry Dallal wrote:
> Reef Fish wrote:
> > Jerry Dallal wrote:
> >> Reef Fish wrote:
> >>> Jerry Dallal wrote:
> >>>> Reef Fish wrote:
>
> > Look at your statement purely from the point of view of an English
> > sentence, stating your point about "research questions" <I merely
> > masked some qualifying words>
> >
> > JD> Had someone come to you with a data from <.> certain
> > JD> research questions about <.>, you would not be
> > JD> instructing them on "Tips on Model Building via Multiple
> > JD> Regression. Part I (The Elbow Rule)".
> >
> > That's why I said something to THIS effect,
> >
> > "Why not? Those examples I cited and other projects in my Data
> > Analysis courses were ALL real data from <student-selected>
> > research questions about <various predictive projects>, why
> > would I NOT be instructing them on "Tips on ..." as I did?"
> >
>
> Bob,
>
> You surprise me! In many of your posts you claim that others distort
> your meaning and therefore choose to print things intact. Yet, in my
> post, the omitted "certain qualifiers" change the sense entirely.

Jerry,

My reply to you left EVERY WORD of what you said intact, and
gave you my reply based on what you said (with all your words intact)
and explained WHY I read it the way I did.

You snipped this paragraph that was IMMEDIATELY before the
paragraph you cited, which I was merely re-expressing what I said
from the point of view of an English sentence.

RF> It's the implication of your statements that carried your
RF> UNINTENDED meaning (as you say now) -- but I have to stand on
RF> WHY I read it the way I did, as I had explained in my post.
RF> While English is not my native language, I am quite certain
RF> in this case your written language betrayed your intention.

RF> I'll have to let that stand.

In other words, my interpretation of what you said, with what you said
completely intact, stands. You are commenting OUT OF CONTEXT
now!


> I wrote, "Had someone come to you with a data *from a designed study
> and* certain research questions about *causality* to address,..."
> (*=restored qualifier)
>
> If you are telling me that you would give the "Elbow Rule" lecture to
> someone coming to you with "data from a designed study and certain
> research questions about causality to address", then I ACCEPT YOUR
> CRITICISM UNCONDITIONALLY. (And I will have food for thought, because it
> is something I do not see myself doing.) All you have to do is state
> unconditionally that you would, in fact, offer the "Elbow Rule" lecture
> to someone coming to you with "data from a designed study and certain
> research questions about causality to address" and this discussion is
> over.

There is nothing conditional or unconditional about the "Elbow Rule".
The Elbow Rule may or may not be applicable.

The RELEVANT comment of mine, in response to your taking "research
questions" to mean ONLY "research questions that involve a designed
study involving causality",

was this:

RF> What you really meant was "research questions about causality
RF> based on a designed study".

RF> Then, that's an entirely different setting for a discussion. It's
RF> no longer even in the realm of a discussion in REGRESSION -- because
RF> that may not even be appropriate or applicable!


> (It should be over anyway. As long as you let my comments stand
> unedited, as you request of others, I have nothing to add.

I have shown ABOVE: What you've posted this time complaining
about my edit of what you said was completely OUT OF CONTEXT!

I have given my RELEVANT responses (which you overlooked)
relative to your "intact" statement and your "intact" question about
the "Elbow Rule".

Jerry, this is not the first time, nor will it be the last (I am quite
sure) when your use of the English language and your inattention
to some of the details IN CONTEXT were your problems, not
so much over any of the substance of discussion about STATISTICS.

In short, using your words INTACT,

RF> What you really meant was "research questions about causality
RF> based on a designed study".

Then your question and comment about the "Elbow Rule" was
entirely inappropriate AND out of context, BECAUSE:

RF> Then, that's an entirely different setting for a discussion. It's
RF> no longer even in the realm of a discussion in REGRESSION -- because
RF> that may not even be appropriate or applicable!


I can't make it any clearer or more precise, IN CONTEXT, and the summary
above was completely consistent with everything I said in commenting
on your nebulously stated (again, a problem in English) statement
about

JD> tackling the problem from a different angle.

which turned out to be a DIFFERENT problem, even outside the
realm of a REGRESSION discussion in general (my posts in
particular), while irrelevantly questioning the (possible)
inapplicability of the "Elbow Rule", addressing RESEARCH
questions confined to PREDICTIVE models.

-- Bob.

Reef Fish

unread,
Feb 12, 2006, 2:51:24 PM2/12/06
to

Thanks, Jerry. That would be my response, that it is BETTER than
the RSS which is monotonically decreasing, or its equivalent
counterparts R or R^2 which are monotonically increasing.

The turn-up behavior is definitely an important AID to Dummies <tm>
like a sign posted at the edge of a cliff "NEVER step beyond this
point". Non-Dummies would have stopped long before the
warning sign.

Furthermore, all of the MEANINGFUL consideration of the PRACTICAL
significance of predictive regression are the Prediction Intervals
which are always in units of s, not RSS.
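
The contrast between RSS and s can be seen in a short simulation. This is only a sketch on made-up data (two real predictors followed by ten pure-noise ones); the helper rss_and_s is hypothetical. RSS can never go up as predictors are added to a nested sequence of models, while s = sqrt(RSS / (n - k - 1)) can.

```python
import numpy as np

def rss_and_s(X, y):
    """Fit OLS with an intercept; return (RSS, s) with s = sqrt(RSS / (n - k - 1))."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rss = float(resid @ resid)
    return rss, float(np.sqrt(rss / (n - k - 1)))

rng = np.random.default_rng(2)
n, p = 30, 12
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)   # only 2 real predictors

# Nested models using the first k columns, k = 1, ..., p.
results = [rss_and_s(X[:, :k], y) for k in range(1, p + 1)]
rss = np.array([r[0] for r in results])
s = np.array([r[1] for r in results])

for k in range(1, p + 1):
    print(f"k={k:2d}  RSS={rss[k - 1]:8.3f}  s={s[k - 1]:6.3f}")
```

Plotting s against k from this table gives the kind of curve the Elbow Rule is read from.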

-- Bob.

Reef Fish

unread,
Feb 12, 2006, 3:18:04 PM2/12/06
to

Greg Heath wrote:
> Reef Fish wrote:
>
> -----SNIP
>
> Bob,
>
> My "how to" interpretation of your post is summarized below:
>
> 1. Select an appropriate fitting criterion, FC, to minimize
> 2. Obtain the best model for each value of q (q = 1,2...p)

The assumed model is OLS under standard error assumptions.

> 3. If p <= pthresh (say 10) use all possible regressions

That's a reasonable Rule of Thumb, because most computers
can easily handle the size and cpu constraints of such small p.

> 4. If p > pthresh, use backward elimination

Not quite. Regardless of the size of p, BACK is preferable to
FORW and STEP. For large p, backward elimination may
not be possible even if the storage size constraint allows it.
The X'X matrix may be so ill-conditioned for the FULL model,
because of the over-abundance of redundancies, that one may
not be able to START the backward elimination process
until a sufficiently small size full model is used to start.

> 5. Plot FC vs q and apply the elbow rule.

OLS's FC is Least Squares of residuals, or RSS. I
recommend (see follow-up to Jerry's reply to Joe Jansen.


> Now, a few questions:
>
> 1. What FC do you recommend?

See 1 above.

> 2. Is it worthwhile to use backward stepwise (allowing
> a rejected variable to reappear)?

Almost always No.

> 3. What do you recommend if p is too large to consider
> obtaining p models via a backward search?

Do some preliminary investigation about what variables
to include. I'll do that in Part II, which I've decided to
call the "Elephant Rule", and which I attribute to L J Savage,
my statistical mentor, who once said to me, in his
characteristic colorful way, "If you give me 3 variables,
I can fit an elephant". In that respect, 10 is an
astronomically LARGE number of variables to use
in any regression problem, for many different reasons.


> 4. What do you recommend if p > n and additional
> observations are not available?

See 3. Or use some good commonsense and judgment.
As Jerry asked rhetorically about throwing away data, "what
if I asked for X1, and the measuring instrument returned
X1-X97?" My short answer was, "throw away X2-X97".
>
> Hope this helps.

Hope my answers help, at least for now.
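
The remark above about X'X being too ill-conditioned to start backward elimination can be illustrated with a condition-number check. This is a sketch under invented assumptions: three genuine predictors plus five near-copies of the first one, standing in for redundant measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
base = rng.normal(size=(n, 3))

# Hypothetical redundant measurements: near-copies of the first column.
copies = base[:, [0]] + 1e-6 * rng.normal(size=(n, 5))
X_full = np.column_stack([base, copies])
X_small = base

cond_full = np.linalg.cond(X_full.T @ X_full)
cond_small = np.linalg.cond(X_small.T @ X_small)

print(f"cond(X'X), full model with redundancies: {cond_full:.3e}")
print(f"cond(X'X), reduced model:                {cond_small:.3e}")
```

A very large condition number for the full model is the numerical symptom of the redundancy; trimming the candidate set first brings it back to a workable size.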

-- Bob.

Reef Fish

unread,
Feb 12, 2006, 3:51:15 PM2/12/06
to

Data Matter wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> >
> > -----SNIP
> >
> > Bob,
> >
> > My "how to" interpretation of your post is summarized below:
> >
> > 1. Select an appropriate fitting criterion, FC, to minimize
> > 2. Obtain the best model for each value of q (q = 1,2...p)
> > 3. If p <= pthresh (say 10) use all possible regressions
> > 4. If p > pthresh, use backward elimination
> > 5. Plot FC vs q and apply the elbow rule.
> >
> > Now, a few questions:
> >
> > 1. What FC do you recommend?
> > 2. Is it worthwhile to use backward stepwise (allowing
> > a rejected variable to reappear)?
> > 3. What do you recommend if p is too large to consider
> > obtaining p models via a backward search?
> > 4. What do you recommend if p > n and additional
> > observations are not available?

I have just finished responding to Greg's questions -- which I
consider very good ones that both clarified what I meant (and
why) as well as the reasons for my answers to his questions.


> >
>
> On that note, would Bob care to elucidate the relationship between the
> elbow rule and the various "information criteria" (AIC, BIC, etc.)

In a word: Aaaaaaaaaaaaaaaaaarrrrrrrrrrrrrggggghhh!

> which also attempt to find a balance between p and some measure of fit
> (deviance)?

AIC, BIC are acronyms for Akaike Information Criterion and
Bayesian Information Criterion, used mostly outside the
discipline of statistics.

AIC is also used in a non-statistical context on a web page as
" the Aiaike Information Crite-. rion (AIC) is the most biased
criterion,"

Akaike has attracted a cult of followers with his mystique
as the Maharishi Mahesh Yogi has attracted his followers in
Transcendental Meditation techniques and how to say
"Ooommmmm".

AFAIK, neither has much relevance to the standard OLS
Multiple Regression problem in statistical model building
and variable selection problem.

-- Bob.

Reef Fish

unread,
Feb 12, 2006, 7:11:47 PM2/12/06
to
A serious typo/omission correction.

Reef Fish wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> >
> > -----SNIP
> >
> > Bob,
> >
> > My "how to" interpretation of your post is summarized below:
> >
> > 1. Select an appropriate fitting criterion, FC, to minimize
> > 2. Obtain the best model for each value of q (q = 1,2...p)
>
> The assumed model is OLS under standard error assumptions.
>
> > 3. If p <= pthresh (say 10) use all possible regressions
>
> That's a reasonable Rule of Thumb, because most computers
> can easily handle the size and cpu constraints of such small p.
>
> > 4. If p > pthresh, use backward elimination
>
> Not quite. Regardless of the size of p, BACK is preferable to
> FORW and STEP. For large p, backward elimination may
> not be possible even if the storage size constraint allows it.
> The X'X matrix may be so ill-conditioned for the FULL model
> because of the over-abundance of redundancies, one may
> not be able to START the backward elimination process
> until a sufficiently small size full model is used to start.
>
> > 5. Plot FC vs q and apply the elbow rule.
>
> OLS's FC is Least Squares of residuals, or RSS. I
> recommend (see follow-up to Jerry's reply to Joe Jansen.

I recommended s (see my follow-up to Jerry's reply to Joe Jansen)
vs q as the better plot to use, even though s is NOT the fitting
criterion, FC, which is RSS.
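
The procedure summarized in this exchange (backward elimination, then s plotted against q) might be sketched as follows. This is an illustrative reading, not the poster's own code: the data are simulated, the helpers fit_s and backward_path are hypothetical, and at each step the variable dropped is the one whose removal leaves the smallest s (equivalently the smallest RSS, since q is the same for all candidates at that step).

```python
import numpy as np

def fit_s(X, y, cols):
    """Residual standard deviation s for OLS with an intercept on the given columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(np.sqrt(resid @ resid / (n - len(cols) - 1)))

def backward_path(X, y):
    """Backward elimination; returns {q: (s, columns kept)} for q = p, ..., 1."""
    cols = list(range(X.shape[1]))
    path = {len(cols): (fit_s(X, y, cols), tuple(cols))}
    while len(cols) > 1:
        candidates = [[c for c in cols if c != drop] for drop in cols]
        cols = min(candidates, key=lambda cs: fit_s(X, y, cs))
        path[len(cols)] = (fit_s(X, y, cols), tuple(cols))
    return path

rng = np.random.default_rng(4)
n, p = 40, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)   # columns 0 and 1 matter

path = backward_path(X, y)
for q in sorted(path):
    s_q, kept = path[q]
    print(f"q={q}  s={s_q:.3f}  columns={kept}")
```

The printed s values are what would go on the vertical axis of the elbow plot; choosing where the elbow sits remains a judgment call and is deliberately not automated here.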

Thom

unread,
Feb 13, 2006, 6:54:14 AM2/13/06
to
Reef Fish wrote:
> AIC, BIC are acronyms for Akaike Information Criterion and
> Bayesian Information Criterion, used mostly outside the
> discipline of statistics.
>
> AIC is also used in a non-statistical context on a web page as
> " the Aiaike Information Crite-. rion (AIC) is the most biased
> criterion,"
>
> Akaike has attracted a cult of followers with his mystique
> as the Maharishi Mahesh Yogi has attracted his followers in
> Transcendental Meditation techniques and how to say
> "Ooommmmm".
>
> AFAIK, neither has much relevance to the standard OLS
> Multiple Regression problem in statistical model building
> and variable selection problem.
>
> -- Bob.

That seems a bit OTT given that AIC, BIC etc. are just log-likelihood
with a penalty for number of parameters (and maybe N).

They are extremely useful in some contexts - I'm thinking particularly
of comparing competing models. I've also seen it argued that using AIC
or BIC to produce a weighted average of models produces better
parameter estimates and (presumably) better prediction.
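
For readers who want to see the "log-likelihood with a penalty" form for an ordinary regression, here is a minimal sketch. It assumes normal errors, drops additive constants common to all models, and uses invented RSS values; the function names are hypothetical. The weights shown are the usual Akaike weights used in model averaging.

```python
import numpy as np

def gaussian_ic(rss, n, k):
    """AIC and BIC for a Gaussian linear model with k predictors, up to a shared constant."""
    n_params = k + 2                      # k slopes, intercept, error variance
    fit_term = n * np.log(rss / n)        # -2 * maximized log-likelihood, minus a constant
    return fit_term + 2 * n_params, fit_term + np.log(n) * n_params

def akaike_weights(aic):
    """Relative weights exp(-delta/2), normalized to sum to 1."""
    delta = np.asarray(aic) - np.min(aic)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

n = 50
rss_by_k = {1: 120.0, 2: 60.0, 3: 58.5, 4: 58.0}   # hypothetical RSS values

aic = np.array([gaussian_ic(rss, n, k)[0] for k, rss in rss_by_k.items()])
bic = np.array([gaussian_ic(rss, n, k)[1] for k, rss in rss_by_k.items()])
weights = akaike_weights(aic)

for k, a, b, w in zip(rss_by_k, aic, bic, weights):
    print(f"k={k}  AIC={a:7.2f}  BIC={b:7.2f}  weight={w:.3f}")
```

BIC differs from AIC only in the size of the per-parameter penalty (ln n instead of 2).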

Thom

Greg Heath

unread,
Feb 13, 2006, 8:16:27 AM2/13/06
to

Reef Fish wrote:
------SNIP

> Furthermore, all of the MEANINGFUL consideration of the PRACTICAL
> significance of predictive regression are the Prediction Intervals
> which are always in units of s, not RSS.

Thanks!

That statement is probably the most enlightening (to me) of this whole
discourse.

Greg

Greg Heath

unread,
Feb 13, 2006, 8:47:42 AM2/13/06
to

I'm surprised at this reply since the example in your reference

http://www.itc.virginia.edu/research/talks/sa01_05.pdf

is offered as support to the assertion that AIC is better than
SSE and RMSE = sqrt(SSE/(n-k)) for variable selection.

Hope this helps.

Greg

Reef Fish

unread,
Feb 13, 2006, 9:51:48 AM2/13/06
to

I gave that reference NOT as an endorsement of AIC -- if it
had the merit, I wouldn't have been discussing the Elbow Rule
which does NOT try to optimize anything, but used as a tool
to judge HOW to choose from many NON-optimal models of
RSS (the criterion of Least Squares). In fact, the minimum
RMS would have already gone MUCH too far in the model
selection process.


>
> is offered as support to the assertion that AIC is better than
> SSE and RMSE = sqrt(SSE/(n-k)) for variable selection.

> Hope this helps.

> Greg

Not much, I am afraid.

That's the opinion of cult leader Akaike. I know the other cited
author in the SAS computer document reference well, (Ham)
Bozdogan, through our encounters in several years as members
of the Classification Society (CSNAB).

Suffice to say that I was the Program Chairman of the First
Annual Meeting of the IFCS (International Federation of
Classification Societies) of 7 international societies, which
was held in Bozdogan's home institution, Virginia, in 1989.

Bozdogan was denied the Program Chairmanship of that
Meeting and the Editorship of the Proceedings, both of which
he wanted very badly. Both denials were by the President
of the IFCS, unanimously supported by the President and
Council of the CSNAB (Bob Sokal, author of the well-known
book "Numerical Taxonomy" in classification and clustering)
and offered Bozdogan only the Local Arrangement position.

Bozdogan's unprofessional reaction and antics behind the
scene of that Meeting did not make any friends with members
of the IFCS or CSNAB, nor did he impress anyone in terms
of his statistical or clustering capacity or opinion.

Hope that helps.

-- Bob,

Greg Heath

unread,
Feb 13, 2006, 11:27:53 AM2/13/06
to

Reef Fish wrote:
-----SNIP

> The Elbow Rule, applied to a Multiple Regression problem
> with "standard deviation of residuals"

s = sqrt(SSE/(n-k)), (k = 1,2,...p) ?

> in the vertical axis
> versus the number of predictor variables in the horizontal
> axis turns out to be an immensely useful tool to help
> answer any of the questions I posed earlier.
>
> In Multiple Regression, the plot can actually INCREASE
> after it reached some low point. What that means is
> that by using an additional parameter, the sum of
> squares of the residuals may not decrease enough to
> compensate for the reduction of the denominator from
> n to (n-1)

n to n-k ?

> to make the standard deviation of residuals
> to increase. That's the point beyond which I call the
> "forbidden region" for fitting -- because it is an
> unmistakable sign of "over-fitting".

Hope this helps.

Greg

Reef Fish

unread,
Feb 13, 2006, 12:05:34 PM2/13/06
to

Greg Heath wrote:
> Reef Fish wrote:
> -----SNIP
>
> > The Elbow Rule, applied to a Multiple Regression problem
> > with "standard deviation of residuals"
>
> s = sqrt(SSE/(n-k)), (k = 1,2,...p) ?

Yes, with the slight ambiguity of whether k counts only the number
of predictor X's while a constant term is also fitted -- in the latter
case the denominator for each case would be (n-k-1).


>
> > in the vertical axis
> > versus the number of predictor variables in the horizontal
> > axis turns out to be an immensely useful tool to help
> > answer any of the questions I posed earlier.
> >
> > In Multiple Regression, the plot can actually INCREASE
> > after it reached some low point. What that means is
> > that by using an additional parameter, the sum of
> > squares of the residuals may not decrease enough to
> > compensate for the reduction of the denominator from
> > n to (n-1)
>
> n to n-k ?

There, the n was obviously not the sample size n, but the
denominator df. Perhaps I should have used df to (df -1)
to say for each additional X brought into the model, the
degree of freedom is decreased by 1.

So, for the number of predictor variables k ranging from 1 to p,
the denominators of s^2 would have been

(n-2), (n-3), ... (n - p - 1).
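
Written out, with k predictors plus a constant term (the notation RSS_k for the residual sum of squares of the k-predictor model is introduced here for illustration), the turn-up condition is a short piece of algebra:

```latex
s_k = \sqrt{\frac{\mathrm{RSS}_k}{n-k-1}}, \qquad k = 1, 2, \ldots, p

% s goes UP when one more predictor is added exactly when
s_{k+1} > s_k
\iff \frac{\mathrm{RSS}_{k+1}}{n-k-2} > \frac{\mathrm{RSS}_k}{n-k-1}
\iff F = \frac{\mathrm{RSS}_k - \mathrm{RSS}_{k+1}}{\mathrm{RSS}_{k+1}/(n-k-2)} < 1
```

In words: s increases whenever the partial F statistic for the added variable is below 1, that is, whenever the drop in RSS is smaller than what one lost degree of freedom costs.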


> > to make the standard deviation of residuals
> > to increase. That's the point beyond which I call the
> > "forbidden region" for fitting -- because it is an
> > unmistakable sign of "over-fitting".

As I had said in a related post, it's this turn-up behavior
that serves as a WARNING sign posted in front of a
cliff to warn those who have already gone too far.

If one were to OPTIMIZE the RMS, one would not have
stopped until that lowest point -- which was already much
too far, in the light of the "Elephant Rule".

Hope that helps you put together the idea about why AIC is
no good as an empirical model-building tool, while the
"Elbow Rule" is a much better one.

BTW, regarding that SAS webpage that recommended the
AIC, I should have pointed out I referenced that simply as
an EXAMPLE of the usage of search algorithms, as well as
several alternative "performance measures" against which
the dimension p can be plotted.

-- Bob.

Reef Fish

unread,
Feb 13, 2006, 12:36:34 PM2/13/06
to

Thom wrote:
> Reef Fish wrote:
> > AIC, BIC are acronyms for Akaike Information Criterion and
> > Bayesian Information Criterion, used mostly outside the
> > discipline of statistics.
> >
> > AIC is also used in a non-statistical context on a web page as
> > " the Aiaike Information Crite-. rion (AIC) is the most biased
> > criterion,"
> >
> > Akaike has attracted a cult of followers with his mystique
> > as the Maharishi Mahesh Yogi has attracted his followers in
> > Transcendental Meditation techniques and how to say
> > "Ooommmmm".
> >
> > AFAIK, neither has much relevance to the standard OLS
> > Multiple Regression problem in statistical model building
> > and variable selection problem.
> >
> > -- Bob.
>
> That seems a bit OTT given that AIC, BIC etc. are just log-likelihood
> with a penalty for number of parameters (and maybe N).

Thom,

Your follow-up post almost fell through the cracks because I am
reading the threads from the google archives, which show only the
LATEST posts, and one would have to go back to the chronological
order of all posts to see the entire history.

BIC is clearly not appropriate because there's nothing that is
technically Bayesian in the OLS fit.

AIC is formally related to the Kullback-Leibler information number
and the log-likelihood functions of two densities f(y) and g(y), neither
of which corresponds to the OLS solution in the ordinary regression.

http://tinyurl.com/a5cm6


> They are extremely useful in some contexts - I'm thinking particularly
> of comparing competing models.

Yes, in the context of competing models of entirely different form,
such as the dozen or so different criteria that COULD be used to
find a regression model, OLS, MLE, LAD, robust, nonlinear, etc.

That gets far afield from the same linear model under the OLS
criterion, on the number of TERMS to use in the predictor model,
among the same set of predicting candidate variables.


> I've also seen it argued that using AIC
> or BIC to produce a weighted average of models produces better
> parameter estimates and (presumably) better prediction.
>
> Thom

"Better" is an undefined term in your usage. A better MLE solution
or a better MSE solution will not be a "better" Least Squares solution,
by merely noting the criterion of estimation.

The AIC is more a mathematical statistics toy that generates some
papers for the academician, than a truly practical method of addressing
the usefulness of a fitted model.

That's my opinion on the matter of course.

-- Bob.

Anon.

unread,
Feb 13, 2006, 1:12:23 PM2/13/06
to
*sigh* Just assume normality....

To bring it back to the original post (the suggestion of plotting the
standard deviation of residuals against p and looking for a
change-point), the decrease in s after the changepoint is just due
to random associations with variables. If we could find a relationship
for how large this change is (i.e. the slope of the relationship with
p), then we could add that to s, and find the minimum. i.e. we penalise
model fit with a complexity term. Of course, this is what AIC and BIC
do, but they use s^2 rather than s, and they differ in how large they
think the term should be.

Bob

--
Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax: +358-9-191 51400
WWW: http://www.RNI.Helsinki.FI/~boh/
Journal of Negative Results - EEB: www.jnr-eeb.org

Reef Fish

unread,
Feb 13, 2006, 2:01:54 PM2/13/06
to

No. That's not even the salient point. < sigh >

>
> To bring it back to the original post (the suggestion of plotting the
> standard deviation of residuals against p and looking for a
> change-point), the decrease in s after the changepoint is just due
> to random associations with variables. If we could find a relationship
> for how large this change is (i.e. the slope of the relationship with
> p), then we could add that to s, and find the minimum. i.e. we penalise
> model fit with a complexity term. Of course, this is what AIC and BIC
> do, but they use s^2 rather than s, and they differ in how large they
> think the term should be.
>

> Bob O'Hara
> Department of Mathematics and Statistics
> P.O. Box 68 (Gustaf Hällströmin katu 2b)
> FIN-00014 University of Helsinki
> Finland

Anon Bob O'Hara,

You are indeed true to form in your post. Your paragraph merely
showed that you missed ALL the important points I've discussed. Why
s is better than s^2, why the "Elbow Rule" does something entirely
different from the AIC and BIC, why the slope (of the piecewise
linear plot) was not even explicitly discussed, and why finding
the MINIMUM (of anything) is NOT the primary objective.

In short, Bob O'Hara, you missed EVERYTHING in this discussion
of empirical model building in Applied Statistics, and confused it
with some irrelevant mathematical statistics terms and considerations.

-- Reef Fish Bob.

Anon.

unread,
Feb 13, 2006, 2:35:44 PM2/13/06
to
Huh? It's not salient that OLS is _identical_ to ML when normality is
assumed? And hence K-L is a measure of the distance between two OLS fits?
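
The equivalence being invoked can be written out in a few lines. This is a standard textbook derivation, added as a sketch under the assumption of independent normal errors with common variance; K below denotes the number of estimated parameters and is notation introduced here.

```latex
\ell(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\mathrm{RSS}(\beta)}{2\sigma^2},
\qquad \mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2

% For any fixed sigma^2, maximizing over beta is the same as minimizing RSS(beta): the OLS fit.
\hat\sigma^2 = \frac{\mathrm{RSS}(\hat\beta)}{n}, \qquad
-2\,\ell(\hat\beta, \hat\sigma^2) = n \ln\frac{\mathrm{RSS}(\hat\beta)}{n} + n\,(1 + \ln 2\pi)

\mathrm{AIC} = -2\,\ell(\hat\beta, \hat\sigma^2) + 2K
             = n \ln\frac{\mathrm{RSS}(\hat\beta)}{n} + 2K + \text{const}
```

So under normality AIC is a function of RSS (equivalently of s^2) plus a penalty that grows with the number of parameters.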

>>To bring it back to the original post (the suggestion of plotting the
>>standard deviation of residuals against p and looking for a
>>change-point), the decrease in s after the changepoint is just due
>>to random associations with variables. If we could find a relationship
>>for how large this change is (i.e. the slope of the relationship with
>>p), then we could add that to s, and find the minimum. i.e. we penalise
>>model fit with a complexity term. Of course, this is what AIC and BIC
>>do, but they use s^2 rather than s, and they differ in how large they
>>think the term should be.
>>
>>Bob O'Hara
>>Department of Mathematics and Statistics
>>P.O. Box 68 (Gustaf Hällströmin katu 2b)
>>FIN-00014 University of Helsinki
>>Finland
>
>
> Anon Bob O'Hara,
>
> You are indeed true to form in your post. Your paragraph merely
> showed that you missed ALL the important points I've discussed. Why
> s is better than s^2, why the "Elbow Rule" does something entirely
> different from the AIC and BIC, why the slope (of the piecewise
> linear plot) was not even explicitly discussed, and why finding
> the MINIMUM (of anything) is NOT the primary objective.
>
> In short, Bob O'Hara, you missed EVERYTHING in this discussion
> of empirical model building in Applied Statistics, and confused it
> with some irrelevant mathematical statistics terms and considerations.
>

Could you please explain why I'm so wrong, rather than just stating it?
I was simply making the point that the *IC approach is a formalisation
of what you were suggesting. The difference is that there is some
theory behind AIC, whereas your suggestion is _ad hoc_. Now, there may
be problems with the theory, in which case I would ask you to show where
the problems are. It may also be that what you are suggesting works
better than AIC, BIC etc. In that case, I would hope that you could
demonstrate its superiority.

Incidentally, AIC is not irrelevant to applied statistics: it's actually
used in applied statistics. Look at the citations of Burnham and
Anderson's book:
<http://scholar.google.com/scholar?q=author%3Aburnham+author%3Aanderson&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search>

Bob

--

Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland

Telephone: +358-9-191 51479

Reef Fish

unread,
Feb 13, 2006, 3:20:03 PM2/13/06
to

Anon. Bob O'Hara wrote:

> Could you please explain why I'm so wrong, rather than just stating it.

I already did, even in the post you cited:

> > You are indeed true to form in your post. Your paragraph merely
> > showed that you missed ALL the important points I've discussed. Why
> > s is better than s^2, why the "Elbow Rule" does something entirely
> > different from the AIC and BIC, why the slope (of the piecewise
> > linear plot) was not even explicitly discussed, and why finding
> > the MINIMUM (of anything) is NOT the primary objective.
> >
> > In short, Bob O'Hara, you missed EVERYTHING in this discussion
> > of empirical model building in Applied Statistics, and confused it
> > with some irrelevant mathematical statistics terms and considerations.


Just go back and re-read what I have posted that explained all those
WHYS.

I think you are suffering from the same problem as you had in the
Linear Models thread. Your inability to look up posts in the archives
to find out what had already been posted.

This time I am NOT going to repeat what I had already explained,
unless someone cited my explanation and asked for clarification
of further details.

-- Reef Fish Bob.

Anon.

unread,
Feb 14, 2006, 2:18:09 AM2/14/06
to
Reef Fish wrote:
> Anon. Bob O'Hara wrote:
>
>
>>Could you please explain why I'm so wrong, rather than just stating it.
>
>
> I already did, even in the post you cited:
>
OK, let's look for the explanation (all quotes are from messages from
Reef Fish):

>
>>>You are indeed true to form in your post. Your paragraph merely
>>>showed that you missed ALL the important points I've discussed. Why
>>>s is better than s^2,

None. Just a statement. The only explanation is this (from 12/02/2006
22:51):


"Furthermore, all of the MEANINGFUL consideration of the PRACTICAL
significance of predictive regression are the Prediction Intervals
which are always in units of s, not RSS."

which is false (what about R^2?).

So, please explain in more detail.

why the "Elbow Rule" does something entirely
>>>different from the AIC and BIC,

Again, a statement, no explanation.
There's one from 13/02/2006 16:51:


"I gave that reference NOT as an endorsement of AIC -- if it
had the merit, I wouldn't have been discussing the Elbow Rule
which does NOT try to optimize anything, but used as a tool
to judge HOW to choose from many NON-optimal models of
RSS (the criterion of Least Squares). In fact, the minimum
RMS would have already gone MUCH too far in the model
selection process."

So, the only thing it does differently is not to optimise, but to filter
out badly fitting models. BUT in the original post we see this:
"If the number of predictor candidates is small to moderate,
such as 10, doing ALL possible subsets and then plot
the s (SE) of the best one or two of each dimension on
the s versus k (number of indep vars) plot will enable one
to use the Elbow Rule, knowing the actual "best fit"
inhttp://www.itc.virginia.edu/research/talks/sa01_05.pdf
each dimension, while allowing a SUBJECTIVE override
of choosing some combination which doesn't fit best, but
have much better behavior of the residuals, or some
other external criteria to choose as the "best" fitted model
to use."

which is talking about finding the "best" fitted model to use. Is not
finding the "best" surely an optimisation process?

Incidentally, there are also guidelines about what the size of the
difference between AICs means in terms of the model fit, so it's
perfectly possible (and indeed advisable!) to use AIC to find several
adequate models, and then choose the best based on other criteria (e.g.
after residual checking).

why the slope (of the piecewise
>>>linear plot) was not even explicitly discussed,

Well, you couldn't discuss that. But I appreciate that you did discuss
the turn-up etc. I was discussing the slope to try and give some
insight into the similarity between your method and AIC-type methods.

and why finding
>>>the MINIMUM (of anything) is NOT the primary objective.
>>>

Although finding the "best" apparently is: isn't that a minimisation of
inadequacy? I'm not sure why you feel the need to point this out: I was
not advocating that it was.

>>>In short, Bob O'Hara, you missed EVERYTHING in this discussion
>>>of empirical model building in Applied Statistics, and confused it
>>>with some irrelevant mathematical statistics terms and considerations.
>
>
>
> Just go back and re-read what I have posted that explained all those
> WHYS.
>

Done.

> I think you are suffering from the same problem as you had in the
> Linear Models thread. Your inability to look up posts in the archives
> to find out what had already been posted.
>

Oh, I had read through. Now would you mind actually answering my post,
and actually show that your approach is/can be better than using AIC as
a formal criterion of model adequacy.

I was trying to point out that what you're suggesting is close to what
is done anyway in modern applied statistics, but that it's now more
formal. I hope you appreciate that your method is more subjective, in
particular it assumes that there _is_ a change-point. Using tools like
AIC helps when you have messier situations where there is no
change-point. Or where there are several near-optimal models of
differing dimension: something not uncommon.

Reef Fish

Feb 14, 2006, 4:02:09 AM

Anon. wrote:
> Reef Fish wrote:
> > Anon. Bob O'Hara wrote:
> >
> >
> >>Could you please explain why I'm so wrong, rather than just stating it.
> >
> >
> > I already did, even in the post you cited:
> >
> OK, let's look for the explanation (all quotes are from messages from
> Reef Fish):

Good. Thank you. You should do that every time. It makes it easy for
everyone.

> >
> >>>You are indeed true to form in your post. Your paragraph merely
> >>>showed that you missed ALL the important points I've discussed. Why
> >>>s is better than s^2,
>
> None. Just a statement. The only explanation is this (from 12/02/2006
> 22:51):
> "Furthermore, all of the MEANINGFUL consideration of the PRACTICAL
> significance of predictive regression are the Prediction Intervals
> which are always in units of s, not RSS."
> which is false (what about R^2?).
>
> So, please explain in more detail.

The confidence and prediction intervals (of future observations)
are always of the form: point est. +- (conf. coefficient) * SE
(point est).
The SE of the point estimate is in units of the point estimate and s,
not s^2. R^2 is worthless when it comes to PRACTICAL usefulness.

Review the SPSS Manual example. All the R, R^2 are well over 0.9,
but the prediction intervals were completely useless. This was the
elementary stuff that was covered LAST YEAR, in relation to that
example.
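
For reference, the usual textbook form of the prediction interval for a new observation at x_0 is sketched below. The notation is an assumption, not taken from the thread: X is the design matrix including the intercept column, p is the number of estimated coefficients, and RSS is the residual sum of squares.

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\;n-p}\; s\,\sqrt{1 + x_0^{\top}\,(X^{\top}X)^{-1}\,x_0},
\qquad
s = \sqrt{\frac{\mathrm{RSS}}{n-p}},
\qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i}(y_i - \bar{y})^2}
```

The half-width of the interval scales with s and is therefore in the units of y, whereas R^2 is a unit-free ratio. That is how a fit can have R^2 above 0.9 and still give intervals too wide to be of practical use.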

>
> why the "Elbow Rule" does something entirely
> >>>different from the AIC and BIC,
>
> Again, a statement, no explanation.

The explanation was right THERE. You just missed it!

> There's one from 13/02/2006 16:51:
> "I gave that reference NOT as an endorsement of AIC -- if it
> had the merit, I wouldn't have been discussing the Elbow Rule
> which does NOT try to optimize anything, but used as a tool
> to judge HOW to choose from many NON-optimal models of
> RSS (the criterion of Least Squares). In fact, the minimum
> RMS would have already gone MUCH too far in the model
> selection process."
>

In Empirical Model Building, we are NOT seeking the "optimal"
solution in the mathematical sense. That's why we stop LONG
before we get to the minimum point of s, and I even said sometimes
you would prefer a model with the SAME number of parameters
and with an s that is larger (hence inferior from a mathematical
view point) than one which is smaller. Because this NON-optimal
model (even among the others with the same number of
variables) may behave better in some other way, such as in the
residuals or some other "external" criteria. I used the term
"external"
in that discussion, I am sure. In fact it's in the paragraph BELOW
that you cited.

Anon, Bob O'Hara. These are basic ideas in Data Analysis, It is
NOT a mathematical problem always looking for optimal solutions
in some mathematical criterion. It is in THAT sense that Akaike's
AIC is completely useless.
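
The s-versus-k table behind the Elbow Rule can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic (six candidate predictors, of which only the first two drive y), the helper names `ols_rss` and `best_s_by_size` are made up for this sketch, and s is computed as sqrt(RSS/(n-k-1)) with an intercept in every model. Choosing the elbow, and any subjective override, is left to the analyst looking at the printed table or a plot of it.

```python
import itertools
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares from least squares of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))


def best_s_by_size(predictors, y):
    """For each subset size k, return (s, subset) for the subset with the smallest RSS."""
    n = len(y)
    result = {}
    for k in range(1, len(predictors) + 1):
        rss, subset = min(
            (ols_rss([predictors[j] for j in combo], y), combo)
            for combo in itertools.combinations(range(len(predictors)), k)
        )
        result[k] = (math.sqrt(rss / (n - k - 1)), subset)
    return result


# Synthetic data (an assumption for illustration): y depends on predictors 0 and 1 only.
random.seed(0)
n = 40
predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(6)]
y = [3.0 * predictors[0][i] + 2.0 * predictors[1][i] + random.gauss(0, 1) for i in range(n)]

table = best_s_by_size(predictors, y)
for k, (s, subset) in table.items():
    print(k, round(s, 3), subset)  # plot s against k and look for the bend
```

Note that the smallest RSS can only go down as k grows, while s need not, because of the n-k-1 divisor. The sketch reports only the single best subset per size; keeping the best two or three per size, as suggested in the original post, would only require sorting instead of taking the minimum.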


> So, the only thing it does differently is not to optimise, but to filter
> out badly fitting models. BUT in the original post we see this:
> "If the number of predictor candidates is small to moderate,
> such as 10, doing ALL possible subsets and then plot
> the s (SE) of the best one or two of each dimension on
> the s versus k (number of indep vars) plot will enable one
> to use the Elbow Rule, knowing the actual "best fit"
> inhttp://www.itc.virginia.edu/research/talks/sa01_05.pdf
> each dimension, while allowing a SUBJECTIVE override
> of choosing some combination which doesn't fit best, but
> have much better behavior of the residuals, or some
> other external criteria to choose as the "best" fitted model
> to use."

This cited paragraph is somewhat garbled in the middle,
but the explanation (which I reiterated above) was all there!


> which is talking about finding the "best" fitted model to use. Is not
> finding the "best" surely an optimisation process?

That is correct. The "best" fitted model to use may be the 10th
best in s, or 115th best in SSE (the Least Square Criterion) or
900th best in the AIC.

Perhaps that makes the point for you and a few others who are
accustomed to the "optimization" problem in the mathematical
sense while losing all senses in the PRACTICAL significance
sense.

>
> Incidentally, there are also guidelines about what the size of the
> difference between AICs means in terms of the model fit, so it's
> perfectly possible (and indeed advisable!) to use AIC to find several
> adequate models, and then choose the best based on other criteria (e.g.
> after residual checking).

It's certainly possible, as YOUR external criterion. But that's one which
I (and most other Data Analysts I know) would not touch with a 10-foot
pole.


>
> why the slope (of the piecewise
> >>>linear plot) was not even explicitly discussed,
>
> Well, you couldn't discuss that. But I appreciate that you did discuss
> the turn-up etc. I was discussing the slope to try and give some
> insight into the similarity between your method and AIC-type methods.

Any similarity is entirely accidental and coincidental, because AIC is
a MATHEMATICAL criterion that leaves no room for the Elbow Rule
type of common sense and judgment. That's why SAS could even
program it.

Data Analysis is an ART that no program, even one with built-in
artificial intelligence, can mimic.

>
> and why finding
> >>>the MINIMUM (of anything) is NOT the primary objective.
> >>>

Because there is no "unique minimum" of anything in the "best"
practical prediction model. It may be a combination of seeking
models with reasonably small "percentage errors" (not in the
criterion of Least Squares) as a secondary consideration. The
Data Analyst may seek a model with small "average prediction
intervals" or some other easily understood practical criteria
that are NOT part of the formal analysis in search of a model.

This is perhaps the part that is most difficult for mathematical
statisticians to understand -- that Data Analysis is an ART,
requiring much GOOD judgment and sense, whereas
mathematical statistics is a branch of mathematics NOT
requiring any practical justification.

For example, the requirement of an "unbiased estimate" is the
SILLIEST notion that is left from "classical statistics".

The s we are talking about and other estimates of standard
deviation are all BIASED. So what? s^2 is unbiased, but
hardly ever used for anything, for good reason! Prediction
intervals are in units of s, not s^2.


> Although finding the "best" apparently is: isn't that a minimisation of
> inadequacy? I'm not sure why you feel the need to point this out: I was
> not advocating that it was.

But you seemed to be preoccupied with the "information statistic"
and the optimization in AIC, which is entirely contrary to the spirit
of good Data Analysis. That was why I said:

> >>>In short, Bob O'Hara, you missed EVERYTHING in this discussion
> >>>of empirical model building in Applied Statistics, and confused it
> >>>with some irrelevant mathematical statistics terms and considerations.
> >
> >
> > Just go back and re-read what I have posted that explained all those
> > WHYS.
> >
> Done.

Good. But apparently you need much more "brain-washing" of the
mathematical-statistics GUNK that had been clogging your
Data Analysis thinking counterpart.


>
> > I think you are suffering from the same problem as you had in the
> > Linear Models thread. Your inability to look up posts in the archives
> > to find out what had already been posted.
> >
> Oh, I had read through. Now would you mind actually answering my post,
> and actually show that your approach is/can be better than using AIC as
> a formal criterion of model adequacy.

I have now, and had before now. But I wouldn't be surprised if nothing
gets absorbed. However, I am satisfied that I have given you an
explanation of the same a second time. And that's all the freebie
you are going to get. :-)


Yes Bob, I am afraid you are spinning your mathematical wheel in
the same mud puddle you climbed out of temporarily, only to
fall right back in:


>
> I was trying to point out that what you're suggesting is close to what
> is done anyway in modern applied statistics, but that it's now more
> formal. I hope you appreciate that your method is more subjective, in
> particular it assumes that there _is_ a change-point. Using tools like
> AIC helps when you have messier situations where there is no
> change-point. Or where there are several near-optimal models of
> differing dimension: something not uncommon.

There is NO change point. There is a warning sign of a cliff. Using
Akaike's AIC is no better than a drunk using a lamppost for support
rather than for light or enlightenment.

-- Reef Fish Bob.

Greg Heath

Feb 14, 2006, 4:33:00 AM

Reef Fish wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> > > Data Matter wrote:
> > > > Greg Heath wrote:
> > > > > Reef Fish wrote:
> > > > >
> > > > > -----SNIP

> > I'm surprised at this reply since the example in your reference


> >
> > http://www.itc.virginia.edu/research/talks/sa01_05.pdf
>
> I gave that reference NOT as an endorsement of AIC -- if it
> had the merit, I wouldn't have been discussing the Elbow Rule
> which does NOT try to optimize anything, but used as a tool
> to judge HOW to choose from many NON-optimal models of
> RSS (the criterion of Least Squares). In fact, the minimum
> RMS would have already gone MUCH too far in the model
> selection process.
> >
> > is offered as support to the assertion that AIC is better than
> > SSE and RMSE = sqrt(SSE/(n-k)) for variable selection.

Is your RSS the same as my SSE ( sum squared error) ?
Is your s the same as my RMSE?
What is your RMS? If it is the same as my RMSE and your
s then you are saying that the important point is the elbow
and not the minimum.

Correct?

How do things change when nondesign calibration data are
available to determine p?

Hope this helps.

Greg

Greg Heath

Feb 14, 2006, 5:03:43 AM
Reef Fish wrote:
-----SNIP

> Anon Bob O'Hara,
>
> You are indeed true to form in your post. Your paragraph merely
> showed that you missed ALL the important points I've discussed. Why
> s is better than s^2, why the "Elbow Rule" does something entirely
> different from the AIC and BIC, why the slope (of the piecewise
> linear plot) was not even explicitly discussed, and why finding
> the MINIMUM (of anything) is NOT the primary objective.

Usually, the primary objective is to use a finite design sample to
construct a model that will minimize the population expected value
(PEV) of some evaluation criterion, EC.

The Elbow Rule is used to prevent overfitting caused by trying to
minimize the design set estimate of EC. Typically, the minimum
design set estimate is lower than the PEV and occurs at a value
of p that is greater than that at which the minimum PEV occurs.

What is not clear to me at this time is how things are changed
when a nondesign calibration data set is used to determine p.

Hope this helps.

Greg

Thom

Feb 14, 2006, 8:53:01 AM
I left "better" undefined deliberately - because it will depend on what
the context is and what your goal is. I think I understand your point
more fully now. Some of the plus points of AIC etc. are more evident in
testing competing theories, but I can also see the usefulness of it as
a tool to weight models in prediction, given that simply selecting the
best model can lead to bias (at least that's my reading of Burnham and
Anderson's model averaging work):

Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference:
Understanding AIC and BIC in model selection. Sociological Methods
& Research, 33, 261-304.
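
One common way to turn AIC values into model weights of the kind mentioned above is via "Akaike weights", w_i = exp(-D_i/2) / sum_j exp(-D_j/2), where D_i is the difference between model i's AIC and the smallest AIC. The following is a minimal sketch; the three AIC values are made-up numbers for illustration, not results from any data set in this thread.

```python
import math


def akaike_weights(aic_values):
    """Return (deltas, weights) where deltas are AIC differences from the best model."""
    best = min(aic_values)
    deltas = [a - best for a in aic_values]
    relative = [math.exp(-0.5 * d) for d in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]


# Hypothetical AIC values for three candidate models.
aics = [102.3, 103.1, 110.8]
deltas, weights = akaike_weights(aics)
for a, d, w in zip(aics, deltas, weights):
    print(a, round(d, 2), round(w, 3))
```

Models whose AIC is close to the smallest one receive comparable weights, so no single "best" model has to be declared before predicting.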

Incidentally, in the same paper they argue that BIC isn't Bayesian - in
that both AIC and BIC can be derived from a Bayesian or non-Bayesian
perspective. They argue the real difference is the assumptions about
the 'correct' model that is being selected.

Thom

Anon.

Feb 14, 2006, 11:12:12 AM
Reef Fish wrote:
> Anon. wrote:
>
>>Reef Fish wrote:
>>
>>>Anon. Bob O'Hara wrote:
>>>
<snip>

>>Although finding the "best" apparently is: isn't that a minimisation of
>>inadequacy? I'm not sure why you feel the need to point this out: I was
>>not advocating that it was.
>
>
> But you seemed to be preoccupied with the "information statistic"
> and the optimization in AIC, which is entirely contrary to the spirit
> of
> good Data Analysis. That was why i said:
>
I was only trying to make the point that the information criterion
approach was similar to yours, and explaining why. It provides a
formalisation of the idea you were suggesting. I think that's valuable
because, even if it's wrong, it gives insight into the problem, and
might spur people on to improve it.

There are a couple of advantages of using a statistic like AIC over your
eye-balling method:
1. If there is no "elbow", then your method is useless, whereas AIC will
still provide some information.
2. AIC gives a numerical summary of model adequacy, which makes it
easier to rank and compare models, i.e. to filter out the poorer models.
It's difficult to see how to do that with your method, especially if
the competing adequate models have different dimensionalities.

The frustrating thing is that we both agree that there is more to model
fitting than just the number crunching. I really can't see why you
should have anything against another tool to help the data analysis.
AIC certainly isn't perfect, but at least its main flaws (it tends to
over-fit) are well known, and can be compensated for. How does your
method behave?
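
To make the comparison concrete: for a Gaussian linear model, AIC can be written, up to an additive constant that is the same for every model fitted to the same data, as n*ln(RSS/n) + 2p. The sketch below puts that next to s = sqrt(RSS/(n-p)) for a sequence of models. Here p is taken to be the number of regression coefficients including the intercept, and the RSS values are invented purely for illustration.

```python
import math


def s_and_aic(rss, n, p):
    """Residual standard error and Gaussian AIC (additive constant dropped)."""
    s = math.sqrt(rss / (n - p))
    aic = n * math.log(rss / n) + 2 * p
    return s, aic


n = 50
# Hypothetical RSS of the best model with p coefficients (illustrative numbers only).
rss_by_p = {2: 400.0, 3: 150.0, 4: 140.0, 5: 138.0, 6: 137.0}

results = {p: s_and_aic(rss, n, p) for p, rss in rss_by_p.items()}
for p, (s, aic) in results.items():
    print(p, round(s, 3), round(aic, 2))

best_p = min(results, key=lambda p: results[p][1])
print("smallest AIC at p =", best_p)
```

With these made-up numbers the large drop in s happens between p = 2 and p = 3, while AIC keeps improving slightly for one more term. Whether that extra term is worth having is exactly the kind of judgment being argued about here.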

>>>I think you are suffering from the same problem as you had in the
>>>Linear Models thread. Your inability to look up posts in the archives
>>>to find out what had already been posted.
>>>
>>
>>Oh, I had read through. Now would you mind actually answering my post,
>>and actually show that your approach is/can be better than using AIC as
>>a formal criterion of model adequacy.
>
>
> I have now, and had before now.

You have not given any indication of how well your method will behave in
comparison to others, which is what I would like to see. I want
empirical evidence that AIC and friends don't work, not arm-waving.

Anon.

Feb 14, 2006, 11:25:26 AM
David Spiegelhalter was talking about these ideas at a workshop I
organised this weekend. For hierarchical models and Bayesian model
fitting, he was suggesting that the choice of DIC, AIC or Bayes factors
(equivalent to BIC, sort of) depends on the focus of your prediction.
Hopefully he'll write something up about it soon.

Reef Fish

Feb 14, 2006, 3:38:23 PM

Thom wrote:
> I left "better" undefined deliberately - because it will depend on what
> the context is and what your goal is.

So did I, for the same reasons.

> I think I understand your point more fully now.

You seem to.

> Some of the plus points of AIC etc. are more evident in
> testing competing theories, but I can also see the usefulness of it as
> a tool to weight models in prediction, given that simply selecting the
> best model can lead to bias (at least that's my reading of Burnham and
> Anderson's model averaging work):
>
> Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference:
> Understanding AIC and BIC in model selection. Sociological Methods
> & Research, 33, 261-304.

I have not read that book, but I understand what you said about it.
From a purely mathematical statistics point of view, there is something
said about "testing" competing models. But once you get away from
the hand-cuffs of mathematical statistics, into Data Analysis (see
Tukey's paper on "The Future of Data Analysis" in 1962)

http://www.gap-system.org/~history/Mathematicians/Tukey.html

in which Tukey introduced the term "Data Analysis" to distinguish
that kind of statistics from "confirmatory statistics" (emphasising
testing "as a drunk uses a lamppost for support"), you'll see less and
less importance attached to formal models, tests, MLE, etc., etc.

It was a path-breaking article not only in introducing Data Analysis
and its future, but also in setting records in the journal (Annals of
Math Stat) for LENGTH (over 60 pages, I vaguely recall) and for the
ratio of Exposition (words) to mathematical formulas or symbols,
records COMPARABLE to a long jump of 20 meters which will likely
never be broken, unlike the 8.90 meter record jump of Bob
Beamon in 1968, which astounded the world and shattered the
world record (but was broken in 1991 by Powell's 8.95 meter
jump).


In fact, MLE is about the worst thing one can do in Exploratory
Data Analysis because one would be putting all the eggs into the
ASSUMED likelihood model (also AIC's weakness), and if that
assumed model is wrong, one might be floating "down the s-creek
without a paddle" as the saying goes.

>
> Incidentally, in the same paper they argue that BIC isn't Bayesian - in
> that both AIC and BIC can be derived from a Bayesian or non-Bayesian
> perspective. They argue the real difference is the assumptions about
> the 'correct' model that is being selected.

Most of those THEORIES that are justified from both Bayesian and
non-Bayesian points of view use the likelihood principle based
on "diffuse or uninformative priors", which already makes them
non-Bayesian and reduces the posterior distribution to the same
likelihood function. But the PHILOSOPHY of the use of said
likelihood function is quite different.

There is some "tautology" in your last sentence. How can you tell
what the "correct model" is, based on ASSUMING the model you
use is correct? Re-phrased another way -- how do you know that
MLE is the correct approach if you are not sure what the likelihood
function is? There are other variants of this kind of difference
between the "classical" math-stat approach and the Data Analysis
approach.

-- Bob.

Greg Heath

Feb 14, 2006, 7:50:37 PM
G Robin Edwards wrote:
> In article <dssvgu$a...@phys-news4.kolumbus.fi

> I'm trying to follow this thread, but failing as far as I can tell.
> What would really help would be a **complete** numerical example.
> That is, all data and analyses with an elbow plot, AIC, BIC and all the
> standard stuff like RSS and the inferences that might be drawn from
> them, including further recommended lines of analysis.
>
> Perhaps there's a site or a book that provides something of the kind.
>
> Any suggestions?

1. Crosspost to all 3 sci.stat.* groups.
2. Agree on objectives, e.g.,
a. Deduction of the "optimum" number of variables?
b. Selection of the "optimum" combination of variables?
c. Minimization of the population expected value of some
evaluation criterion?
3. Agree on data structure, e.g.,
a. Number of candidate variables
b. Correlation coefficient matrix
c. Variable probability distributions
d. Number of training cases
4. Agree on measures of performance
etc.

I'm sure this has been done many times over. It would
be nice if someone could provide specific references
and/or run a computer experiment.

Hope this helps.

Greg

Reef Fish

Feb 14, 2006, 8:43:32 PM

Greg Heath wrote:
> G Robin Edwards wrote:
> > In article <dssvgu$a...@phys-news4.kolumbus.fi
> >, Anon. <bob.oh...@NOSPAMhelsinki.fi> wrote:
> > > Reef Fish wrote:
> > > > Anon. wrote:
> > > >>Reef Fish wrote:
> > > >>>Anon. Bob O'Hara wrote:
> > > <snip>
> > Any suggestions?
>
> 1. Crosspost to all 3 sci.stat.* groups.

This part is fine, because the topics seem appropriate for all
three groups. Those in any group who do not wish to respond or read
can simply skip it.

> 2. Agree on objectives, e.g.,
> a. Deduction of the "optimum" number of variables?
> b. Selection of the "optimum" combination of variables?
> c. Minimization of the population expected value of some
> evaluation criterion?

This would NOT fit under my discussion of Model Building in
Data Analysis. Because it's NONE of the above. Others
may start different threads with those criteria. "optimum" is
an undefined term in Data Analysis.

I can only comment on the data I considered:

> 3. Agree on data structure, e.g.,
> a. Number of candidate variables

Any number. With raw data.

> b. Correlation coefficient matrix

Insufficient without raw data. Unnecessary with data.

> c. Variable probability distributions

None necessary, except assumption on the error distribution.

> d. Number of training cases

Training for what?

> 4. Agree on measures of performance
> etc.

No such thing in the Data Analysis approach.

> I'm sure this has been done many times over. It would
> be nice if someone could provide specific references
> and/or run a computer experiment.
>
> Hope this helps.

No. JMO.

-- Bob.

Reef Fish

Feb 14, 2006, 10:35:56 PM

Greg Heath wrote:
> Reef Fish wrote:
> > Greg Heath wrote:
> > > Reef Fish wrote:
> > > > Data Matter wrote:
> > > > > Greg Heath wrote:
> > > > > > Reef Fish wrote:
> > > > > >
> > > > > > -----SNIP
>
> > > I'm surprised at this reply since the example in your reference
> > >
> > > http://www.itc.virginia.edu/research/talks/sa01_05.pdf
> >
> > I gave that reference NOT as an endorsement of AIC -- if it
> > had the merit, I wouldn't have been discussing the Elbow Rule
> > which does NOT try to optimize anything, but used as a tool
> > to judge HOW to choose from many NON-optimal models of
> > RSS (the criterion of Least Squares). In fact, the minimum
> > RMS would have already gone MUCH too far in the model
> > selection process.
> > >
> > > is offered as support to the assertion that AIC is better than
> > > SSE and RMSE = sqrt(SSE/(n-k)) for variable selection.
>
> Is your RSS the same as my SSE ( sum squared error) ?

Yes. Residual Sum of Squares and Sum of Squares of Errors.

> Is your s the same as my RMSE?

Yes, s for standard deviation of residuals and RMSE for root MSE

In general, there might be instances that these are used to
distinguish the estimated sigma and the theoretical sigma.
But in the context of the discussion, there was never any
implication that anything was NOT the estimates based on data.
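
Since several abbreviations are in play, here is a small sketch of how the quantities relate, using the formula RMSE = sqrt(SSE/(n-k)) quoted above. The residuals are made-up numbers and `n_coefficients` is a hypothetical parameter name standing for the k in that formula.

```python
import math


def residual_summaries(residuals, n_coefficients):
    """Return (SSE, MSE, s) for a list of residuals from a fitted model."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)   # RSS / SSE: sum of squared residuals
    mse = sse / (n - n_coefficients)      # mean square error, i.e. s^2
    return sse, mse, math.sqrt(mse)       # s, i.e. root MSE


# Hypothetical residuals from a model with two estimated coefficients.
sse, mse, s = residual_summaries([1.0, -2.0, 0.5, 1.5, -1.0, 0.0], n_coefficients=2)
print(sse, mse, round(s, 4))
```

Only s is in the units of the response, which is why it is the quantity plotted against the number of variables.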


> What is your RMS? If it is the same as my RMSE and your
> s then you are saying that the important point is the elbow
> and not the minimum.
>
> Correct?

If I had used RMS, it would have meant Residual Mean Square,
which is the same as MSE, the Mean Square Error.

So RMS would have been s^2, while RMSE (root MSE) is s.

The Elbow is generally a more important point than the minimum,
though they may coincide; and the Elbow carries some implication
about the "degree of bend", unless you can think of someone with
several elbows in one arm. So, the Elbow is meant to imply
the most obvious BEND point.

>
> How do things change when nondesign calibration data are
> available to determine p?

Nothing is assumed about whether the p is designed or
non-designed. A wrong is a wrong. If someone designed
something specifically to involve 9 variables, and 2 of them turned
out to be redundant in the sense of multicollinearity's undesirable
effects, then the Elbow Rule would suggest or dictate that two
of the designed variables be thrown out of the model.

In the respect above, the design/non-design question is really
irrelevant and a red herring in the model building process of
getting rid of redundant variables.

-- Bob.

Greg Heath

Feb 15, 2006, 1:28:28 AM

Sorry for the miscommunication. I misused terminology. The
terminology used below is typically used in neural network (NN)
design.

I am considering the scenario commonly used in nonlinear NN
regression where the data is partitioned into 3 distinct subsets:

1. The design training set from which, given q (q = 1,2,...,p)
centered variables, the q weights are estimated.
2. The design validation set from which, given p trained
candidate models (q = 1,2,...,p) and a selection criterion, a
model with q = Q (Q <= p) variables is selected.
3. A nondesign test set from which the population expected
value of evaluation criteria are estimated.

Your original discussion was w.r.t. using training data and
s1 = sqrt(SSE(q)/(n-q)) to estimate Q using the Elbow Rule.

Now a second estimate of s using nontraining data is
s2 = sqrt(SSE(q)/n).

Now I will modify my question to read:

How do things change when nontraining validation data are
available to determine Q?

Unless the training data set is very large, s2 vs q will exhibit
a minimum. The rule of thumb generally used in NN design
is to choose Q to be at this minimum.
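
The comparison between s1 (training) and s2 (validation) can be sketched as below. Everything here is an assumption made for illustration: the data are synthetic with mean-zero predictors (standing in for centered variables), only the first two of eight variables affect y, the variables enter in a fixed nested order, and the helper names `fit`, `sse` and `make_data` are invented for this sketch.

```python
import math
import random


def fit(X, y):
    """Least-squares weights for y on the columns of X (no intercept), via normal equations."""
    n, q = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(q)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(q)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(q):
        pivot = max(range(c, q), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(q):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[r][q] / A[r][r] for r in range(q)]


def sse(X, y, w):
    """Sum of squared errors of the linear predictor with weights w."""
    return sum((yi - sum(wj * xj for wj, xj in zip(w, xi))) ** 2 for xi, yi in zip(X, y))


def make_data(n, p, rng):
    """Synthetic data: y depends on the first two of p mean-zero predictors."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * x[0] + 1.0 * x[1] + rng.gauss(0, 1) for x in X]
    return X, y


rng = random.Random(1)
p, n_train, n_val = 8, 30, 30
X_tr, y_tr = make_data(n_train, p, rng)
X_va, y_va = make_data(n_val, p, rng)

rows = []
for q in range(1, p + 1):
    Xq_tr = [x[:q] for x in X_tr]   # first q variables, training set
    Xq_va = [x[:q] for x in X_va]   # same variables, validation set
    w = fit(Xq_tr, y_tr)
    s1 = math.sqrt(sse(Xq_tr, y_tr, w) / (n_train - q))  # training estimate
    s2 = math.sqrt(sse(Xq_va, y_va, w) / n_val)          # validation estimate
    rows.append((q, s1, s2))
    print(q, round(s1, 3), round(s2, 3))
```

Training SSE can only decrease as q grows, whereas the validation s2 is free to turn back up. Comparing where s1 bends with where s2 bottoms out is one way to look at the question being asked.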

I infer from your previous posts that, w.r.t. linear regression,
this might be overfitting.

However, I'd prefer to get your comments on this.

Hope this helps.

Greg

Greg Heath

Feb 15, 2006, 4:18:06 AM

Reef Fish wrote:
> Greg Heath wrote:
> > G Robin Edwards wrote:
> > > In article <dssvgu$a...@phys-news4.kolumbus.fi
> > >, Anon. <bob.oh...@NOSPAMhelsinki.fi> wrote:
> > > > Reef Fish wrote:
> > > > > Anon. wrote:
> > > > >>Reef Fish wrote:
> > > > >>>Anon. Bob O'Hara wrote:
> > > > <snip>
> > > Any suggestions?
> >
> > 1. Crosspost to all 3 sci.stat.* groups.
>
> This part is fine, because the topics seem appropriate for all
> three groups. Those in any group not wish to respond or read,
> simply skip.
>
> > 2. Agree on objectives, e.g.,
> > a. Deduction of the "optimum" number of variables?
> > b. Selection of the "optimum" combination of variables?
> > c. Minimization of the population expected value of some
> > evaluation criterion?
>
> This would NOT fit under my discussion of Model Building in
> Data Analysis. Because it's NONE of the above. Other
> may start different threads with those criteria. "optimum" is
> an undefined term in Data Analysis.

It does fit w.r.t. replies that want a comparison of the Elbow
Rule with AIC, BIC, etc

> I can only comment on the data I considered:

Yes. But there were no comparisons. I agree with the value
of the scree plot and Elbow Rule. However, there are questions
w.r.t. how does that compare with other criteria that are used.

> > 3. Agree on data structure, e.g.,
> > a. Number of candidate variables
>
> Any number. With raw data.

I would think you would want at least 4 variables but less than
15. Not sure what you mean by "raw" data. Do you mean
real world data?

> > b. Correlation coefficient matrix
>
> Insufficient without raw data. Unnecessary with data.

Data doesn't tell you what the population correlations are.
In fact, that's the main drawback of greedy searches, they
are driven by misleading spurious enhancements in
sample correlations.

> > c. Variable probability distributions
>
> None necessary, except assumption on the error distribution.
>
> > d. Number of training cases
>
> Training for what?

Sorry, NN terminology ( Iterative algorithms estimate weights
using training data). Try

d. Number of design observations

> > 4. Agree on measures of performance
> > etc.
> No such thing on Data Analysis approach.

Referring to comparing the use of AIC, BIC , etc to s.

> > I'm sure this has been done many times over. It would
> > be nice if someone could provide specific references
> > and/or run a computer experiment.
> >
> > Hope this helps.
>
> No. JMO.

Probably because you didn't recognize that I was trying to
make sure that there was some agreement w.r.t. how to
compare s and Elbow with whatever alternatives others
wanted to suggest.

Hope this helps.

Greg

Reef Fish

Feb 15, 2006, 1:53:45 PM
My latest posts in the "Frank Harrell's 9 Points of 1995" in
response to Jerry Dallal and Bruce Weaver should have
answered all your questions implicitly.

You've been wrongly criticizing a COMPUTATIONAL
method as if it were a formal statistical inference method.

My short comments below pertain only to points that are
clear and easy.

Why would you make that arbitrary choice of 4 to 15? You
would have already badly violated Savage's Elephant Rule
and the principle of parsimony.

You claim to analyze data, and have not heard of the term
"raw data"? I used it simply to mean NOT data such as
the correlation matrix that can be derived from the "raw data",
the "original basic data" whether it's real or imaginary.

>
> > > b. Correlation coefficient matrix
> >
> > Insufficient without raw data. Unnecessary with data.
>
> Data doesn't tell you what the population correlations are.
> In fact, that's the main drawback of greedy searches, they
> are driven by misleading spurious enhancements in
> sample correlations.

Since when does the "usual assumption" in a multiple regression
care about the population correlations of the X's? The model
is CONDITIONED on the given X's.

>
> > > c. Variable probability distributions
> >
> > None necessary, except assumption on the error distribution.

You need to review the "usual assumptions" in a multiple regression
problem.


> >
> > > d. Number of training cases
> >
> > Training for what?

I am familiar with the misuse of that term. :-) That's my joke
about your mistaken notion of inference when everything is
conditioned on the actual SINGLE SAMPLE used.

>
> Sorry, NN terminology ( Iterative algorithms estimate weights
> using training data). Try
>
> d. Number of design observations
>
> > > 4. Agree on measures of performance
> > > etc.
> > No such thing on Data Analysis approach.
>
> Referring to comparing the use of AIC, BIC, etc. to s.

AIC, BIC are inferential approaches. NOT computational methods
for OLS under the usual assumptions.

>
> > > I'm sure this has been done many times over. It would
> > > be nice if someone could provide specific references
> > > and/or run a computer experiment.
> > >
> > > Hope this helps.
> >
> > No. JMO.

The above and what's in my response to Jerry and Bruce
are the more detailed reasons why.

>
> Probably because you didn't recognize that I was trying to
> make sure that there was some agreement w.r.t. how to
> compare s and Elbow with whatever alternatives others
> wanted to suggest.

I recognize what you're TRYING to do. But you seem
to overlook what I explained about Tukey's brand of Exploratory
Data Analysis, as opposed to formal inference, relating to the
computational methods and the Elbow Rule that helps the
selection of actual variables (and how many) to use in a
given problem.

-- Bob.

Greg Heath

Feb 15, 2006, 10:22:47 PM

You miss the point. How can I effectively compare the Elbow
Rule with alternatives with only 3 variables? And why would I
need more than 15?

> You claim to analyze data, and have not heard of the term
> "raw data"?

During my 47 years of data analysis I've never heard of the term.
However, I do believe in the Easter Bunny.

> I used it simply to mean NOT data such as
> the correlation matrix that can be derived from the "raw data",
> the "original basic data" whether it's real or imaginary.

> > > > b. Correlation coefficient matrix
> > >
> > > Insufficient without raw data. Unnecessary with data.
> >
> > Data doesn't tell you what the population correlations are.
> > In fact, that's the main drawback of greedy searches, they
> > are driven by misleading spurious enhancements in
> > sample correlations.
>
> Since when does the "usual assumption" in a multiple regression
> care about the population correlations of the X's? The model
> is CONDITIONED on the given X's.

In setting up a controlled experiment to compare the Elbow Rule
with alternatives, using computer-generated rather than real-world
data is obviously preferable. This involves determining
population probability distributions and x-y and x-x correlations
that will yield the desired effects when the sampled data are generated
and used to create a regression model.
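
The kind of setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: multivariate normal predictors with a common pairwise population correlation, a linear response and normal errors. The function name `simulate_regression_data`, the coefficient values and every other setting are hypothetical choices, not parameters anyone in this thread agreed on.

```python
import numpy as np

def simulate_regression_data(n, rho, beta, sigma, seed=0):
    """Draw n cases with len(beta) predictors and one response.

    Assumptions (illustrative only): predictors are multivariate normal
    with unit variances and common population correlation rho; the
    response is linear in the predictors with N(0, sigma^2) errors.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    # Population x-x correlation matrix: 1 on the diagonal, rho elsewhere.
    corr = np.full((p, p), rho)
    np.fill_diagonal(corr, 1.0)
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    return X, y

# Hypothetical design: 6 candidate predictors, only the first 3 matter.
X, y = simulate_regression_data(n=5000, rho=0.5,
                                beta=[2.0, 1.0, 0.5, 0.0, 0.0, 0.0],
                                sigma=1.0)
sample_corr = np.corrcoef(X, rowvar=False)
# The sample correlation is close to, but not equal to, the population rho.
print(np.round(sample_corr[0, 1], 2))
```

With a small n the sample correlations can drift well away from the population values, which is the effect on greedy searches raised earlier in this exchange.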

> > > > c. Variable probability distributions
> > >
> > > None necessary, except assumption on the error distribution.
>
> You need to review the "usual assumptions" in a multiple regression
> problem.

No. The need is for those who take the comparison challenge to
agree on the details of the experiment before any computing
is done.

> > > > d. Number of training cases
> > >
> > > Training for what?
>
> I am familiar with the misuse of that term. :-) That's my joke
> about your mistaken notion of inference when everything is
> conditioned on the actual SINGLE SAMPLE used.

That's where we differ. I'm an engineer (retired) who is looking
for the best method to choose a parsimonious model that will
have low prediction error when applied to out-of-sample data.
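
One simple way to measure the out-of-sample error mentioned above is a holdout split. The sketch below is illustrative only: the data are simulated, the helper `holdout_rmse` is a hypothetical name, and the candidate variables are entered in a fixed order rather than by any selection method discussed in this thread.

```python
import numpy as np

def holdout_rmse(X, y, n_design, k):
    """Fit OLS with intercept on the first n_design cases using the first
    k columns of X, then return RMSE on the remaining (held-out) cases."""
    A = np.column_stack([np.ones(len(y)), X[:, :k]])
    coef, *_ = np.linalg.lstsq(A[:n_design], y[:n_design], rcond=None)
    resid = y[n_design:] - A[n_design:] @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Hypothetical simulated data: 8 candidate predictors, 3 with real effects.
rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=n)

# Held-out RMSE for models with k = 0..p predictors.
rmse = [holdout_rmse(X, y, n_design=100, k=k) for k in range(p + 1)]
for k, r in enumerate(rmse):
    print(k, round(r, 3))
```

Unlike the in-sample SSE, the held-out RMSE is not guaranteed to fall as more variables are added, so it gives a different view of when extra variables stop helping.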

I don't see the point of arguing over methods to obtain a
model which will be used for nothing.

> > Sorry, NN terminology ( Iterative algorithms estimate weights
> > using training data). Try
> >
> > d. Number of design observations
> >
> > > > 4. Agree on measures of performance
> > > > etc.
> > > No such thing on Data Analysis approach.
> >
> > Referring to comparing the use of AIC, BIC, etc. to s.
>
> AIC, BIC are inferential approaches. NOT computational methods
> for OLS under the usual assumptions.

Obviously, those who introduced AIC & BIC into the thread, as well
as myself, are interested in the former. Since it seems that Frank's
rules were also concerned with inference, my previous statements
are OT.

> > > > I'm sure this has been done many times over. It would
> > > > be nice if someone could provide specific references
> > > > and/or run a computer experiment.
> > > >
> > > > Hope this helps.
> > >
> > > No. JMO.
>
> The above and what's in my response to Jerry and Bruce
> are the more detailed reasons why.
>
> > Probably because you didn't recognize that I was trying to
> > make sure that there was some agreement w.r.t. how to
> > compare s and Elbow with whatever alternatives others
> > wanted to suggest.
>
> I recognize the part of what you're TRYING. But you seem
> to overlook what I explained about Tukey's brand of Exploratory
> Data Analysis, and not formal inference, relating to the
> computational methods and the Elbow Rule that helps the
> selection of actual variables (and how many) to use, in a
> given problem.

No. I didn't overlook it, I just preferred to concentrate on the
questions whose answers I was trying to find. For example,
how does the Elbow rule compare with suggested alternatives.

Hope this helps.

Greg

Reef Fish

Feb 16, 2006, 2:46:47 AM

Greg Heath wrote:
> Reef Fish wrote:
> > My latest posts in the "Frank Harrell's 9 Points of 1995" in
> > response to Jerry Dallal and Bruce Weaver should have
> > answered all your question implicitly.
> >
> > You've been wrongly criticizing a COMPUTATIONAL
> > method as if it were a formal statistical inference method.

> > You claim to analyze data, and have not heard of the term
> > "raw data"?
>
> During my 47 years of data analysis I've never heard of the term.
> However, I do believe in the Easter Bunny.

That really cleared everything up about where you learned statistics
and where you got your ideas for this discussion. Tell your Easter
Bunny that he turned you loose a bit too early this year.


> That's where we differ. I'm an engineer (retired)

The engineers made it easy to prove that all odd integers are
prime, didn't they? "1 is a prime, 3 is a prime, 5 is a prime, 7 is
a prime, 9 is an experimental error, 11 is a prime, ..." Now I see
why you are so keen on doing experiments that have nothing
to do with the COMPUTATIONAL topic being discussed.

> > AIC, BIC are inferential approaches. NOT computational methods
> > for OLS under the usual assumptions.

> > > > > Hope this helps.
> > > >
> > > > No. JMO.
> >
> > The above and what's in my response to Jerry and Bruce
> > are the more detailed reasons why.
> >

> > I recognize the part of what you're TRYING. But you seem
> > to overlook what I explained about Tukey's brand of Exploratory
> > Data Analysis, and not formal inference, relating to the
> > computational methods and the Elbow Rule that helps the
> > selection of actual variables (and how many) to use, in a
> > given problem.
>
> No. I didn't overlook it, I just preferred to concentrate on the
> questions whose answers I was trying to find. For example,
> how does the Elbow rule compare with suggested alternatives.

Very well then, you can go back to your Easter Bunny, and work on
your topic for awhile. Besides, you have already used up all the
consulting time you paid so dearly for.

Best regards to your Easter Bunny Wabbit,

-- Bob.

Greg Heath

Feb 16, 2006, 1:04:14 PM

Reef Fish wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> > > My latest posts in the "Frank Harrell's 9 Points of 1995" in
> > > response to Jerry Dallal and Bruce Weaver should have
> > > answered all your question implicitly.

One reason some of these threads seem to go on forever is
that the replier frequently doesn't accurately read between the
lines of the poster. This thread is no exception.

> > > You've been wrongly criticizing a COMPUTATIONAL
> > > method as if it were a formal statistical inference method.

If you mean criticize, you are mistaken. I have no doubt that
the Elbow rule is informative. I have used what I call the "Scree
Plot Knee" to deduce p. However, I used s^2 = MSE instead of
s. If you check the back posts you will see that I thanked you
for making clear that you think using s is superior to using s^2.

However, others requested a comparison of using s with AIC,
BIC, etc., which you refused to consider. Since I've encountered
people using those and other criteria, I don't think it is
unreasonable either to reference a published comparison
in the spirit of the first reference you gave or to cite reasonable
parameters for those who wish to construct a demo of their own.

> > > You claim to analyze data, and have not heard of the term
> > > "raw data"?
> >
> > During my 47 years of data analysis I've never heard of the term.
> > However, I do believe in the Easter Bunny.
>
> That really cleared everything up about where you learned statistics
> and where you got your ideas for this discussion.

Not really. I spent 28 years applying statistical pattern recognition
to real world problems. Without a formal background in statistics,
I learned enough on my own to quite successfully solve problems
that baffled others. With respect to statisticians, my knowledge of
statistics is woefully inadequate. However, I am still trying to solve
real world problems in retirement and am anxious to fill those holes
that are relevant to my goals. That's why I'm not ashamed to ask
the questions I do in these sci.stat.* posts.

>Tell your Easter Bunny that he turned you loose a bit too early
> this year.

Hard to do. Won't show up for another month or two. What makes
you think it is a "he"?

> > That's where we differ. I'm an engineer (retired)
>
> The engineers made it easy to prove that all odd integers are
> prime, didn't they? "1 is a prime, 3 is a prime, 5 is a prime, 7 is
> a prime, 9 is an experimental error, 11 is a prime, ..." Now I see
> why you are so keen on doing experiments that have nothing
> to do with the COMPUTATIONAL topic being discussed.
>
> > > AIC, BIC are inferential approaches. NOT computational methods
> > > for OLS under the usual assumptions.

Nevertheless, regardless of the original purpose of AIC and BIC,
people are using them to determine p. With that in mind, a
request for a reference comparing their effectiveness is
reasonable.
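
For readers who want to try such a comparison themselves, here is a minimal sketch of how AIC and BIC are commonly computed for Gaussian OLS fits, up to an additive constant that is the same for every model. It is not a published comparison: the data are simulated, the function name `aic_bic_path` is hypothetical, and the variables enter in a fixed order, which is an assumption made only to keep the example short.

```python
import numpy as np

def aic_bic_path(X, y):
    """AIC and BIC (up to an additive constant) for OLS fits that use an
    intercept plus the first k columns of X, for k = 0..p."""
    n, p = X.shape
    aic, bic = [], []
    for k in range(p + 1):
        A = np.column_stack([np.ones(n), X[:, :k]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = float(np.sum((y - A @ coef) ** 2))
        # Intercept plus k slopes; also counting the error variance would
        # shift every model by the same amount and not change the ranking.
        n_par = k + 1
        aic.append(n * np.log(sse / n) + 2 * n_par)
        bic.append(n * np.log(sse / n) + np.log(n) * n_par)
    return np.array(aic), np.array(bic)

# Hypothetical simulated data: 8 candidate predictors, 3 with real effects.
rng = np.random.default_rng(2)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=n)

aic, bic = aic_bic_path(X, y)
print("p chosen by AIC:", int(np.argmin(aic)))
print("p chosen by BIC:", int(np.argmin(bic)))
```

Because the BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 whenever n is at least 8, the BIC choice along such a path is never larger than the AIC choice.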

> > > > > > Hope this helps.
> > > > >
> > > > > No. JMO.
> > >
> > > The above and what's in my response to Jerry and Bruce
> > > are the more detailed reasons why.
> > >
> > > I recognize the part of what you're TRYING. But you seem
> > > to overlook what I explained about Tukey's brand of Exploratory
> > > Data Analysis, and not formal inference, relating to the
> > > computational methods and the Elbow Rule that helps the
> > > selection of actual variables (and how many) to use, in a
> > > given problem.
> >
> > No. I didn't overlook it, I just preferred to concentrate on the
> > questions whose answers I was trying to find. For example,
> > how does the Elbow rule compare with suggested alternatives.
>
> Very well then, you can go back to your Easter Bunny, and work on
> your topic for awhile. Besides, you have already used up all the
> consulting time you paid so dearly for.

I paid nothing; a fair price.



> Best regards to your Easter Bunny Wabbit,

Will pass it on.

Greg

Reef Fish

Feb 16, 2006, 3:37:23 PM

Greg Heath wrote:
> Reef Fish wrote:
> > Greg Heath wrote:
> > > Reef Fish wrote:
> > > > My latest posts in the "Frank Harrell's 9 Points of 1995" in
> > > > response to Jerry Dallal and Bruce Weaver should have
> > > > answered all your question implicitly.
>
> One reason some of these threads seem to go on forever is
> that the replier frequently doesn't accurately read between the
> lines of poster. This thread is no exception.

And you're about to demonstrate how true your statement is,
about yourself!


>
> > > > You've been wrongly criticizing a COMPUTATIONAL
> > > > method as if it were a formal statistical inference method.

Uh that was with reference to

> > > > My latest posts in the "Frank Harrell's 9 Points of 1995" in
> > > > response to Jerry Dallal and Bruce Weaver should have
> > > > answered all your question implicitly.

Nobody here heard of the Elbow Rule until 2006, and Harrell's
criticism was about the COMPUTATIONAL methods of
variable selection such as stepwise regression. Nothing about
the Elbow Rule at all.

Part of the confusion is due to the fact that Harrell's 9 points
were brought up by Jerry in the Elbow Rule thread, which talks about
how it's used with COMPUTATIONAL methods of variable selection.

I didn't mean you criticized the Elbow Rule. I referred to your sharing
Harrell's confusion about a "computational tool" that is not intended
to be used in any formal "statistical inference".

In any event, I don't think there's much to be said on either side or
all sides about Harrell's 9 Points.

>
> If you mean criticize, you are mistaken. I have no doubt that the
> Elbow rule is informative. I have used what I call the "Scree
> Plot Knee" to deduce p. However, I used s^2 = MSE instead of
> s. If you check the back posts you will see that I thanked you
> for making clear that you think using s is superior to using s^2.

Actually the difference between using s and s^2 is almost negligible
relative to the elbow rule because they both turn up at the
same place. s or s^2 is better than SSE which is monotonically
non-increasing. So, if you thanked me, you thanked the wrong
thing.
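
The claim above can be checked numerically. The sketch below is an illustration on simulated data, with a fixed order of entry for the variables (an assumption; the thread's methods choose the order by variable selection) and a hypothetical helper name `fit_path`. It computes SSE, s^2 = SSE/(n - k - 1) and s for k = 0..p predictors.

```python
import numpy as np

def fit_path(X, y):
    """SSE, s^2 and s for OLS fits using an intercept plus the first
    k columns of X, for k = 0..p."""
    n, p = X.shape
    sse, s2 = [], []
    for k in range(p + 1):
        A = np.column_stack([np.ones(n), X[:, :k]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        sse_k = float(resid @ resid)
        sse.append(sse_k)
        s2.append(sse_k / (n - k - 1))  # residual mean square (MSE)
    sse, s2 = np.array(sse), np.array(s2)
    return sse, s2, np.sqrt(s2)

# Hypothetical simulated data: 8 candidate predictors, 3 with real effects.
rng = np.random.default_rng(3)
n, p = 40, 8
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=n)

sse, s2, s = fit_path(X, y)
for k in range(p + 1):
    print(k, round(sse[k], 2), round(s2[k], 3), round(s[k], 3))
```

SSE can only stay flat or fall as variables are added, so it never turns up. Since s is a monotone function of s^2, the two always reach their minimum at the same k.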


> > > > You claim to analyze data, and have not heard of the term
> > > > "raw data"?

Seriously, that's a term in such common usage, meaning the data before
it is edited, transformed, or massaged. (Hence "raw", possibly
related to "cooked" by those who use a "cook book" in statistics --
heard of that one, haven't you?)

> > >
> > > During my 47 years of data analysis I've never heard of the term.
> > > However, I do believe in the Easter Bunny.
> >
> > That really cleared everything up about where you learned statistics
> > and where you got your ideas for this discussion.

That's a bit of humor/sarcasm thrown in to lighten the mood, as
in my reference to the engineer's proof. No need to get all worked
up about it.


>
> Not really. I spent 28 years applying statistical pattern recognition
> to real world problems. Without a formal background in statistics,
> I learned enough on my own to quite successfully solve problems
> that baffled others. With respect to statisticians, my knowledge of
> statistics is woefully inadequate. However, I am still trying to solve
> real world problems in retirement and am anxious to fill those holes
> that are relevant to my goals. That's why I'm not ashamed to ask
> the questions I do in these sci.stat.* posts.
>
> >Tell your Easter Bunny that he turned you loose a bit too early
> > this year.
>
> Hard to do. Won't show up for another month or two. What makes
> you think it is a "he"?

Because his name is Harvey?

< snip >

> > Very well then, you can go back to your Easter Bunny, and work on
> > your topic for awhile. Besides, you have already used up all the
> > consulting time you paid so dearly for.
>
> I paid nothing; a fair price.

A bit humor-impaired eh?

And you certainly got what you paid for -- that was the idea.

> > Best regards to your Easter Bunny Wabbit,
>
> Will pass it on.
>
> Greg

-- Bob.

Greg Heath

Feb 17, 2006, 3:08:43 AM

Reef Fish wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> > > Greg Heath wrote:
> > > > Reef Fish wrote:
> > > > > My latest posts in the "Frank Harrell's 9 Points of 1995" in
> > > > > response to Jerry Dallal and Bruce Weaver should have
> > > > > answered all your question implicitly.
> >
> > One reason some of these threads seem to go on forever is
> > that the replier frequently doesn't accurately read between the
> > lines of poster. This thread is no exception.
>
> And you're about to demonstrate how true your statement is,
> about yourself!

Among others.

Again a miscommunication. I use s^2 which I inferred to be inferior
to s after reading:

%%%%%%% BEGIN INSERT %%%%%%%%%%%%%%%%

From: Reef Fish
Date: Mon, Feb 13 2006 2:01 pm

> To bring it back to the original post (the suggestion of plotting the
> standard deviation of residuals against p and looking for a
> change-point), the decrease in s after the changepoint is just due
> to random associations with variables. If we could find a relationship
> for how large this change is (i.e. the slope of the relationship with
> p), then we could add that to s, and find the minimum. i.e. we penalise
> model fit with a complexity term. Of course, this is what AIC and BIC
> do, but they use s^2 rather than s, and they differ in how large they
> think the term should be.

Anon Bob O'Hara,

You are indeed true to form in your post. Your paragraph merely
showed that you missed ALL the important points I've discussed. Why
s is better than s^2, ...

%%%%%%% END INSERT %%%%%%%%%%%%%%%%

> > > > > You claim to analyze data, and have not heard of the term
> > > > > "raw data"?
>
> Seriously, that's such a term of common usage to mean before the
> data is edited, transformed, or massaged. (Hence "raw" possibly
> related to "cooked" by those who use "cook book" in statistics --
> heard of that one haven't you?

Seriously, I was just trying to clear up in my mind whether you were
advocating real world data or computer simulated data which would
come from sampling given population marginals and correlation
structure.

> > > > During my 47 years of data analysis I've never heard of the term.
> > > > However, I do believe in the Easter Bunny.
> > >
> > > That really cleared everything up about where you learned statistics
> > > and where you got your ideas for this discussion.
>
> That's a bit of humor/sarcasm thrown in to lighten the mood, as
> in my reference to the engineer's proof. No need to get all worked
> up about it.

I accept your apology.

> > Not really. I spent 28 years applying statistical pattern recognition
> > to real world problems. Without a formal background in statistics,
> > I learned enough on my own to quite successfully solve problems
> > that baffled others. With respect to statisticians, my knowledge of
> > statistics is woefully inadequate. However, I am still trying to solve
> > real world problems in retirement and am anxious to fill those holes
> > that are relevant to my goals. That's why I'm not ashamed to ask
> > the questions I do in these sci.stat.* posts.
> >
> > >Tell your Easter Bunny that he turned you loose a bit too early
> > > this year.
> >
> > Hard to do. Won't show up for another month or two. What makes
> > you think it is a "he"?
>
> Because his name is Harvey?
>
> < snip >
>
> > > Very well then, you can go back to your Easter Bunny, and work on
> > > your topic for awhile. Besides, you have already used up all the
> > > consulting time you paid so dearly for.
> >
> > I paid nothing; a fair price.
>
> A bit humor-impaired eh?

Different strokes for different folks.

Greg
-----SNIP

Reef Fish

Feb 17, 2006, 7:18:12 AM

Greg, you are almost as hopeless as Anon Bob, and catching up fast.

Greg Heath wrote:
> >
> > In any event, I don't think there's much to be said on either side or
> > all sides about Harrell's 9 Points.

Greg persisted ...


> >
> > > If you mean criticize, you are mistaken. I have no doubt that the
> > > Elbow rule is informative. I have used what I call the "Scree
> > > Plot Knee" to deduce p. However, I used s^2 = MSE instead of
> > > s. If you check the back posts you will see that I thanked you
> > > for making clear that you think using s is superior to using s^2.
> >
> > Actually the difference between using s and s^2 is almost negligible
> > relative to the elbow rule because they both turn up at the
> > same place. s or s^2 is better than SSE which is monotonically
> > non-increasing. So, if you thanked me, you thanked the wrong
> > thing.
>
> Again a miscommunication. I use s^2 which I inferred to be inferior
> to s after reading:

That's only because what you cited is OUT OF CONTEXT, as usual!

s^2 is NOT inferior in terms of the turn-up behavior, as I explained
above.

It's inferior in the CONTEXT that it's in the wrong unit to serve as the
yardstick for Y or the prediction intervals! ENTIRELY different contexts,
as lawyer-wannabe Greg, and statistician-wannabe Greg, goofed
again.
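
The point about units can be seen in a small sketch. It is illustrative only: the data are simulated, the metre and centimetre units are made up for the example, and `residual_sd` is a hypothetical helper name.

```python
import numpy as np

def residual_sd(X, y):
    """Residual standard deviation s from an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.sqrt(resid @ resid / (n - p - 1)))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
# Hypothetical response measured in metres, then the same response in cm.
y_m = 10.0 + X @ np.array([3.0, -1.0]) + rng.normal(size=50)
y_cm = 100.0 * y_m

s_m, s_cm = residual_sd(X, y_m), residual_sd(X, y_cm)
print(s_cm / s_m)          # about 100: s follows the units of Y
print(s_cm**2 / s_m**2)    # about 10000: s^2 is in squared units
```

Because s is on the same scale as Y, it can be read directly against the spread of Y or the width of a prediction interval, which s^2 cannot.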


>
> %%%%%%% BEGIN INSERT %%%%%%%%%%%%%%%%
>
> From: Reef Fish
> Date: Mon, Feb 13 2006 2:01 pm
>
> > To bring it back to the original post (the suggestion of plotting the
> > standard deviation of residuals against p and looking for a
> > change-point), the decrease in s after the changepoint is just due
> > to random associations with variables. If we could find a relationship
> > for how large this change is (i.e. the slope of the relationship with
> > p), then we could add that to s, and find the minimum. i.e. we penalise
> > model fit with a complexity term. Of course, this is what AIC and BIC
> > do, but they use s^2 rather than s, and they differ in how large they
> > think the term should be.
>
> Anon Bob O'Hara,
>
> You are indeed true to form in your post. Your paragraph merely
> showed that you missed ALL the important points I've discussed. Why
> s is better than s^2, ...
>
> %%%%%%% END INSERT %%%%%%%%%%%%%%%%

You missed it too, Greg. TWICE, in two different contexts about
what's better between "s and s^2". Join the club.


> > > > > During my 47 years of data analysis I've never heard of the term.
> > > > > However, I do believe in the Easter Bunny.
> > > >
> > > > That really cleared everything up about where you learned statistics
> > > > and where you got your ideas for this discussion.
> >
> > That's a bit of humor/sarcasm thrown in to lighten the mood, as
> > in my reference to the engineer's proof. No need to get all worked
> > up about it.
>
> I accept your apology.

That was no apology -- that was an exhibit showing that you were
"humor-impaired" (a term used later) for another instance of the same:

RF to Greg> > A bit humor-impaired eh?

There is no need and no point for you to apologize.

As far as these regression threads go, Greg, you have gone FAR
beyond the "productive Elbow," continued downwards into the world of
your Easter Bunny Wabbit fantasy, and continued FAR beyond that
with your own contrived confusion from your misreading, because your
statistician-wannabe aspiration failed, and your lawyer-wannabe
exhibit succeeded only in shooting your own foot.

But you still have promise in being an Easter Bunny and a fiction
writer of children's fantasy books. But I think you have a better
forum for those discussions in some other newsgroups, such as
those in which you'd been active for several years before you
accidentally stumbled into these sci.stat.* groups this year, in
the "basic regression question" thread in sci.stat.consult about
January 19, 2006?

JMHSHO.

> Different strokes for different folks.
>
> Greg

That's fer shoa! I think you stroke much better in them thar
comp.ai.neural.nets and comp.soft.sys.matlab and other
non-statistics groups, don't you think?

-- Bob.

Greg Heath

Feb 18, 2006, 2:21:05 AM

Reef Fish wrote:
> Greg, you are almost as hopeless as Anon Bob, and catching up fast.
>
> Greg Heath wrote:
> > >
> > > In any event, I don't think there's much to be said on either side or
> > > all sides about Harrell's 9 Points.

I ADMIT it!

It is HOPELESS for us to have a meaningful exchange when
you attribute YOUR QUOTE to me! It is so HOPELESS that I've
been driven to use ALL CAPS and EXCLAMATION points!

> Greg persisted ...
> > >
> > > > If you mean criticize, you are mistaken. I have no doubt that the
> > > > Elbow rule is informative. I have used what I call the "Scree
> > > > Plot Knee" to deduce p. However, I used s^2 = MSE instead of
> > > > s. If you check the back posts you will see that I thanked you
> > > > for making clear that you think using s is superior to using s^2.
> > >
> > > Actually the difference between using s and s^2 is almost negligible
> > > relative to the elbow rule because they both turn up at the
> > > same place. s or s^2 is better than SSE which is monotonically
> > > non-increasing. So, if you thanked me, you thanked the wrong
> > > thing.

Thing? I tried to thank a human.

> > Again a miscommunication. I use s^2 which I inferred to be inferior
> > to s after reading:

Hey! What did you do with the SMOKING GUN quote?

> That's only because what you cited is OUT OF CONTEXT, as usual!
>
> S^2 is NOT inferior in terms of the turn-up behavior as I explained
> above.

Make up your mind.

It's obvious that s and s^2 have minima at the same value, say P*.
However, we agree that the best choice is Pelbow < P*. Since
the position of the elbow is a subjective choice, one person using s
might choose a different value than another using s^2.

> It's inferior in the CONTEXT that it's in the wrong unit to serve as the
> yardstick for Y or the prediction intervals!

Oh. Is that what you MEANT to say?

> ENTIRELY different contexts,
> as lawyer-wannabe Greg, and statistician-wannabe Greg, goofed
> again.
> >
> > %%%%%%% BEGIN INSERT %%%%%%%%%%%%%%%%
> >
> > From: Reef Fish
> > Date: Mon, Feb 13 2006 2:01 pm
> >
> > > To bring it back to the original post (the suggestion of plotting the
> > > standard deviation of residuals against p and looking for a
> > > change-point), the decrease in s after the changepoint is just due
> > > to random associations with variables. If we could find a relationship
> > > for how large this change is (i.e. the slope of the relationship with
> > > p), then we could add that to s, and find the minimum. i.e. we penalise
> > > model fit with a complexity term. Of course, this is what AIC and BIC
> > > do, but they use s^2 rather than s, and they differ in how large they
> > > think the term should be.
> >
> > Anon Bob O'Hara,
> >
> > You are indeed true to form in your post. Your paragraph merely
> > showed that you missed ALL the important points I've discussed. Why
> > s is better than s^2, ...
> >
> > %%%%%%% END INSERT %%%%%%%%%%%%%%%%
>
> You missed it too, Greg. TWICE, in two different contexts about
> what's better between "s and s^2". Join the club.

I guess I did miss it. So far I can only think of ONE context that
you have finally made clear.

Please find a tutorial on how to use a search engine.

Of my posts to several hundred sci.stat.* threads, the first was in
July 1996. Curiously, only three of those had contributions by you:
the current two and one in July 2005 where we appeared to agree
(Imagine!).

> JMHSHO.
>
> > Different strokes for different folks.
> >
> > Greg
>
> That's fer shoa! I think you stroke much better in them thar
> comp.ai.neural.nets and comp.soft.sys.matlab and other
> non-statistics groups, don't you think?

Yes.

Primarily, I am there to help and am here to learn.

Thanks for making everything so clear.

Don't you just hate it when you are trying to learn something
and have to read between the lines to guess the context of an
omniscient-wannabee poster?

Jiminy Crickets! Threads like that could go on forever.

Happy Easter,

Greg

Anon.

Feb 18, 2006, 2:40:20 AM

Greg Heath wrote:
> Reef Fish wrote:
>
<snip>

>>>>Actually the difference between using s and s^2 is almost negligible
>>>>relative to the elbow rule because they both turn up at the
>>>>same place. s or s^2 is better than SSE which is monotonically
>>>>non-increasing. So, if you thanked me, you thanked the wrong
>>>>thing.
>
>
> Thing? I tried to thank a human.
>
Well done, Reef Fish! You've passed the Turing test!

:-)

Reef Fish

Feb 18, 2006, 10:10:04 PM

In the post of Greg's that Anon Bob O'Hara cited, my opening line was:

> Greg, you are almost as hopeless as Anon Bob, and catching up fast.

I read ALL posts in the threads I started, and that's the only reason
Bob O'Hara's was read, in case it's one of those rare occasions when he
actually has something sensible to say.


Anon. wrote:
> Greg Heath wrote:
> > Reef Fish wrote:
> >
> <snip>
> >>>>Actually the difference between using s and s^2 is almost negligible
> >>>>relative to the elbow rule because they both turn up at the
> >>>>same place. s or s^2 is better than SSE which is monotonically
> >>>>non-increasing. So, if you thanked me, you thanked the wrong
> >>>>thing.
> >
> >
> > Thing? I tried to thank a human.
> >
> Well done, Reef Fish! You've passed the Turing test!

You have just PROVEN my statement to Greg about how hopeless
you are. I guess you are afraid Greg might catch up with you.

Don't worry, you are leading the race of hopelessness and
ignorance in statistics by an insurmountable margin over Greg now.
>
> :-)


>
> --
> Bob O'Hara
> Department of Mathematics and Statistics
> P.O. Box 68 (Gustaf Hällströmin katu 2b)
> FIN-00014 University of Helsinki
> Finland

In Greg's case, he was never a statistician, and only stumbled into
a statistics forum sci.stat.consult by mistake.

You, Bob O'Hara, are not only making a pest of yourself by your
persistent and continuing display of your ignorance in topics
in statistics; your latest acts of frivolity should put not only YOU
but your entire Department of Mathematics and STATISTICS
and the University of Helsinki to shame, for having such an
ignorant buffoon.

buf·foon ( bə-fūn ' ) n. A clown; a jester: a court buffoon

You should join the Dysfunctional Gang in the other group you've
seen. Some of them even know more about statistics than you
do. But your dysfunctional mentality is very compatible with theirs.

-- Bob.
