
Subset model selection in regression: Negative values of Mallows' Cp


hansra...@gmail.com

Jul 9, 2016, 4:43:35 PM
Hi,

I want some clarification regarding the use of the Cp statistic (Mallows, 1973) in selecting best subset models in multiple regression.

It is suggested that, to select the best model from a group of regression models of varying size, the criterion Cp ≈ p should be used, where Cp is Mallows' statistic and p is the number of variables in the regression model plus one (for the constant).
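For concreteness, Mallows (1973) defines Cp = SSE_p / s² − n + 2p, where SSE_p is the residual sum of squares of the subset model, s² is the error-variance estimate (MSE) from the full model, and n is the number of observations. A minimal sketch (the numbers below are illustrative, not from this thread):

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Mallows' Cp: SSE_p / s^2 - n + 2p.

    sse_p   : residual sum of squares of the subset model
    s2_full : error-variance estimate (MSE) from the full model
    n       : number of observations
    p       : number of parameters in the subset model (incl. intercept)
    """
    return sse_p / s2_full - n + 2 * p

# Toy numbers. A subset model whose SSE is close to (n - p) * s2_full
# gives Cp close to p (roughly unbiased) ...
print(mallows_cp(sse_p=188.0, s2_full=4.0, n=50, p=3))  # → 3.0
# ... while an SSE below (n - 2p) * s2_full makes Cp negative,
# which is how Cp - p can end up below zero.
print(mallows_cp(sse_p=160.0, s2_full=4.0, n=50, p=3))  # → -4.0
```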

But it is possible to get negative values for Cp, in which case Cp − p becomes even more negative. So now I have some models with Cp − p approximately 0 and some with Cp − p < 0. A few subsequent books (Weisberg, 1985) attribute to Mallows (1973) the rule that models with negative values of Cp − p are good models, although there is nothing approximating such a statement in the paper.

How should I proceed in such cases?

How acceptable is this rule: choose the model with the smallest Cp − p (even if it is negative and may go as low as −4)? Kindly also forward any reference for this clarification.

Best regards
Hansraj

Rich Ulrich

Jul 10, 2016, 12:52:11 AM
The Wikip article on Mallows's Cp says to choose the model where
Cp approaches p from above. The article does not say which of the
half-dozen references that advice may have come from.

For a good overview of "stepwise" models, please Google to
find Frank Harrell's commentary, which I promoted here after
he posted it here, many years ago. It points out problems
with stepwise procedures.


My own opinion: You can't use a simple rule like Cp, in general.

For instance, if there are randomly-gathered, irrelevant variables,
step-wise procedures will, after the first variable or two, prefer
bad models over good models -- because the random prediction
observed for "totally irrelevant" variables will not be weakened
by correlation with the obvious predictors.

And there should be a more reasoned approach when you use
variables that were designed for the question. Like, finding
sensible latent factors and/or creating composite variables
to increase the reliability of assessments. But there is no answer
that is one-size-fits-all.

The "modern" approach to such multiple models is some version
of massive over-sampling so that the models can be tested by
extensive cross-validation. And then one still needs to be wary of
obtaining residuals that are smaller than what makes sense.
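As one concrete version of that advice, here is a sketch of scoring candidate subsets by K-fold cross-validated prediction error instead of Cp. Everything here is illustrative (synthetic data, pure NumPy OLS): columns 0 and 1 carry real signal, columns 2 and 3 are noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_folds = 120, 5
X = rng.normal(size=(n, 4))                      # cols 0-1 relevant, 2-3 noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def cv_mse(cols):
    """Mean squared prediction error of an OLS fit on the given columns,
    estimated by K-fold cross-validation."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        # Fit OLS (with intercept) on the training fold only
        Xt = np.column_stack([np.ones(len(train)), X[train][:, cols]])
        beta, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)
        # Score on the held-out fold
        Xv = np.column_stack([np.ones(len(f)), X[f][:, cols]])
        errs.append(np.mean((y[f] - Xv @ beta) ** 2))
    return np.mean(errs)

# Nested candidate subsets; the (0, 1) model should score near the
# noise variance, and the extra noise columns should not help.
for cols in [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]:
    print(cols, round(cv_mse(cols), 3))
```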

--
Rich Ulrich

hansra...@gmail.com

Jul 10, 2016, 9:36:29 AM

The answer did not help.

My question is: when one is using Mallows' Cp, what is the acceptable norm (provided the sample size is decent and the explanatory variables used in the multiple regression can reasonably be expected to affect the dependent variable)? Of course, the Cp statistic has its shortfalls, and there are other comparable criteria, such as AIC and BIC, that can be used for model selection.

It would help if someone could answer with respect to the Cp statistic only.

Cheers,
Hansraj

Rich Ulrich

Jul 10, 2016, 1:50:11 PM
On Sun, 10 Jul 2016 06:36:24 -0700 (PDT), hansra...@gmail.com
wrote:

>
>The answer did not help.
>
>My question is that when one is using Mallows' Cp what is the acceptable norm (provided that sample size is decent and the explanatory variables used in multiple regression are reasonably expected to affect the dependent var). Of course, Cp statistic has its shortfalls and there are other arguable statistics like AIC and BIC that can be used for model selection.
>
>Would help if someone can answer with respect to the Cp statistic only.


The Wikip article asserts, early on: "Mallows's Cp has been shown to
be equivalent to Akaike information criterion in the special case of
Gaussian linear regression.[3]" And that [3] is a 2013 reference. So
Cp and AIC must be equally questionable.
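For the known-variance Gaussian case, the equivalence is easy to see by hand: up to an additive constant, AIC = SSE/s² + 2p while Cp = SSE/s² − n + 2p, so the two differ by the constant n and order models identically. A quick numerical check, with toy (SSE, p) pairs of my own invention:

```python
def cp(sse, s2, n, p):
    # Mallows' Cp
    return sse / s2 - n + 2 * p

def aic_known_var(sse, s2, p):
    # Gaussian AIC with known variance, up to an additive constant
    return sse / s2 + 2 * p

n, s2 = 50, 4.0
models = [(188.0, 3), (160.0, 3), (210.0, 2), (150.0, 5)]   # (SSE, p)
cps  = [cp(sse, s2, n, p) for sse, p in models]
aics = [aic_known_var(sse, s2, p) for sse, p in models]
# The two criteria differ only by the constant n, so the model
# that minimizes one minimizes the other.
print([a - c for a, c in zip(aics, cps)])  # → [50.0, 50.0, 50.0, 50.0]
```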


You mean, "How should you proceed"? - Take the advice and
ignore models with Cp < p because that /does/ indicate over-fitting.
Logically. I can see it. Your problem seems to be that you don't find
that advice explicitly in Mallows's original paper. And you wonder if
this is a myth that has survived the last 40 years without being
questioned. Well, what got questioned (as I see it) is the worth of
that sort of judgement, so the details became moot. Perhaps you
could write to one of the authors who seemed to mis-assign credit
to him, if you wish to persevere.


This SPSS group, and the stats groups also, tended towards the
social sciences when they were active; this SPSS group is especially
for SPSS advice, though we do what we can on the other questions.

I pointed out that the whole area of "stepwise" is now regarded with
some skepticism in the social sciences. If your particular special
area still likes stepwise, then you should probably search for advice
or examples in the literature of your own area.

There are textbooks on data-mining, where "selection" is a renewed
problem. I can't say whether I ran across Cp in my textbook scanning
of data-mining, years ago. But I think that cross-validation was the
general advice.

--
Rich Ulrich

hansra...@gmail.com

Jul 10, 2016, 7:08:44 PM
Thanks for the advice Rich.