traditional vs modern selection methods


Daniel Smith

Oct 16, 2015, 11:35:14 PM
to regmod
Dear Frank,

I have read recently that many statisticians consider old-fashioned selection methods such as backwards / forwards / stepwise selection to be something of a joke in the light of modern alternatives. Is it still valid to, say, fit a regression model and then drop terms by looking at the change in significance using ANOVA? I do still see fastbw in rms, and I have seen it used in places. Can you give some guidance on whether it should be used at all? From my reading it seems that penalized regression can fit the model and perform variable selection simultaneously, by shrinking parameter estimates to zero so that those terms fall out of the model (is this correct?). So why bother with backwards selection?

Thanks!

Frank Harrell

Oct 17, 2015, 8:58:10 AM
to regmod
Loaded questions!

Using statistical significance, AIC, Cp, or BIC for model selection backfires whether forwards, backwards, or all-possible-subsets selection is being used.  fastbw exists for the better-behaved approach of model approximation, and for minor trimming of highly insignificant variables in a context where the bootstrap can properly penalize for such data dredging.  Using P-values to drop or select variables, whether they come from anova or not, has all the same problems.  fastbw also exists to satisfy users who really want to walk the tightrope.
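
As a minimal sketch of that workflow (the data frame d and predictors x1-x4 below are illustrative, not from the original post), fastbw can be applied to a full rms fit, and the same stepdown can be repeated inside each bootstrap resample so the validation pays for the data dredging:

library(rms)

full <- ols(y ~ x1 + x2 + x3 + x4, data = d, x = TRUE, y = TRUE)

## Fast backward elimination on the full fit; here used only for minor
## trimming of clearly insignificant terms, not for "finding" the model.
fastbw(full, rule = "p", sls = 0.5)

## Bootstrap validation that repeats the backward stepdown in every
## resample, so the optimism estimate reflects the selection process.
validate(full, B = 200, bw = TRUE, rule = "p", sls = 0.5)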

Methods that simultaneously shrink and select variables (e.g., lasso and elastic net) make the variable selection "honest" by penalizing the result for the number of candidate features.  But a serious problem still lurks.  Just as with naive unpenalized variable selection, the newer methods have a low probability of finding the truly important variables, especially when collinearity exists.  This is exposed by running the algorithm multiple times using the bootstrap and noticing that the list of features "selected" is highly unstable.
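
One way to see that instability is to rerun the lasso on bootstrap resamples and tabulate how often each variable survives.  This sketch assumes a numeric predictor matrix X and response vector y (purely illustrative names) and uses glmnet:

library(glmnet)
set.seed(1)

sel <- replicate(200, {
  i  <- sample(nrow(X), replace = TRUE)      # bootstrap resample of the rows
  cv <- cv.glmnet(X[i, ], y[i], alpha = 1)   # lasso; lambda chosen by cross-validation
  b  <- coef(cv, s = "lambda.1se")[-1, 1]    # coefficients at chosen lambda, intercept dropped
  b != 0                                     # TRUE for variables kept in this resample
})

## Selection proportion per variable; widely varying proportions
## signal that the "selected" list is unstable.
rowMeans(sel)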

The desire for parsimony generally creates a large number of problems: loss of predictive discrimination, arbitrariness, instability, and complete noise in the presence of collinearity.

The old ridge regression (quadratic penalty with no attempt at parsimony) approach generally gives the best predictive discrimination while helping to get good calibration.  Here the analyst gives up on trying to find "the" variables.  The model can sometimes be simplified using model approximation (pre-conditioning).
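
For concreteness, a ridge fit followed by model approximation might look like the following sketch; X, y, and the choice of x1 and x2 as the approximating subset are again hypothetical (X is assumed to have named columns):

library(glmnet)

ridge <- cv.glmnet(X, y, alpha = 0)                   # alpha = 0: pure quadratic (L2) penalty, no selection
phat  <- predict(ridge, newx = X, s = "lambda.min")   # predictions from the full penalized model

## Model approximation (pre-conditioning): regress the penalized predictions
## on a smaller set of predictors to get a simpler equation for presentation.
d2 <- data.frame(phat = as.numeric(phat), X)
approx_fit <- lm(phat ~ x1 + x2, data = d2)
summary(approx_fit)   # R^2 shows how faithfully the simpler model tracks the full one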

This sort of general question has been addressed at stats.stackexchange.com also.