If I have a set of data with many variables used to predict a single
output variable, what kinds of methods are available to do this? As I
understand it, this is much more complicated than standard regression
modelling with just one input, one output variable.
--
-------------------------------------------------------------------
Chris Quigley,Advanced Technology Centre,University of Warwick,
Coventry CV4 7AL,United Kingdom.
: If I have a set of data with many variables used to predict a single
: output variable, what kind of methods are available to do this. As I
: understand it, this is much more complicated than standard regression
: modelling with just one input, one output variable.
What you're referring to is called "multiple regression" ("multivariate
regression" refers to regression models with multiple *dependent*
variables), and it really isn't different from linear regression with a
single independent variable (which is just a special case of the more
general method). The main differences are that you have to worry about
high correlation between independent variables making it difficult to
distinguish their contributions to the dependent variable
(multicollinearity), and that you have to take into account different
scaling among the independent variables (done by standardizing
coefficients).
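
To make those two concerns concrete, here is a minimal sketch in Python
with NumPy (my choice of tool, not something named in the thread). The
data are simulated, and the helper names (fit_ols,
standardized_coefficients, variance_inflation_factors) are made up for
illustration. It fits a multiple regression by least squares, rescales
the slopes to standard-deviation units, and computes variance inflation
factors as one common way to flag multicollinearity.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on X plus an intercept; returns (intercept, slopes)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def r_squared(X, y):
    intercept, slopes = fit_ols(X, y)
    resid = y - (intercept + X @ slopes)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def standardized_coefficients(X, y):
    """Slopes in standard-deviation units: b_j * sd(x_j) / sd(y)."""
    _, slopes = fit_ols(X, y)
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other columns."""
    return np.array([
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1, so collinear with it
x3 = 100.0 * rng.normal(size=n)      # independent of x1, x2, but on a larger scale
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 0.03 * x3 + rng.normal(size=n)

_, raw_slopes = fit_ols(X, y)
std_coefs = standardized_coefficients(X, y)
vif = variance_inflation_factors(X)
print("raw slopes:         ", raw_slopes)
print("standardized slopes:", std_coefs)
print("VIFs:               ", vif)
```

The raw slope on x3 looks tiny only because x3 is measured on a large
scale; the standardized slope puts it on the same footing as the others.
The large VIFs on x1 and x2 are the warning that their separate
contributions are poorly determined.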
> If I have a set of data with many variables used to predict a
> single output variable, what kind of methods are available to
> do this. As I understand it, this is much more complicated
> than standard regression modelling with just one input, one
> output variable.
It's not really more complicated, it's just harder to select the correct
model. With N independent variables, you have 2^N possible sets of
predictors; when you consider transformations and cross-products, the set
of possible models multiplies quickly. Random chance dictates that some of
these models will appear to fit your data well, even if there is no real
effect.
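
A rough illustration of that point, as a sketch in Python with NumPy on
simulated data (the sizes n = 30 and N = 10 are arbitrary assumptions):
the response below is pure noise, unrelated to any predictor, yet the
best-fitting subset of each size still shows a nonzero R-squared simply
because many subsets were tried.

```python
import itertools
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n, N = 30, 10
X = rng.normal(size=(n, N))
y = rng.normal(size=n)        # no real relationship with any column of X

# Every subset of the N predictors, including the empty one: 2^N in total.
subsets = [s for k in range(N + 1) for s in itertools.combinations(range(N), k)]
print("number of candidate subsets:", len(subsets))

# Best apparent fit among subsets of size 1, 2 and 3.
best_r2 = {}
for s in subsets:
    if 1 <= len(s) <= 3:
        best_r2[len(s)] = max(best_r2.get(len(s), 0.0), r_squared(X[:, list(s)], y))
for size, value in sorted(best_r2.items()):
    print(f"best R^2 with {size} noise predictor(s): {value:.3f}")
```

Adding transformations and cross-products as extra columns would enlarge
the search further and push these best-case values higher still.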
A simple first approach is stepwise multiple regression, available from
any standard statistics package. You run linear regressions, adding and
subtracting independent variables one at a time until you arrive at a
satisfactory solution.
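
For readers who have not seen the procedure, here is a simplified sketch
in Python with NumPy. It is forward-only (it adds variables but never
removes them), and the F-to-enter threshold of 4.0 is an assumed
rule-of-thumb cutoff, not the default of any particular package. The
function name forward_stepwise and the simulated data are made up for
illustration.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Add, one at a time, the variable with the largest partial F statistic.

    Stops when no remaining variable has F >= f_to_enter. Assumes n > p + 1.
    """
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        trial = {j: rss(X, y, selected + [j]) for j in candidates}
        best = min(trial, key=trial.get)
        df_resid = n - len(selected) - 2   # intercept + selected + the new variable
        f_stat = (current - trial[best]) / (trial[best] / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current = trial[best]
    return selected

# Simulated data: only columns 0 and 4 actually affect y.
rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(size=n)

selected = forward_stepwise(X, y)
print("selected columns, in order of entry:", selected)
```

Whether any noise columns also get in depends on the particular sample,
which is part of why the procedure draws the objections raised later in
this thread.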
Aaron C. Brown
New York, NY
<< on the question of modeling multiple variables... stuff deleted>>
: It's not really more complicated, it's just harder to select the correct
: model. With N independent variables, you have 2^N possible sets of
: predictors; when you consider transformations and cross-products, the set
: of possible models multiplies quickly. Random chance dictates that some of
: these models will appear to fit your data well, even if there is no real
: effect.
: A simple first approach is stepwise multiple regression, available from
: any standard statistics package. You run linear regressions, adding and
: subtracting independent variables one at a time until you arrive at a
: satisfactory solution.
-- I would be interested in hearing when or why you think that
'stepwise' is useful, since various *objections* have comprised
95% of the comments that I have read in the several Usenet ".stats"
groups. If you REALLY know that all your variables are relevant,
then you can use stepwise to get a non-redundant set. (Or better, use
"all-possible-regressions.") But usually, the user has too many
variables, and not enough knowledge about them, so the result is
a set of variables that do badly, even compared to variables
chosen by chance.
Here are references and comments from Frank Harrell, which I have
re-posted previously in one group or another.
Rich Ulrich, wpi...@pitt.edu
================================================================
Frank E Harrell Jr f...@biostat.mc.duke.edu
Associate Professor of Biostatistics
Division of Biometry Duke University Medical Center
----------------------------------------------------------------------
From: f...@duke.edu (Frank Harrell)
Newsgroups: sci.stat.consult
Subject: Reasons not to do stepwise (or all possible regressions)
Date: 19 Feb 1996 19:22:19 GMT
Organization: Duke University, Durham, NC, USA
Lines: 146
Message-ID: <4gailc$c...@news.duke.edu>
NNTP-Posting-Host: biostat2.mc.duke.edu
Keywords: variable selection
I post this every few months. I hope it helps.
Here are SOME of the problems with stepwise variable selection.
1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the
printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted
values that are falsely narrow (see Altman and Andersen, Stat in Med)
4. It yields P-values that do not have the proper meaning and the
proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage
(the coefficients for remaining variables are too large;
see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were
intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see
Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper
Note that 'all possible subsets' regression does not solve any of these
problems.
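
Problem 1 in the list above can be seen in a small simulation. This is a
sketch in Python with NumPy under stated assumptions: the data are pure
noise, and "selection" is a crude stand-in (keep the k predictors most
correlated with the response) rather than a full stepwise routine. The
R-squared computed on the data used for selection is compared with the
R-squared of the same fitted model on fresh data.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients (intercept first) for y on X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2(beta, X, y):
    """R-squared of fixed coefficients beta evaluated on (X, y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n, p, k, reps = 50, 20, 3, 200     # arbitrary sizes chosen for illustration
apparent, fresh = [], []
for _ in range(reps):
    X, y = rng.normal(size=(n, p)), rng.normal(size=n)   # y unrelated to X
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    keep = np.argsort(corr)[-k:]                         # the k "best" predictors
    beta = fit(X[:, keep], y)
    apparent.append(r2(beta, X[:, keep], y))
    X_new, y_new = rng.normal(size=(n, p)), rng.normal(size=n)
    fresh.append(r2(beta, X_new[:, keep], y_new))

mean_apparent, mean_fresh = float(np.mean(apparent)), float(np.mean(fresh))
print(f"mean R^2 on the data used for selection: {mean_apparent:.3f}")
print(f"mean R^2 of the same model on new data:  {mean_fresh:.3f}")
```

The true R-squared here is zero by construction, so any positive value
reported on the selection data is selection bias.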
References
----------
@article{alt89,
author = "Altman, D. G. and Andersen, P. K.",
journal = "Statistics in Medicine",
pages = "771-783",
title = "Bootstrap investigation of the stability of a {C}ox
regression model",
volume = "8",
year = "1989",
note = "Shows that stepwise methods yield confidence limits that are far too narrow."
}
@article{der92bac,
author = {Derksen, S. and Keselman, H. J.},
journal = {British Journal of Mathematical and Statistical Psychology},
pages = {265-282},
title = {Backward, forward and stepwise automated subset selection algorithms: {F}requency of obtaining authentic and noise variables},
volume = {45},
year = {1992},
annote = {variable selection},
note = {Conclusions:
``The degree of correlation between the predictor variables affected
the frequency with which authentic predictor variables found their way
into the final model.
The number of candidate predictor variables affected the number of
noise variables that gained entry to the model.
The size of the sample was of little practical importance in
determining the number of authentic variables contained in the final
model.
The population multiple coefficient of determination could be
faithfully estimated by adopting a statistic that is adjusted by
the total number of candidate predictor variables rather than the
number of variables in the final model''.}
}
@article{roe91pre,
author = {Roecker, Ellen B.},
journal = {Technometrics},
pages = {459-468},
title = {Prediction error and its estimation for subset--selected models},
volume = {33},
year = {1991},
note = {Shows that all-possible regression can yield models that are "too small".}
}
@article{man70why,
author = {Mantel, Nathan},
journal = {Technometrics},
pages = {621-625},
title = {Why stepdown procedures in variable selection},
volume = {12},
year = {1970},
annote = {variable selection; collinearity}
}
@article{hur90,
author = "Hurvich, C. M. and Tsai, C. L.",
journal = "American Statistician",
pages = "214-217",
title = "The impact of model selection on inference in linear regression",
volume = "44",
year = "1990"
}
@article{cop83reg,
author = {Copas, J. B.},
journal = "Journal of the Royal Statistical Society B",
pages = {311-354},
title = {Regression, prediction and shrinkage (with discussion)},
volume = {45},
year = {1983},
annote = {shrinkage; validation; logistic model},
note = {Shows why the number of CANDIDATE variables and not the number in the
final model is the number of d.f. to consider.}
}
@article{tib96reg,
author = {Tibshirani, Robert},
journal = "Journal of the Royal Statistical Society B",
pages = {267-288},
title = {Regression shrinkage and selection via the lasso},
volume = {58},
year = {1996},
annote = {shrinkage; variable selection; penalized MLE; ridge regression}
}
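
The notes on Derksen and Keselman and on Copas both say to count the
CANDIDATE variables, not the ones that survive selection. A small sketch
in Python of what that does to an adjusted R-squared, assuming the usual
textbook adjustment formula (the papers' exact statistics are not
reproduced here, and the numbers are invented for illustration):

```python
def adjusted_r2(r2, n, n_predictors):
    """Usual adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - n_predictors - 1 <= 0:
        raise ValueError("need n > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical situation: 50 cases, 20 candidate predictors screened,
# 3 kept in the final model, which reports R^2 = 0.30.
n, r2 = 50, 0.30
n_candidates, n_selected = 20, 3

naive = adjusted_r2(r2, n, n_selected)      # pretends only 3 variables were ever tried
honest = adjusted_r2(r2, n, n_candidates)   # charges for all 20 candidates

print(f"adjusted for the 3 selected variables:   {naive:.3f}")
print(f"adjusted for the 20 candidate variables: {honest:.3f}")
```

With these made-up numbers, the adjustment based on the final model still
looks respectable, while the one based on the candidate count drops below
zero.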