Re: {MEDSTATS} Help with multiple regression

Peter Flom

Jul 2, 2009, 4:26:20 PM
to MedStats
jabs <jaben...@gmail.com> wrote
>Hello folks
>I am a physician who works in Mexico. I would like to predict the
>weight of a fetus before birth through ultrasound measurements. There
>are many studies which have published an equation or a formula in
>order to estimate fetal weight, and the equation has been obtained
>from independent variables (parameters of ultrasound). Unfortunately,
>none of these studies has been done in a Mexican population.
>I have collected the birth weight (dependent variable) of almost 500
>newborns (NB). I have also collected 13 ultrasound measurements
>(independent variables) per fetus in the 48 hours prior to birth
>(prenatal stage). My goal is to find an equation or formula to predict
>the weight of the baby using ultrasound variables (independent
>variables). I have read about this and I think I have to run a linear
>regression in which the dependent variable would be the birth weight,
>and ultrasound variables would be included as independent variables.

So far so good ....

>According to what I have read, I have to choose a backward
>variable-selection method, by which I will obtain a linear model.


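Not good at all. Backward elimination (and other automatic variable selection methods)
is commonly used, but it is unreliable: the coefficients of the selected model are
biased away from zero, and its standard errors and p-values are too small because
they ignore the selection step.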


>The problem is that I have no experience on how to perform this. Even
>so, I have tried to do it using SPSS software and after running
>the regression, in the results window I get a series of outputs such as
>tables (descriptive statistics, correlation, included/deleted
>variables, a summary model, ANOVA, analysis of collinearity, excluded
>variables), and Graphics. What is the right way to run the multiple
>regression? How can I get the model from these data? Which data must
>be included in the equation? Thanks in advance for your help.

You might try asking on an SPSS list, for details of how to do things in SPSS,
but which variables you should use is not dependent on software. If you
are trying to replicate previous results, you should use the same variables.

With 500 newborns, you could use all 13 variables - unless there are collinearity problems.

Or you might want to use something like principal component regression, or partial least squares;
you might be concerned with possible nonlinear effects; there are other possibilities as well.

Peter


Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com

Christian Lerch

Jul 2, 2009, 4:49:08 PM
to MedS...@googlegroups.com
snip-----------------

> With 500 newborns, you could use all 13 variables - unless there are
> collinearity problems.
snip-----------------

Collinearity is very likely.

Start with a correlation matrix of all 13 measurements [Statistics =>
Correlate => Bivariate...]. Correlation coefficients above, say, 0.80
usually show that the inclusion of both variables is not necessary or is
even counterproductive.

Regards,
Christian

Peter Flom wrote:

Bruce Weaver

Jul 2, 2009, 5:43:03 PM
to MedStats
On Jul 2, 4:49 pm, Christian Lerch <t....@gmx.net> wrote:
> snip-----------------
>  > With 500 newborns, you could use all 13 variables - unless there are
> collinearity problems.
> snip-----------------
>
> Collinearity is very likely.
>
> Start with a correlation matrix of all 13 measurements [Statistics =>
> Correlate => Bivariate...]. Correlation coefficients above, say, 0.80
> usually show that the inclusion of both variables is not necessary or is
> even counterproductive.
>
> Regards,
> Christian

Using bivariate correlations to try to assess multicollinearity is not
a very good idea, IMO. First, you can have complete linear dependence
in the absence of any alarming looking bivariate correlations. To
illustrate, try this example that Jerry Dallal posted in sci.stat.math
a couple years ago:

X1   X2   X3    Y
18   88  106   13
72   45  117   43
36   63   99   50
75   26  101   77
22   83  105   23
99   71  170   68
69   53  122    6
 6   49   55   51
86   99  185   37
85   64  149   10
87    7   94   32
93   32  125   69
44   88  132    4
34   34   68   13
84   28  112   18

Check out all of the simple correlations.
Regress Y on X1,X2,X3.
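
For anyone without SPSS to hand, here is a minimal Python sketch of the same check, using only the three predictor columns posted above. The helper name pearson and the determinant check are my own choices for illustration, not part of Dallal's original post.

```python
from math import sqrt

# Predictor columns copied from the data above (Y is not needed for this check).
x1 = [18, 72, 36, 75, 22, 99, 69, 6, 86, 85, 87, 93, 44, 34, 84]
x2 = [88, 45, 63, 26, 83, 71, 53, 49, 99, 64, 7, 32, 88, 34, 28]
x3 = [106, 117, 99, 101, 105, 170, 122, 55, 185, 149, 94, 125, 132, 68, 112]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


# Pairwise (zero-order) correlations among the predictors.
r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
print(f"r(X1,X2)={r12:.3f}  r(X1,X3)={r13:.3f}  r(X2,X3)={r23:.3f}")

# Exact linear dependence: is X3 equal to X1 + X2 in every row?
exact_dependence = all(c == a + b for a, b, c in zip(x1, x2, x3))
print("X3 == X1 + X2 in every row:", exact_dependence)

# Determinant of the 3x3 correlation matrix; it is zero (up to rounding)
# exactly when the predictors are perfectly collinear.
det = 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
print(f"det(correlation matrix) = {det:.2e}")
```

None of the pairwise correlations reaches 0.8 in absolute value, yet X3 is exactly X1 + X2, so the regression of Y on all three predictors has no unique solution. Regression software will typically either drop one predictor or complain about a singular matrix.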

Second, in models that include products or polynomial terms (e.g., a
model with both X and X-squared as predictors), there can be very high
correlations between variables, but no problematic
multicollinearity.
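
One way to see this, as a hypothetical illustration (the values 1 to 10 are made up, not from the thread): X and X-squared are strongly correlated on the raw scale, but centering X first removes the correlation while the intercept, linear and quadratic terms still describe the same family of curves. The helper name pearson is my own.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


# Hypothetical predictor values, chosen only for illustration.
x = list(range(1, 11))
x_sq = [v ** 2 for v in x]

# The same predictor, centered at its mean before squaring.
mean_x = sum(x) / len(x)
xc = [v - mean_x for v in x]
xc_sq = [v ** 2 for v in xc]

r_raw = pearson(x, x_sq)         # high on the raw scale
r_centered = pearson(xc, xc_sq)  # zero here, because xc is symmetric about 0

print(f"corr(X, X^2), raw:      {r_raw:.3f}")
print(f"corr(X, X^2), centered: {r_centered:.3f}")
```

The fitted curve is the same either way, so the high raw correlation reflects where X happens to be located rather than a defect in the model.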

Tolerance and Variance Inflation Factor (which are available in the
SPSS Regression procedure) are better measures of problematic
multicollinearity, I think.
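
As a rough sketch of what those two numbers are (not SPSS's own code): the VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors, and tolerance is 1 / VIF. Equivalently, the VIFs are the diagonal of the inverse of the predictor correlation matrix, which is what the code below uses. X1 and X2 are taken from the data above; x3_near is a hypothetical, slightly perturbed version of X1 + X2 that I made up so that the matrix can be inverted. The function names are my own.

```python
from math import sqrt

x1 = [18, 72, 36, 75, 22, 99, 69, 6, 86, 85, 87, 93, 44, 34, 84]
x2 = [88, 45, 63, 26, 83, 71, 53, 49, 99, 64, 7, 32, 88, 34, 28]
# Hypothetical: X1 + X2 plus a small made-up perturbation (near dependence).
noise = [1, -1, 2, -2, 0] * 3
x3_near = [a + b + e for a, b, e in zip(x1, x2, noise)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[j][j] for j in range(len(columns))]


vif = vifs([x1, x2, x3_near])
tolerance = [1 / v for v in vif]
for name, v, t in zip(["X1", "X2", "X3_near"], vif, tolerance):
    print(f"{name}: VIF = {v:.1f}, tolerance = {t:.5f}")
```

A common rule of thumb treats a VIF above 10 (tolerance below 0.1) as a warning sign. All three predictors here are far beyond that, even though the pairwise correlations look unremarkable.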

For more info, see the Multicollinearity link here:

http://faculty.chass.ncsu.edu/garson/PA765/regress.htm

Regarding Peter's comments on stepwise selection, here is a good
summary of the problems:

http://www.cmh.edu/stats/faq/faq12.asp

--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

SR Millis

Jul 2, 2009, 6:10:13 PM
to MedS...@googlegroups.com

Examining zero order correlations will not necessarily help in detecting high collinearity. The absence of high correlations can't be viewed as evidence of no problem. It's possible for 3 or more variables to be collinear while no 2 of the variables taken alone are highly correlated.

You need to request collinearity diagnostics in linear regression. Then, examine the condition indexes. Identify any that are large, i.e., > 30 (or even > 20). Then, examine the variance-decomposition proportions (VDPs) associated with those large condition indexes. Large VDPs (> 0.50) will identify the variables that are involved in the near dependency.
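
For reference, a sketch of the usual definitions behind these diagnostics, following Belsley, Kuh and Welsch. The notation is an assumption for illustration: X is the design matrix including the intercept column, with each column scaled to unit length, X = U D V^T is its singular value decomposition with singular values \mu_1, ..., \mu_p, and v_{jk} is the (j, k) element of V.

```latex
% Condition index for the k-th singular value
\eta_k = \frac{\mu_{\max}}{\mu_k}, \qquad k = 1, \dots, p

% Variance of the j-th coefficient, decomposed over the singular values
\operatorname{var}(b_j) = \sigma^2 \sum_{k=1}^{p} \frac{v_{jk}^2}{\mu_k^2}

% Variance-decomposition proportion: share of var(b_j) tied to the k-th singular value
\pi_{kj} = \frac{v_{jk}^2 / \mu_k^2}{\sum_{l=1}^{p} v_{jl}^2 / \mu_l^2}
```

Software that reports eigenvalues of the scaled cross-products matrix is reporting the squares of these singular values, so the condition index is then the square root of the ratio of the largest eigenvalue to the k-th one. A near dependency is flagged when a large condition index comes with large proportions for two or more coefficients.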


Scott R Millis, PhD, ABPP (CN,CL,RP), CStat, CSci
Professor & Director of Research
Dept of Physical Medicine & Rehabilitation
Dept of Emergency Medicine
Wayne State University School of Medicine
261 Mack Blvd
Detroit, MI 48201
Email: smi...@med.wayne.edu
Tel: 313-993-8085
Fax: 313-966-7682


--- On Thu, 7/2/09, Christian Lerch <t....@gmx.net> wrote:

Peter Flom

Jul 2, 2009, 6:17:27 PM
to MedS...@googlegroups.com
I wrote

> > With 500 newborns, you could use all 13 variables - unless there are
>collinearity problems.

Christian Lerch <t....@gmx.net> replied

>
>Collinearity is very likely.
>
>Start with a correlation matrix of all 13 measurements [Statistics =>
>Correlate => Bivariate...]. Correlation coefficients above, say, 0.80
>usually show that the inclusion of both variables is not necessary or is
>even counterproductive.
>

Actually, high pairwise correlations are neither necessary nor sufficient for problematic collinearity.

It is much better to use condition indexes.
