The "most important variable" ??


Ted Harding

Sep 3, 2007, 1:41:58 PM
to MedS...@googlegroups.com
Hi Folks,

I'm being asked some questions about "identifying
the most important variable" in a multiple regression.

This is partly because I'm in a muddle about what
"most important" is supposed to mean -- If I were
clear about that, then I could work it out!
There seems to be a variety of interpretations
and usages of "most important variable" in the
literature.

It's partly also because of the variety of procedures
which are employed for "identifying" it. E.g.
a) The variable with the smallest P-value
b) The variable which makes the most change to R^2
when it is left out of the model
c) The variable which produces the smallest residuals
when the variables are fitted singly (i.e. by
a simple regression on one variable at a time)
...

I can produce counter-examples to common sense for
most interpretations.

For example, variables X and Y influence an outcome Z
linearly (and positively).

Variable Y has a higher mean level in the population
than X (so the Z level in the population is mainly
contributed to by Y).

Variable Y is more potent than X: its (positive)
coefficient is greater than X's (so changes in Y
produce bigger changes in Z than do changes in X).

Yet, on the "change in R^2" criterion, Variable X
is "more important" than Variable Y (because the
population Standard Deviation of X is much greater
than that of Y, so it accounts for most of the
variation in Z).
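To make that counter-example concrete, here is a minimal simulation (a Python/numpy sketch with made-up numbers, not part of the original discussion): Y's coefficient is three times X's, yet leaving X out of the model costs roughly ten times as much R^2, simply because X's standard deviation is ten times Y's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Y is "more potent" (coefficient 3 vs. 1) and has the higher mean,
# but X has the much larger standard deviation.
X = rng.normal(0.0, 10.0, n)
Y = rng.normal(5.0, 1.0, n)
Z = 1.0 * X + 3.0 * Y + rng.normal(0.0, 1.0, n)

def fit_r2(cols, z):
    """R^2 of an OLS fit of z on the given predictor columns plus an intercept."""
    A = np.column_stack([np.ones(len(z))] + list(cols))
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return 1.0 - (z - A @ coef).var() / z.var()

r2_full = fit_r2([X, Y], Z)
drop_x = r2_full - fit_r2([Y], Z)  # change in R^2 when X is left out (~0.91)
drop_y = r2_full - fit_r2([X], Z)  # change in R^2 when Y is left out (~0.08)
```

On the "change in R^2" criterion X wins by a wide margin, even though a unit change in Y moves Z three times as far as a unit change in X.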

Not forgetting that when you adopt a "leave-one-out"
approach, and you have covariates which are seriously
correlated with each other, then variables will have
"importance" in common.

And not forgetting, either, that when you are looking
at the results of a multiple regression for this purpose,
the coefficients, P-values, and what-not will depend
in general on the order of terms in the model, and also
on the system of contrasts used to generate the coefficient
estimates.

One question I have is the following. The prevalence of
"most important variable" in the bio/med/epi literature
and discourse suggests that it is a "common currency"
term in that community. But I have yet to come across
a definitive statement of what it is about. Is there,
therefore, a common and uniform understanding in that
community of what it is?

If so, what -- in formal and precise terms -- is that
understanding?

Another question: What authoritative literature might
one consult for a good discussion of "importance"?

Frankly, this strikes me as a very murky area, and
I would be reluctant to "bless" any particular approach
or interpretation unless I was sure of what was meant.

With thanks, and best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.h...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Sep-07 Time: 18:41:03
------------------------------ XFMail ------------------------------

Richard Goldstein

Sep 3, 2007, 2:01:21 PM
to MedS...@googlegroups.com
You can find an annotated bibliography of the literature
into the late 1990's at:

http://www.nuff.ox.ac.uk/sociology/alcd/relimp.pdf

Rich Goldstein

Ted Harding

Sep 3, 2007, 2:23:03 PM
to MedS...@googlegroups.com
On 03-Sep-07 18:01:21, Richard Goldstein wrote:
>
> You can find an annotated bibliography of the literature
> into the late 1990's at:
>
> http://www.nuff.ox.ac.uk/sociology/alcd/relimp.pdf
>
> Rich Goldstein

Thanks very much indeed for this comprehensive-looking review!
Having looked through it, I feel comforted -- especially by
the final item where the quotation of E.J. Williams' comment:

Dismissive of variable importance assessments through
partitioning of effects, except when variables are orthogonal.
"In general the only realistic interpretation of a
regression relation is that the dependent variable
is subject to the combined effect of a number of
variables. Any attempt to take the interpretation
further [by partitioning] can lead only to
misinterpretation and confusion."

very precisely matches the view I seem to be approaching myself
(in particular regarding the orthogonality aspect)!

Best wishes,
Ted.


Ray Koopman

Sep 3, 2007, 11:36:28 PM
to MedStats
<Begin sermon>

There are several properties of regression that are directly
relevant to questions of the relative "importance" or "impact"
of predictors but are widely misunderstood:

1. First and foremost, all the variables that truly matter must be
present in the regression equation. If any important variables are
omitted then the results can be misleading unless all the omitted
variables are uncorrelated with all the included variables. There
is no point in attempting to discover the relative importance of
some predictors for which you have data unless you already know
that these are the only predictors that matter.

2. R^2 can be partitioned into components representing the unique
contribution of each predictor only when all the predictors are
mutually uncorrelated. The problem is not partitioning R^2 -- that
can always be done. The problem is that the results do not always
represent the unique contribution of each predictor. However
intuitively straightforward the notion of unique contributions
may seem, there is no mathematical definition that is entirely
satisfactory when the predictors are correlated.
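Point 2 is easy to verify numerically. In this sketch (Python, artificial data; not from the post itself) the squared simple correlations of two independent predictors sum to the full-model R^2, but once the predictors are correlated the "parts" sum to more than the whole:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

def fit_r2(cols, z):
    """R^2 of an OLS fit of z on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(z))] + list(cols))
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return 1.0 - (z - A @ coef).var() / z.var()

def sum_of_parts(x1, x2, z):
    """Sum of squared simple correlations, alongside the full-model R^2."""
    parts = sum(np.corrcoef(x, z)[0, 1] ** 2 for x in (x1, x2))
    return parts, fit_r2([x1, x2], z)

# Uncorrelated predictors: the parts add up to R^2.
x1, u = rng.normal(size=n), rng.normal(size=n)
z = x1 + u + rng.normal(size=n)
parts_orth, r2_orth = sum_of_parts(x1, u, z)

# Correlated predictors (x2 shares variance with x1): they don't --
# here the "parts" sum to well over the full R^2.
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
zc = x1 + x2 + rng.normal(size=n)
parts_corr, r2_corr = sum_of_parts(x1, x2, zc)
```

As Ray says, the partition can always be computed; it just stops meaning "unique contribution" the moment the predictors are correlated.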

3. Importance ratings obtained by comparing semipartial correlations
or changes in R^2 (i.e., squared semipartials) depend on the joint
distribution of the predictors. Contrary to what is often implicit in
the importance question, the results are not inherent properties of
the variables alone, but joint properties of the variables and the
particular multivariate distribution they happen to have. This is
especially important when the distribution of the predictors is an
artifact -- either a direct artifact, because the investigator set
the values of the predictors; or an indirect artifact, because the
investigator selected cases or sampled nonrandomly. And even if the
sample distribution is a valid estimate of some "natural" population
distribution, if the population distribution changes then the true
semipartials can also change, even though the mechanism relating the
predictors to the outcome variable has not changed.
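A sketch of point 3 (again Python, artificial data): the data-generating equation is held fixed at z = x1 + x2 + noise, but range-restricting x1 shrinks its squared semipartial, so its apparent "importance" changes even though the mechanism has not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

def fit_r2(cols, z):
    """R^2 of an OLS fit of z on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(z))] + list(cols))
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return 1.0 - (z - A @ coef).var() / z.var()

def sq_semipartial_x1(x1, x2, z):
    """Squared semipartial of x1: the change in R^2 when x1 is added last."""
    return fit_r2([x1, x2], z) - fit_r2([x2], z)

def simulate(sd_x1):
    # The mechanism is identical every time: z = 1*x1 + 1*x2 + noise.
    # Only the spread of x1 in the sample changes.
    x1 = rng.normal(0.0, sd_x1, n)
    x2 = rng.normal(0.0, 1.0, n)
    z = x1 + x2 + rng.normal(0.0, 1.0, n)
    return sq_semipartial_x1(x1, x2, z)

wide = simulate(1.0)    # x1 sampled over its full range: squared semipartial ~1/3
narrow = simulate(0.3)  # x1 range-restricted: shrinks to ~0.04
```

Select cases so that x1 barely varies and its semipartial all but vanishes -- with the regression coefficients unchanged throughout.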

4. If the predictors are in the same units (possibly after data-
independent unit-equating transformations, which excludes sample-
specific standardization), then comparing the raw-score regression
weights can lead to conclusions of relative importance that are
inherent properties of the variables alone. However, the definition
of importance that is implicit in comparisons of the regression
weights is the expected change in the outcome variable for a unit
change in the predictor in question, with all other predictors held
constant. In some situations this definition may be appropriate; in
others, not.

So where does this leave us? It may often mean that questions about
importance can not be answered -- at least not in the sense that
they were asked. Unfortunately, regression seems to have been "sold"
to many as a way to answer all such questions. It can't.

<End sermon>


Ted Harding

Sep 4, 2007, 3:16:19 AM
to MedS...@googlegroups.com
Hi Ray,
Many thanks for the sermon. I did not fall asleep!

I quite agree with the technical points you make.
These were my concerns too, in the face of the general
(and apparently not very well specified) question of
"importance".

On a couple of points in your reply:

On 04-Sep-07 03:36:28, Ray Koopman wrote:
>
> <Begin sermon>
>
> There are several properties of regression that are directly
> relevant to questions of the relative "importance" or "impact"
> of predictors but are widely misunderstood:

So it would seem!

> [...]

> 4. [...] However, the definition
> of importance that is implicit in comparisons of the regression
> weights is the expected change in the outcome variable for a unit
> change in the predictor in question, with all other predictors held
> constant. In some situations this definition may be appropriate; in
> others, not.

Indeed. And, for any particular procedure for "identifying the
most important variable", there is an implicit definition in
terms of the properties of the procedure. But these properties
depend on the features of the sample, not to mention how the
sample was obtained (as you also point out).

So, if someone is seeking to "identify the most important
variable", the procedure (if any) should depend on what purpose
they have in seeking this. Hence my query (as a relative
outside observer of the med/epi community) about whether there
is a common understanding and perception of what it means and
what purpose is served.

In short: Why (in general) should someone want to do this?

> So where does this leave us? It may often mean that questions about
> importance can not be answered -- at least not in the sense that
> they were asked. Unfortunately, regression seems to have been "sold"
> to many as a way to answer all such questions. It can't.

This, I fear, seems to be often the case. Indeed, I wonder if
"the sense in which they were asked" has a definite existence.
And whether the reason the question is asked is that it is
"the done thing" to ask it. In short -- is the question understood?

> <End sermon>

Thanks, and best wishes,
Ted.


Ted Harding

Sep 4, 2007, 5:33:16 AM
to MedS...@googlegroups.com
I'd like to let people know of the following very
helpful website and reference:

[1]
http://www.tfh-berlin.de/~groemp/rpack.html

(Ulrike Grömping's page on "relative importance"
and her "relaimpo" package for R)

[2]
http://www.tfh-berlin.de/~groemp/downloads/amstat07mayp139.pdf

(her article "Estimators of Relative Importance
in Linear Regression Based on Variance Decomposition"
in the American Statistician of May 2007).

There is a pointer to a follow-up to [2] in [1].
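For anyone curious what the variance decomposition in [2] amounts to, here is a toy Python sketch of the LMG ("averaging over orderings") estimator that relaimpo implements. It brute-forces all p! orderings, so it is only feasible for a handful of predictors, and is no substitute for the package itself:

```python
from itertools import permutations
import numpy as np

def fit_r2(cols, z):
    """R^2 of an OLS fit of z on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(z))] + list(cols))
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return 1.0 - (z - A @ coef).var() / z.var()

def lmg(X, z):
    """LMG relative-importance shares: for each predictor, the increase
    in R^2 it brings when added to the model, averaged over all orderings
    of the predictors. The shares are nonnegative and sum to the
    full-model R^2."""
    p = X.shape[1]
    shares = np.zeros(p)
    orders = list(permutations(range(p)))
    for order in orders:
        cols, r2_prev = [], 0.0
        for j in order:
            cols.append(X[:, j])
            r2_now = fit_r2(cols, z)
            shares[j] += r2_now - r2_prev
            r2_prev = r2_now
    return shares / len(orders)

# Toy check on correlated, artificial predictors.
rng = np.random.default_rng(3)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
z = 2.0 * x1 + 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)
shares = lmg(X, z)
```

Note that averaging over orderings is one way of resolving the ambiguity discussed earlier in the thread; it does not make the shares "the" unique contributions, it just makes them order-invariant.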

Best wishes to all,
Ted.

