
Sep 3, 2007, 1:41:58 PM

to MedS...@googlegroups.com

Hi Folks,

I'm being asked some questions about "identifying the most important
variable" in a multiple regression. This is partly because I'm in a
muddle about what "most important" is supposed to mean -- if I were
clear about that, then I could work it out!

There seems to be a variety of interpretations and usages of "most
important variable" in the literature.

It's partly also because of the variety of procedures which are
employed for "identifying" it, e.g.:

a) the variable with the smallest P-value;

b) the variable which makes the most change to R^2 when it is left
   out of the model;

c) the variable which produces the smallest residuals when the
   variables are fitted singly (i.e. by a simple regression on one
   variable at a time);

...

I can produce counter-examples to common sense for most
interpretations. For example, suppose variables X and Y influence an
outcome Z linearly (and positively). Variable Y has a higher mean
level in the population than X (so the Z level in the population is
mainly contributed to by Y). Variable Y is also more potent than X:
its (positive) coefficient is greater than X's (so changes in Y
produce bigger changes in Z than do changes in X). Yet, on the
"change in R^2" criterion, variable X is "more important" than
variable Y (because the population standard deviation of X is much
greater than that of Y, so X accounts for most of the variation in Z).

Not forgetting that when you adopt a "leave-one-out" approach, and
you have covariates which are seriously correlated with each other,
then variables will have "importance" in common.

And not forgetting, either, that when you are looking at the results
of a multiple regression for this purpose, the coefficients,
P-values, and what-not will in general depend on the order of terms
in the model, and also on the system of contrasts used to generate
the coefficient estimates.

One question I have is the following. The prevalence of "most
important variable" in the bio/med/epi literature and discourse
suggests that it is a "common currency" term in that community. But I
have yet to come across a definitive statement of what it is about.
Is there, therefore, a common and uniform understanding in that
community of what it is? If so, what -- in formal and precise terms
-- is that understanding?

Another question: what authoritative literature might one consult
for a good discussion of "importance"?

Frankly, this strikes me as a very murky area, and I would be
reluctant to "bless" any particular approach or interpretation
unless I was sure of what was meant.

With thanks, and best wishes to all,

Ted.

--------------------------------------------------------------------

E-Mail: (Ted Harding) <ted.h...@nessie.mcc.ac.uk>

Fax-to-email: +44 (0)870 094 0861

Date: 03-Sep-07 Time: 18:41:03

------------------------------ XFMail ------------------------------

Sep 3, 2007, 2:01:21 PM

to MedS...@googlegroups.com

You can find an annotated bibliography of the literature into the
late 1990's at:

http://www.nuff.ox.ac.uk/sociology/alcd/relimp.pdf

Rich Goldstein

Sep 3, 2007, 2:23:03 PM

to MedS...@googlegroups.com

On 03-Sep-07 18:01:21, Richard Goldstein wrote:
> You can find an annotated bibliography of the literature
> into the late 1990's at:
>
> http://www.nuff.ox.ac.uk/sociology/alcd/relimp.pdf
>
> Rich Goldstein

Thanks very much indeed for this comprehensive-looking review!
Having looked through it, I feel comforted -- especially by the
final item, where the quotation of E.J. Williams' comment:

   Dismissive of variable importance assessments through
   partitioning of effects, except when variables are orthogonal.

   "In general the only realistic interpretation of a
   regression relation is that the dependent variable
   is subject to the combined effect of a number of
   variables. Any attempt to take the interpretation
   further [by partitioning] can lead only to
   misinterpretation and confusion."

very precisely matches the view I seem to be approaching myself
(in particular regarding the orthogonality aspect)!

Best wishes,

Ted.

Date: 03-Sep-07 Time: 19:23:00

------------------------------ XFMail ------------------------------

Sep 3, 2007, 11:36:28 PM

to MedStats

<Begin sermon>

There are several properties of regression that are directly
relevant to questions of the relative "importance" or "impact" of
predictors but are widely misunderstood:

1. First and foremost, all the variables that truly matter must be
present in the regression equation. If any important variables are
omitted then the results can be misleading, unless all the omitted
variables are uncorrelated with all the included variables. There is
no point in attempting to discover the relative importance of some
predictors for which you have data unless you already know that
these are the only predictors that matter.

2. R^2 can be partitioned into components representing the unique
contribution of each predictor only when all the predictors are
mutually uncorrelated. The problem is not partitioning R^2 -- that
can always be done. The problem is that the results do not always
represent the unique contribution of each predictor. However
intuitively straightforward the notion of unique contributions may
seem, there is no mathematical definition that is entirely
satisfactory when the predictors are correlated.
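[Editor's illustration, not part of the original message.] Point 2 can be demonstrated with a small numpy sketch of my own (invented data): with uncorrelated predictors, the squared simple correlations sum to R^2; make the predictors correlated and the "partition" no longer adds up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def r2(design, outcome):
    """R^2 from an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(outcome))] + design)
    beta, *_ = np.linalg.lstsq(A, outcome, rcond=None)
    resid = outcome - A @ beta
    return 1 - resid.var() / outcome.var()

# Case 1: uncorrelated predictors -- R^2 partitions (up to sampling
# noise) into the sum of the squared simple correlations.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)
parts = sum(np.corrcoef(x, y)[0, 1] ** 2 for x in (x1, x2))
print(abs(r2([x1, x2], y) - parts) < 0.01)   # True

# Case 2: correlated predictors -- the squared simple correlations
# no longer sum to R^2 (here they overshoot it badly).
x2c = x1 + 0.5 * rng.normal(size=n)          # strongly correlated with x1
yc = x1 + x2c + rng.normal(size=n)
parts_c = sum(np.corrcoef(x, yc)[0, 1] ** 2 for x in (x1, x2c))
print(parts_c - r2([x1, x2c], yc) > 0.1)     # True
```

In the second case the shared variance of x1 and x2c is counted twice, which is exactly why no partition of R^2 can represent "unique contributions" once predictors are correlated.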

3. Importance ratings obtained by comparing semipartial correlations
or changes in R^2 (i.e., squared semipartials) depend on the joint
distribution of the predictors. Contrary to what is often implicit
in the importance question, the results are not inherent properties
of the variables alone, but joint properties of the variables and
the particular multivariate distribution they happen to have. This
is especially important when the distribution of the predictors is
an artifact -- either a direct artifact, because the investigator
set the values of the predictors; or an indirect artifact, because
the investigator selected cases or sampled nonrandomly. And even if
the sample distribution is a valid estimate of some "natural"
population distribution, if the population distribution changes then
the true semipartials can also change, even though the mechanism
relating the predictors to the outcome variable has not changed.
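[Editor's illustration, not part of the original message.] Point 3 can also be sketched numerically (my own construction, invented data): the outcome is generated by the identical mechanism z = x + y + noise in two "populations", but x's change-in-R^2 "importance" differs because the joint distribution of the predictors differs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def r2(design, outcome):
    """R^2 from an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(outcome))] + design)
    beta, *_ = np.linalg.lstsq(A, outcome, rcond=None)
    return 1 - (outcome - A @ beta).var() / outcome.var()

def delta_r2_x(x, y_pred, noise):
    """Change in R^2 when x is dropped, with z generated by the
    SAME mechanism z = x + y + noise in both populations."""
    z = x + y_pred + noise
    return r2([x, y_pred], z) - r2([y_pred], z)

noise = rng.normal(size=n)
y_pred = rng.normal(size=n)

# Population A: x independent of y (both unit SD).
xa = rng.normal(size=n)
# Population B: same outcome mechanism, same SD for x, but x is now
# strongly correlated with y (r = 0.9).
xb = 0.9 * y_pred + np.sqrt(1 - 0.81) * rng.normal(size=n)

# x's "importance" by the change-in-R^2 criterion differs sharply
# between the populations although the mechanism is identical.
print(delta_r2_x(xa, y_pred, noise) - delta_r2_x(xb, y_pred, noise) > 0.1)   # True
```

Here dropping x costs about a third of R^2 in population A but only a few percent in population B, so the squared semipartial is a property of the predictor distribution, not of the mechanism.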

4. If the predictors are in the same units (possibly after
data-independent unit-equating transformations, which excludes
sample-specific standardization), then comparing the raw-score
regression weights can lead to conclusions of relative importance
that are inherent properties of the variables alone. However, the
definition of importance that is implicit in comparisons of the
regression weights is the expected change in the outcome variable
for a unit change in the predictor in question, with all other
predictors held constant. In some situations this definition may be
appropriate; in others, not.

So where does this leave us? It may often mean that questions about
importance can not be answered -- at least not in the sense that
they were asked. Unfortunately, regression seems to have been "sold"
to many as a way to answer all such questions. It can't.

<End sermon>

On Sep 3, 10:41 am, (Ted Harding) <ted.hard...@nessie.mcc.ac.uk> wrote:
> E-Mail: (Ted Harding) <ted.hard...@nessie.mcc.ac.uk>

Sep 4, 2007, 3:16:19 AM

to MedS...@googlegroups.com

Hi Ray,

Many thanks for the sermon. I did not fall asleep!

I quite agree with the technical points you make. These were my
concerns too, in the face of the general (and apparently not very
well specified) question of "importance".

On a couple of points in your reply:

On 04-Sep-07 03:36:28, Ray Koopman wrote:
> There are several properties of regression that are directly
> relevant to questions of the relative "importance" or "impact"
> of predictors but are widely misunderstood:

So it would seem!

> [...]
> 4. [...] However, the definition
> of importance that is implicit in comparisons of the regression
> weights is the expected change in the outcome variable for a unit
> change in the predictor in question, with all other predictors held
> constant. In some situations this definition may be appropriate; in
> others, not.

Indeed. And, for any particular procedure for "identifying the most
important variable", there is an implicit definition in terms of the
properties of the procedure. But these properties depend on the
features of the sample, not to mention how the sample was obtained
(as you also point out).

So, if someone is seeking to "identify the most important variable",
the procedure (if any) should depend on what purpose they have in
seeking this. Hence my query (as a relative outside observer of the
med/epi community) about whether there is a common understanding and
perception of what it means and what purpose is served.

In short: Why (in general) should someone want to do this?

> So where does this leave us? It may often mean that questions about
> importance can not be answered -- at least not in the sense that
> they were asked. Unfortunately, regression seems to have been "sold"
> to many as a way to answer all such questions. It can't.

This, I fear, often seems to be the case. Indeed, I wonder whether
"the sense in which they were asked" has a definite existence, and
whether the reason the question is asked is that it is "the done
thing" to ask it. In short -- is the question understood?

> <End sermon>

Thanks, and best wishes,

Ted.

--------------------------------------------------------------------

E-Mail: (Ted Harding) <Ted.H...@manchester.ac.uk>

Fax-to-email: +44 (0)870 094 0861

Date: 04-Sep-07 Time: 08:15:58

------------------------------ XFMail ------------------------------

Sep 4, 2007, 5:33:16 AM

to MedS...@googlegroups.com

I'd like to let people know of the following very helpful website
and reference:

[1] http://www.tfh-berlin.de/~groemp/rpack.html
    (Ulrike Grömping's page on "relative importance"
    and her "relaimpo" package for R)

[2] http://www.tfh-berlin.de/~groemp/downloads/amstat07mayp139.pdf
    (her article "Estimators of Relative Importance in Linear
    Regression Based on Variance Decomposition" in The American
    Statistician of May 2007)

There is a pointer to a follow-up to [2] in [1].

Best wishes to all,

Ted.

--------------------------------------------------------------------

E-Mail: (Ted Harding) <ted.h...@nessie.mcc.ac.uk>

Fax-to-email: +44 (0)870 094 0861

Date: 04-Sep-07 Time: 10:33:13

------------------------------ XFMail ------------------------------
