Oct 27, 2020, 7:40:15 PM10/27/20

to StatForLing with R

I'm not sure if I should use Variance Inflation Factors (VIF) or Pearson correlation tests to detect and remove collinear predictor variables before running a final linear regression model.

In the following reproducible example, the VIFs are well below the rule-of-thumb strict thresholds of 4 (given by the creators of the olsrr package here) and 5 (mentioned by Levshina 2015, p. 160). However, Pearson correlation tests indicate a significant correlation between some of the predictor variables.

Which should I believe?

<code>
# get some randomized data
set.seed(123)
n <- 10000
aa <- runif(n)
bb <- runif(n)
cc <- runif(n)
dd <- runif(n)
ee <- runif(n)

# make bb correlate with cc and, to a lesser extent, with dd
for (i in 1:n) {
  if (i %% 3 == 0) {
    bb[i] <- cc[i]
    next
  }
  if (i %% 5 == 0) {
    bb[i] <- dd[i]
  }
}

# fit linear regression model
df <- data.frame(aa, bb, cc, dd, ee)
m1 <- lm(aa ~ bb + cc + dd + ee, data = df)

# get Variance Inflation Factors from the car package
car::vif(m1)

# get correlation coefficients
with(df, cor.test(bb, cc, method = "pearson")) # significantly correlated
with(df, cor.test(bb, dd, method = "pearson")) # significantly correlated
with(df, cor.test(bb, ee, method = "pearson")) # not significant
</code>

Oct 27, 2020, 7:44:52 PM10/27/20

to StatForLing with R

VIF, definitely -- not pairwise correlations. From an awesome forthcoming 3rd edition of a stats book I happen to know of:

How does one detect multicollinearity? A first step can be looking at pairwise correlations of predictors, but that’s not even close to enough, as I have written in many a review: High pairwise correlations between predictors are a sufficient condition for multicollinearity, but not a necessary one. Thus, it is not. ever. enough, period. One better diagnostic is a statistic called variance inflation factors (VIFs). [... example, how to compute them from parts of a model ...] Now, what VIFs measure is [...] that’s multicollinearity. And this should explain to you why doing only pairwise correlations as a multicollinearity diagnostic is nothing short of futile. Briefly and polemically: where’s the *multi* in *pair*wise correlations? More usefully: Imagine you have a model with 10 numeric predictors. Then, the pairwise correlation tester checks whether predictor 1 is collinear by checking it against the 9 other predictors: 1 & 2, 1 & 3, ..., 1 & 10. But maybe predictor 1 isn’t predictable by any one other predictor, but by the combination of predictors 2, 4, 5, 8, and 9? Or maybe one level of a categorical predictor is highly predictive of something, which might be missed by checking the correlation of that categorical predictor with all its levels at the same time. The pairwise approach alone really doesn’t do much: if you’re worried about collinearity, great, I applaud that! But if you then only check for it with pairwise correlations, consider your study an automatic revise and resubmit, because then, by definition, the reader won’t know how reliable your model is [...]
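To make the elided computation concrete: a predictor's VIF is just 1 / (1 - R²), where the R² comes from regressing that predictor on all the other predictors. A minimal sketch with simulated data and made-up variable names (not the book's example):

<code>
# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing
# predictor j on all the remaining predictors
set.seed(1)
d <- data.frame(x1 = runif(500), x2 = runif(500), x3 = runif(500))
r2 <- summary(lm(x1 ~ x2 + x3, data = d))$r.squared
1 / (1 - r2)  # the VIF for x1 in a model containing x1, x2, and x3
</code>

Because the R² uses *all* the other predictors at once, the VIF catches exactly the "predictable from a combination" case that pairwise correlations miss.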


In your case, Earl, the fact that the correlations are significant doesn't matter: those p-values are driven largely by your sample size of 10,000 -- what counts for the VIF is something else.
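To see the "combination of predictors" point in code: here is a sketch (simulated data, invented variable names) in which no single pairwise correlation looks alarming, yet one predictor's VIF is enormous because it is nearly a linear combination of five others:

<code>
# x1 is (almost) the sum of five other predictors plus a little noise
set.seed(42)
n  <- 1000
x2 <- runif(n); x3 <- runif(n); x4 <- runif(n); x5 <- runif(n); x6 <- runif(n)
x1 <- x2 + x3 + x4 + x5 + x6 + rnorm(n, sd = 0.1)
y  <- runif(n)
m  <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6)

# largest absolute pairwise correlation among the predictors: modest
max(abs(cor(cbind(x1, x2, x3, x4, x5, x6))[lower.tri(diag(6))]))

# but the VIF for x1 is far above any rule-of-thumb threshold
car::vif(m)
</code>

A pairwise screen would pass every one of these predictors; the VIF flags x1 immediately.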

Oct 27, 2020, 8:06:59 PM10/27/20

to StatForLing with R

Thanks for the clarification, and for the preview of that wonderful forthcoming 3rd edition of a stats book. 😉

Oct 27, 2020, 8:08:27 PM10/27/20

to StatForLing with R

:-) and you're welcome
