Which method for detecting (multi)collinearity should be followed?

ekbrown77

Oct 27, 2020, 7:40:15 PM
to StatForLing with R
I'm not sure if I should use Variance Inflation Factors (VIF) or Pearson correlation tests to detect and remove collinear predictor variables before running a final linear regression model.

In the following reproducible example, the VIFs are well below the strict rule-of-thumb thresholds of 4 (given by the creators of the olsrr package) and 5 (mentioned by Levshina 2015, p. 160). However, Pearson correlation tests indicate significant correlations between some of the predictor variables.

Which should I believe?

<code>
# get some randomized data
set.seed(123)
n <- 10000
aa <- runif(n)
bb <- runif(n)
cc <- runif(n)
dd <- runif(n)
ee <- runif(n)

# make bb correlate with cc and, to a lesser extent, with dd
for (i in 1:n) {
  if (i %% 3 == 0) {
    bb[i] <- cc[i]
    next
  }
  if (i %% 5 == 0) {
    bb[i] <- dd[i]
  }
}

# fit linear regression model
df <- data.frame(aa, bb, cc, dd, ee)
m1 <- lm(aa ~ bb + cc + dd + ee, data = df)

# get Variance Inflation Factors from car package
car::vif(m1)

# get correlation coefficients
with(df, cor.test(bb, cc, method = "pearson"))  # significantly correlated
with(df, cor.test(bb, dd, method = "pearson"))  # significantly correlated
with(df, cor.test(bb, ee, method = "pearson"))  # not significant
</code>
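(For context on what the diagnostic is actually computing: a predictor's VIF is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the remaining predictors. A minimal sketch reproducing the value for bb by hand, using the same simulated data as above:)

```r
# same data as in the example above
set.seed(123)
n <- 10000
aa <- runif(n); bb <- runif(n); cc <- runif(n); dd <- runif(n); ee <- runif(n)
for (i in 1:n) {
  if (i %% 3 == 0) { bb[i] <- cc[i]; next }
  if (i %% 5 == 0) bb[i] <- dd[i]
}
df <- data.frame(aa, bb, cc, dd, ee)

# VIF of bb from its definition: regress bb on the remaining predictors
r2_bb  <- summary(lm(bb ~ cc + dd + ee, data = df))$r.squared
vif_bb <- 1 / (1 - r2_bb)  # should match car::vif() on the full model
vif_bb
```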

Stefan Th. Gries

Oct 27, 2020, 7:44:52 PM
to StatForLing with R
VIF, definitely -- not pairwise correlations. From an awesome forthcoming 3rd edition of a stats book I happen to know of:

How does one detect multicollinearity? A first step can be looking at pairwise correlations of predictors, but
that’s not even close to enough, as I have written in many a review: High pairwise correlations between
predictors are a sufficient condition for multicollinearity, but not a necessary one. Thus, it is not ever
enough, period. One better diagnostic is a statistic called variance inflation factors (VIFs). [... example, how to compute them from parts of a model, blah ...]  Now, what VIFs measure is [...] that’s multicollinearity. And this
should explain to you why doing only pairwise correlations as a multicollinearity diagnostic is nothing short of
futile. Briefly and polemically: where’s the multi in pairwise correlations? More usefully: Imagine you have a
model with 10 numeric predictors. Then, the pairwise correlation tester checks whether predictor 1 is collinear
by checking it against the 9 other predictors: 1 & 2, 1 & 3, ..., 1 & 10. But maybe predictor 1 isn’t predictable
by one other predictor, but by the combination of predictors 2, 4, 5, 8, and 9? Or maybe one level of a
categorical predictor is highly predictive of something, which might be missed by checking the correlation of
that categorical predictor with all its levels at the same time. The pairwise approach alone really doesn’t do
much: if you’re worried about collinearity, great, I applaud that! But if you then only check for it with pairwise
correlations, consider your study an automatic revise and resubmit because then, by definition, the reader
won’t know how reliable your model is [...]
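The "combination of predictors" point in the quote can be made concrete with a small simulation (hypothetical variable names x1..x5): no single pairwise correlation with x5 looks alarming, yet x5 is almost a perfect linear combination of the other four, so its VIF is enormous.

```r
# collinearity that pairwise correlations miss
set.seed(321)
n  <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
# x5 is (nearly) a linear combination of the four others
x5 <- x1 + x2 + x3 + x4 + rnorm(n, sd = 0.2)

# every pairwise correlation involving x5 is only moderate (around 0.5)
round(cor(cbind(x1, x2, x3, x4, x5)), 2)

# but x5 is almost perfectly predicted by the other four together,
# so its VIF = 1 / (1 - R^2) is far above any rule-of-thumb threshold
r2 <- summary(lm(x5 ~ x1 + x2 + x3 + x4))$r.squared
1 / (1 - r2)
```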

In your case, Earl, the fact that the correlations are significant doesn't matter: with n = 10000, even a tiny correlation comes out significant, so that p-value largely reflects your sample size -- what VIF measures is something else.
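A quick sketch of that sample-size point (simulated data, hypothetical names): build in a deliberately weak association, and with n = 10000 the correlation test still comes out significant even though r itself is trivially small.

```r
# significance of r is driven by n, not by the strength of the association
set.seed(1)
n <- 10000
x <- runif(n)
y <- 0.05 * x + runif(n)  # very weak association by construction

ct <- cor.test(x, y, method = "pearson")
ct$estimate  # r is tiny (around 0.05)
ct$p.value   # yet well below 0.05, purely because n is large
```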

ekbrown77

Oct 27, 2020, 8:06:59 PM
to StatForLing with R
Thanks for the clarification, and for the preview of that wonderful forthcoming 3rd edition of a stats book. 😉