# Issue of Multicollinearity


### rona kurian

Jan 28, 2024, 4:06:49 AM

Greetings of the day.

I am a doctoral researcher in the OBHR area. My research is on complexity leadership and acceptance of emergent changes in organizations.

One of my studies for the doctoral thesis includes the use of a combination of dynamic network analysis and response surface methodology, but I am facing the issue of multicollinearity as the independent measures derived from network analysis are related in nature. My independent variables are the clustering coefficient, betweenness centrality, closeness centrality, and average speed. The dependent measure is cognitive demand.

I cannot afford to drop any of the measures from the model. How can I solve the problem? I have come across ridge regression as a possibility, but how can I use ridge regression to test my hypothesis, as it does not provide for the calculation of test statistic and standard error?

I shall be highly obliged if you could kindly suggest a way forward.

Thanks and Regards

Rona Elizabeth Kurian

### Mihovil Bartulović

Jan 29, 2024, 12:45:14 AM
Dear Rona,

I am not sure if I will be able to convey everything regarding how to deal with multicollinearity, but hopefully I will be able to give you some good pointers.

First, you need to look at how "bad" your multicollinearity is. You can go about this in several different ways: Variance Inflation Factor (VIF), Tolerance (the inverse of VIF), Condition Index, eigenvalue analysis, etc.

Let's say you use VIF, as it is by far the most popular method (though not the most comprehensive one; there are methods by Belsley, Kuh and Welsch that go much deeper, as described in their book Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, which is a useful, albeit rather advanced, resource).
VIF quantifies how much the variance of an estimated regression coefficient is inflated by correlation among your predictors. As a rule of thumb, a VIF greater than 5 indicates a worrisome amount of collinearity.
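As a rough sketch of the VIF check in Python/NumPy (assuming your four network measures are the columns of an array `X`; the function name `vif` is my own, not from any particular package):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X.

    Each predictor is regressed (with an intercept) on the remaining ones;
    VIF_j = 1 / (1 - R_j^2), so a high R^2 in that auxiliary regression
    means the predictor is largely explained by the others.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])       # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS fit
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)
```

Columns with VIF above roughly 5 (some use 10) are the ones driving your multicollinearity problem.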

If feasible, the easiest method of reducing multicollinearity is increasing your sample size: a larger dataset usually means more variance and thus likely a weaker dependency between your predictors.

As you noted, any regularization technique like Ridge (L2) or Lasso (L1) regression would resolve the issue of multicollinearity, but the nature of regularization poses a challenge when it comes to hypothesis testing. One way around this problem (while still using Ridge or Lasso) is to use bootstrap methods to estimate the distribution of the regression coefficients: resample your data over and over again, fit a ridge/lasso model to each sample, and build up a distribution of each coefficient. You can then perform hypothesis testing based on this empirical distribution.
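A minimal sketch of the bootstrap-ridge idea in Python/NumPy (the closed-form ridge solution with a fixed penalty `lam` is used for brevity; in practice you would choose `lam` by cross-validation, and the function names are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; intercept handled by centering, not penalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

def bootstrap_ridge(X, y, lam=1.0, n_boot=2000, seed=0):
    """Resample rows with replacement, refit ridge each time,
    and collect the coefficient draws."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = ridge_fit(X[idx], y[idx], lam)
    return draws

def percentile_ci(draws, alpha=0.05):
    """Percentile bootstrap interval per coefficient (rows: coefficients)."""
    return np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0).T
```

A 95% percentile interval that excludes zero then plays the role of a two-sided test at roughly the 5% level for that coefficient.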

Another way to deal with this issue is principal component analysis: a dimensionality reduction technique that transforms your correlated variables into a set of uncorrelated principal components. You can then use these components as the independent variables in your regression model. This resolves the multicollinearity, but the components can be hard to interpret, since each one is a mixture of your original predictors rather than a direct counterpart of any single one.
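A sketch of principal component regression in NumPy (SVD-based PCA on the centered predictors, then OLS on the first `k` component scores; the function name and return layout are my own):

```python
import numpy as np

def pca_regression(X, y, k):
    """OLS of y on the first k principal components of (centered) X.

    Returns the regression coefficients (intercept first), the component
    scores used as regressors, and the loadings (rows of V^T) needed to
    map components back to the original predictor space.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = directions
    scores = Xc @ Vt[:k].T                # n x k, mutually uncorrelated
    A = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, scores, Vt[:k]
```

Because the scores are uncorrelated by construction, their VIFs are all 1 and the coefficient estimates are stable; interpretation then happens through the loadings.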
Similar to PCA, and in your case perhaps more applicable, is partial least squares regression. It works much like PCA, but instead of maximizing the variance explained in the predictors alone, the components are chosen to maximize their covariance with the dependent variable, so they retain predictive relevance.
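To make the contrast with PCA concrete, here is a sketch of just the first PLS component for a single response (the so-called PLS1 case): the weight vector is proportional to the covariances between each predictor and the response, which is what "maximizing covariance with the dependent variable" amounts to. This is a one-component illustration, not a full PLS implementation.

```python
import numpy as np

def pls1_component(X, y):
    """First PLS1 component: unit weight vector maximizing cov(X @ w, y)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc                  # proportional to Cov(X_j, y) per column
    w /= np.linalg.norm(w)
    t = Xc @ w                     # component scores
    return w, t
```

Full PLS would deflate `X` (and optionally `y`) by this component and repeat; in practice you would use a library implementation such as scikit-learn's `PLSRegression` rather than rolling your own.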

There are some other advanced statistical techniques that, if I'm not mistaken, are discussed in the book I mentioned, but I am not familiar enough with them to comment.

Lastly, even though I hope no one from the statistics department is reading this part, it's worth mentioning that in applied work it can sometimes be valid to move from strict hypothesis testing to the estimation and interpretation of effect sizes. For example, ridge regression produces biased coefficients, but potentially more stable, understandable, and generalizable estimates of the relationships in your dataset. Reporting these estimates and discussing them, along with their limitations, can be a valid approach in many research situations.

Hope this helps,

Mihovil