Condition Number Calculation

Peter B

Nov 20, 2023, 12:20:59 PM

to pystatsmodels

Hi

https://www.statsmodels.org/0.8.0/examples/notebooks/generated/ols.html provides an example derivation of the condition number for OLS (under the Multicollinearity subheading).

The following code

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.datasets.longley import load_pandas

    y = load_pandas().endog
    X = load_pandas().exog
    X = sm.add_constant(X)

    # Fit and summary:
    ols_model = sm.OLS(y, X)
    ols_results = ols_model.fit()
    print(ols_results.summary())

    # Normalize the independent variables to have unit length:
    norm_x = X.values.copy()  # copy so that X itself is not modified
    for i, name in enumerate(X):
        if name == "const":
            continue
        norm_x[:, i] = X[name] / np.linalg.norm(X[name])
    norm_xtx = np.dot(norm_x.T, norm_x)

    # Print Condition Number Value
    eigs = np.linalg.eigvals(norm_xtx)
    condition_number = np.sqrt(eigs.max() / eigs.min())
    print(condition_number)

The OLS Regression Results table shows a Cond. No. value of 4.86e+09.
The calculation above yields 56240.8714071.
Is the calculation for generating the condition number in the example code incorrect?
Thanks for any insights
Pete

josef...@gmail.com

Nov 20, 2023, 12:55:06 PM

to pystat...@googlegroups.com

looks correct to me.

Note, there are different condition numbers depending on scaling.

1) `summary()` reports the raw condition number, i.e. the condition number of exog without rescaling.

This is relevant for numerical accuracy of the linear algebra computation for the given exog.

It is not only affected by collinearity but also by bad scaling of the variables.

2) The multicollinearity literature, like Belsley, Kuh and Welsch, uses the condition number of the norm-scaled exog, which is the one described in the notebook.

This does not alert the user if their actual exog is badly scaled. It's more a measure of inherent multicollinearity.

3) Another branch of the multicollinearity literature defines multicollinearity for standardized exog (drop constant, demean and scale).

This additionally does not detect problems with the constant, for example singular exog because of the dummy variable trap.

Each measure alerts to different effects.

The choice for `summary` was to alert to problems with the actual `exog` provided by the user.
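The three variants above can be sketched in a few lines of NumPy. This is only an illustration on made-up data, not the statsmodels implementation: the helper names (`cond_raw`, `cond_norm_scaled`, `cond_standardized`) and the synthetic design matrix are assumptions for the example. Note also that the sketch scales every column to unit length, including the constant, whereas the notebook code leaves the constant column unscaled.

```python
import numpy as np


def cond_raw(exog):
    """1) Condition number of exog as given (ratio of extreme singular values)."""
    return np.linalg.cond(exog)


def cond_norm_scaled(exog):
    """2) Condition number after scaling every column to unit Euclidean length."""
    return np.linalg.cond(exog / np.linalg.norm(exog, axis=0))


def cond_standardized(exog, const_idx=0):
    """3) Condition number after dropping the constant, demeaning and scaling."""
    x = np.delete(exog, const_idx, axis=1)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return np.linalg.cond(x)


# Hypothetical design matrix: a constant, one badly scaled regressor, and one
# regressor with a large mean and small spread (nearly collinear with the constant).
rng = np.random.default_rng(0)
n = 200
x1 = 1e6 * rng.normal(size=n)
x2 = 1000.0 + rng.normal(size=n)
exog = np.column_stack([np.ones(n), x1, x2])

raw = cond_raw(exog)
scaled = cond_norm_scaled(exog)
standardized = cond_standardized(exog)

print(f"raw:          {raw:.3g}")
print(f"norm-scaled:  {scaled:.3g}")
print(f"standardized: {standardized:.3g}")
```

In this setup the raw value should come out very large (bad scaling plus near-collinearity with the constant), the norm-scaled value should stay large because of the near-constant column, and the standardized value should be close to 1, since demeaning removes the relationship with the constant.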

I have a pull request for general multicollinearity measures, but it stalled because I could not decide how to separate out or combine those 3 different versions of these collinearity measures.

https://github.com/statsmodels/statsmodels/pull/2380

Josef


Peter B

Nov 20, 2023, 2:29:43 PM

to pystatsmodels

Hi Josef

Thanks for the prompt reply. That is very clear.

Kind regards

Pete
