variance_inflation_factor in statsmodels... exog and exog_idx


fortozs

Mar 31, 2015, 8:15:22 PM
to pystat...@googlegroups.com
I am trying to make a covariance matrix in statsmodels with variance inflation factors. So far I have:


import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor


y, X = dmatrices('endo ~ X1 + X2 + X3', data=df, return_type='dataframe')
variance_inflation_factor(np.array(X.dropna()), np.array(X.dropna().index))


I get the message:

IndexError: index 7 is out of bounds for axis 1 with size 5

I am probably misunderstanding the usage of variance_inflation_factor. Does anyone know what the proper inputs are for this function? Thanks.

josef...@gmail.com

Mar 31, 2015, 8:33:13 PM
to pystatsmodels
On Tue, Mar 31, 2015 at 8:15 PM, fortozs <for...@gmail.com> wrote:
> I ma trying to make a covariance matrix in statsmodels with variance
> inflation factors.

How do you make a covariance matrix? I don't understand this part.

> So far I have:
>
>
> import pandas as pd
> import numpy as np
> import statsmodels.api as sm
> from patsy import dmatrices
> from statsmodels.stats.outliers_influence import variance_inflation_factor
>
>
> y, X = dmatrices('endo ~ X1 + X2 + X3', data=df, return_type='dataframe')
> variance_inflation_factor(np.array(X.dropna()),np.array(X.dropna().index))
>
>
> I get the message:
>
> IndexError: index 7 is out of bounds for axis 1 with size 5
>
> I am probably misunderstanding the usage of variance_inflation_factor. Does
> anyone know what the proper inputs are for this function? Thanks.

AFAICS:
variance_inflation_factor is a brute-force, non-vectorized version;
the index should be a scalar integer.

vif = [variance_inflation_factor(X.values, ix) for ix in range(X.shape[1])]

Note: dmatrices should already have dropped the missing values, AFAIK


(I found the math for a vectorized version since I wrote
variance_inflation_factor, but it looks like it hasn't made its way
into statsmodels yet.)


Josef

fortozs

Mar 31, 2015, 10:05:27 PM
to pystat...@googlegroups.com
Sorry, I didn't put that part of the code up there. But, for example, I had done this for a Pearson correlation matrix. Bear in mind, I am a beginner to modeling. I will give your suggestion a try. Thanks.

from scipy.stats import pearsonr

full = ['X1', 'X2', 'X3']

pearson = pd.DataFrame(index=full, columns=full)
for i in pearson.index:
    for j in pearson.columns:
        pearson.ix[i, j] = pearsonr(df.dropna()[i], df.dropna()[j])[0]

josef...@gmail.com

Mar 31, 2015, 10:18:49 PM
to pystatsmodels
On Tue, Mar 31, 2015 at 10:05 PM, fortozs <for...@gmail.com> wrote:
> Sorry, I didn't put that part of the code up there. But, for example, I had
> done this for a Pearson correlation matrix. Bear in mind, I am a beginner to
> modeling. i will give your suggestion a try. Thanks.
>
> full = ['X1', 'X2', 'X3']
>
> pearson = pd.DataFrame(index=full, columns=full)
> for i in pearson.index:
>     for j in pearson.columns:
>         pearson.ix[i, j] = pearsonr(df.dropna()[i], df.dropna()[j])[0]


Just a general recommendation:
if you need the same df.dropna() several times, then you should save
it; otherwise pandas has to check for missing values and create a new
array each time you call dropna.

pandas has a correlation method that does this in one
vectorized operation, if you don't need the p-value from pearsonr.

The variance inflation factor currently requires the explicit loop.
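For example (synthetic data; df.corr() computes Pearson correlations by default and handles missing values pairwise):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(100, 3), columns=['X1', 'X2', 'X3'])

# one vectorized call replaces the double loop over pearsonr
pearson = df.corr()

# same numbers as NumPy's correlation matrix
np_corr = np.corrcoef(df.values, rowvar=False)
```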

Josef

fortozs

Mar 31, 2015, 10:40:09 PM
to pystat...@googlegroups.com
Thanks. I am getting a wealth of information out of this. Your method does work for me. VIF is a new concept to me, so I was expecting a factor between each pair of variables rather than only one for each variable. It seems my intercept has a ridiculously high value. I will be more careful with dropna() in the future, and you are correct about dmatrices dropping the missing data for me. I will also keep in mind that Pandas has a Pearson correlation matrix function. I appreciate all your help. You really have a good support operation going here. 

josef...@gmail.com

Mar 31, 2015, 11:30:19 PM
to pystatsmodels
On Tue, Mar 31, 2015 at 10:40 PM, fortozs <for...@gmail.com> wrote:
> Thanks. I am getting a wealth of information out of this. Your method does
> work for me. VIF is a new concept to me, so I was expecting a factor between
> each pair of variables rather than only one for each variable.

vif is a measure of the collinearity of one variable with all the
others, not between pairs.
So there is only one vif per variable; see the Wikipedia page.


> It seems my
> intercept has a ridiculously high value.

That sounds a bit suspicious for the calculation or the definition of the vif.

IIRC, some formulas for calculating the vif do not include a vif for
the constant.
http://en.wikipedia.org/wiki/Variance_inflation_factor doesn't define
a vif for the constant.

It should be useful to help detect a second, implicit constant (for
example too many dummy variables, which you cannot or should not get
when using the formulas). I don't think it means much if the other
variables are continuous and non-constant.

I need to go back and check the constant behavior:

>>> xt = np.linspace(0, 4 * np.pi)
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[9007199254740992.0, 1286742750677284.5, 1286742750677284.5]
>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 1e-1*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[115.49741177486075, 15.395856373845902, 15.395856373845902]
>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 1.*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[4.92952250929378, 1.2859239968676697, 1.285923996867669]

The extreme case is a bit weird (0 instead of inf for the constant):

>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 0*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[0.0, inf, inf]

>>> np.max(np.abs(x.dot([-1, 1, 1])))
2.2204460492503131e-16
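The exact dependence behind those inf values is easy to verify without statsmodels: since sin²(t) + cos²(t) = 1, the two trig columns sum to the constant column (a small numpy check of the session above):

```python
import numpy as np

xt = np.linspace(0, 4 * np.pi)
x = np.column_stack((np.ones(len(xt)), np.sin(xt) ** 2, np.cos(xt) ** 2))

# -1*const + sin^2 + cos^2 should be zero up to rounding error
residual = np.max(np.abs(x.dot([-1, 1, 1])))

# the three columns only span a 2-dimensional space
rank = np.linalg.matrix_rank(x)
```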



> I will be more careful with
> dropna() in the future, and you are correct about dmatrices dropping the
> missing data for me. I will also keep in mind that Pandas has a Pearson
> correlation matrix function. I appreciate all your help. You really have a
> good support operation going here.

Almost; I got distracted before replying to your first thread.

fortozs

Apr 1, 2015, 12:10:57 AM
to pystat...@googlegroups.com
It might take me a bit to digest all that, but I will work through what you are trying to show me. In the meantime, I did manage to write some code that (theoretically) can do a stepwise elimination of variables using vif (including a threshold value for vif). Once again, I have a lot to learn before I'm actually comfortable implementing this.

y, X = dmatrices('y ~ X0 + X1 + X2 + X3 + X4', data=df, return_type='dataframe')
thresh = 5.0
variables = list(range(X.shape[1]))

for i in np.arange(0, len(variables)):
    vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
           for ix in range(len(variables))]
    print(vif)
    maxloc = vif.index(max(vif))
    if max(vif) > thresh:
        print('dropping \'' + X.columns[variables[maxloc]] + '\' at index: ' + str(maxloc))
        del variables[maxloc]

print('Remaining variables:')
print(X.columns[variables])
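The same elimination idea, wrapped in a function and run on synthetic data with one deliberately redundant column (the function name and data are made up for the example):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, names, thresh=5.0):
    """Iteratively drop the column with the largest VIF until all are <= thresh."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [variance_inflation_factor(X[:, keep], i) for i in range(len(keep))]
        worst = max(range(len(vifs)), key=vifs.__getitem__)
        if vifs[worst] <= thresh:
            break
        del keep[worst]
    return [names[i] for i in keep]

rng = np.random.RandomState(0)
a = rng.randn(200)
b = rng.randn(200)
c = a + b + 0.01 * rng.randn(200)   # nearly a linear combination of a and b
X = np.column_stack([a, b, c])
kept = drop_high_vif(X, ['a', 'b', 'c'])
```

The redundant column c gets the largest VIF and is dropped first; the remaining VIFs fall below the threshold and the loop stops.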

