On Tue, Mar 31, 2015 at 10:40 PM, fortozs <for...@gmail.com> wrote:
> Thanks. I am getting a wealth of information out of this. Your method does
> work for me. VIF is a new concept to me, so I was expecting a factor between
> each pair of variables rather than only one for each variable.
The VIF measures the collinearity of one variable with all the other
variables, not between pairs of variables.
So there is only one VIF per variable; see the Wikipedia page (linked below).
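To make the "one per variable" point concrete, here is a minimal sketch of the
textbook definition, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
column i on all the other columns. The helper name vif_by_hand and the
simulated data are made up for illustration, the sketch assumes the design
matrix contains a constant column, and it is not necessarily how statsmodels
computes it internally.

```python
import numpy as np

def vif_by_hand(exog, idx):
    """VIF of column idx: 1 / (1 - R^2) from regressing it on the other columns.

    Hypothetical helper for illustration; assumes exog contains a constant
    column, so the centered R^2 is the appropriate one.
    """
    exog = np.asarray(exog, dtype=float)
    y = exog[:, idx]
    others = np.delete(exog, idx, axis=1)
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.1 * rng.standard_normal(n)  # nearly a linear combination of x1 and x2
exog = np.column_stack((np.ones(n), x1, x2, x3))

# One VIF per non-constant column, each measured against all other columns.
vifs = [vif_by_hand(exog, i) for i in range(1, exog.shape[1])]
print(vifs)

# A pairwise correlation can look harmless even when the VIFs are large.
pairwise = np.corrcoef(x1, x2)[0, 1]
print(pairwise)
```

Here x1 and x2 are generated independently, so their pairwise correlation is
small, yet every VIF is large because x3 is almost x1 + x2. That is the kind of
collinearity a pairwise measure would miss.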
> It seems my
> intercept has a ridiculously high value.
That sounds a bit suspicious, either for the calculation or for the
definition of the VIF.
IIRC, some formulas for calculating the VIF do not include a VIF for
the constant.
http://en.wikipedia.org/wiki/Variance_inflation_factor doesn't define
a VIF for the constant.
A VIF for the constant should be useful for detecting a second, implicit
constant (for example, too many dummy variables, which you cannot or
should not get when using the formulas). I don't think it means much if
the other variables are continuous and non-constant.
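As a rough illustration of what an implicit constant looks like, the sketch
below builds a full set of dummy columns by hand. The three-level grouping
variable is invented for the example. The dummy columns add up to the explicit
constant column, so the design matrix loses a rank. Dropping one level, as the
formula interface is expected to do for you, restores full column rank.

```python
import numpy as np

# Hypothetical categorical variable with three levels.
groups = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
dummies = (groups[:, None] == np.arange(3)).astype(float)  # one column per level

const = np.ones(len(groups))
exog_full = np.column_stack((const, dummies))            # constant + all three dummies
exog_dropped = np.column_stack((const, dummies[:, 1:]))  # constant + two dummies

# The dummy columns sum to the constant column: a second, implicit constant.
has_implicit_constant = np.allclose(dummies.sum(axis=1), const)
rank_full = np.linalg.matrix_rank(exog_full)
rank_dropped = np.linalg.matrix_rank(exog_dropped)

print(has_implicit_constant)
print(rank_full, exog_full.shape[1])        # rank is below the number of columns
print(rank_dropped, exog_dropped.shape[1])  # full column rank
```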
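I need to go back and check the behavior for the constant:
>>> import numpy as np
>>> from statsmodels.stats.outliers_influence import variance_inflation_factor
>>> xt = np.linspace(0, 4 * np.pi)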
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[9007199254740992.0, 1286742750677284.5, 1286742750677284.5]
>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 1e-1*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[115.49741177486075, 15.395856373845902, 15.395856373845902]
>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 1.*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[4.92952250929378, 1.2859239968676697, 1.285923996867669]
The extreme case, with exact collinearity, is a bit weird (0 instead of inf for the constant):
>>> x = np.column_stack((np.ones(len(xt)), np.sin(xt)**2+ 0*np.random.randn(len(xt)), np.cos(xt)**2))
>>> [variance_inflation_factor(x, idx) for idx in range(3)]
[0.0, inf, inf]
>>> np.max(np.abs(x.dot([-1, 1, 1])))
2.2204460492503131e-16
> I will be more careful with
> dropna() in the future, and you are correct about dmatrices dropping the
> missing data for me. I will also keep in mind that Pandas has a Pearson
> correlation matrix function. I appreciate all your help. You really have a
> good support operation going here.
Almost. I got distracted before replying to your first thread.