multicollinearity, condition number, scaling ?


josef...@gmail.com

Feb 7, 2012, 10:48:53 PM
to pystat...@googlegroups.com
I'm still working my way through various diagnostic measures, and
result statistics that gretl has.

Does anyone have recommendations about the condition number as a
measure of multicollinearity or as indicator for possible numerical
problems?

I never looked at the condition numbers, and I'm not sure all of it makes sense.

Gretl reports the 1-norm condition number (actually its inverse) without any
recommendation on thresholds for when to watch out:
np.linalg.cond(res_ols.normalized_cov_params, 1)

Wikipedia http://en.wikipedia.org/wiki/Multicollinearity#Detection_of_multicollinearity
uses the square root of the 2-norm condition number (the square root of the
ratio of the largest to the smallest eigenvalue), with a recommended threshold of 30.
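
For reference, I think something like this reproduces that measure (just a
sketch; exog stands in for the design matrix including the constant):

import numpy as np
eigvals = np.linalg.eigvalsh(np.dot(exog.T, exog))  # eigenvalues of X'X, ascending
cond_wiki = np.sqrt(eigvals[-1] / eigvals[0])
# this equals np.linalg.cond(exog), the 2-norm condition number of X itself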

I had left out the sqrt in the summary() rewrite (I can't find the other
function anymore that I thought we had).

In some lecture notes for Stata, they use scaled, z-scored, and demeaned
versions, with or without intercept (I haven't read all the details yet).


The unscaled versions with a threshold of 30 look way too sensitive to me.
For the parameter estimation in the linear model we use either pinv or
QR, and I doubt there are numerical problems in this range, but this
is only based on some examples I experimented with.

My question right now is whether there are one (or two) measure(s)
for the summary() that users would actually look at and that are
approximately reliable as indicators.
And the other question: is it useful to have a set of measures for
multicollinearity based on correlations, eigenvalues, and condition
numbers for scaled or z-scored explanatory variables?

Josef

Skipper Seabold

Feb 8, 2012, 8:43:15 AM
to pystat...@googlegroups.com
On Tue, Feb 7, 2012 at 10:48 PM, <josef...@gmail.com> wrote:
> I'm still working my way through various diagnostic measures, and
> result statistics that gretl has.
>
> Does anyone have recommendations about the condition number as a
> measure of multicollinearity or as indicator for possible numerical
> problems?
>
> I never looked at the condition numbers, and I'm not sure all of it makes sense.
>
> Gretl reports the 1-norm condition number (actually its inverse) without any
> recommendation on thresholds for when to watch out:
> np.linalg.cond(res_ols.normalized_cov_params, 1)
>
> Wikipedia http://en.wikipedia.org/wiki/Multicollinearity#Detection_of_multicollinearity
> uses the square root of the 2-norm condition number (the square root of the
> ratio of the largest to the smallest eigenvalue), with a recommended threshold of 30.
>

This is what Greene recommends as well. Older 5th edition section 4.9.

> I had left out the sqrt in the summary() rewrite (I can't find the other
> function anymore that I thought we had).
>
> In some lecture notes for Stata, they use scaled, z-scored, and demeaned
> versions, with or without intercept (I haven't read all the details yet).
>
>
> The unscaled versions with a threshold of 30 look way too sensitive to me.
> For the parameter estimation in the linear model we use either pinv or
> QR, and I doubt there are numerical problems in this range, but this
> is only based on some examples I experimented with.
>

I agree. I don't think I've ever seen a dataset with a condition
number smaller than 30 in applied work.

> My question right now is whether there are one (or two) measure(s)
> for the summary() that users would actually look at and that are
> approximately reliable as indicators.

I'm never *that* concerned about it one way or the other unless I run
into real numerical problems, and then I calculate the condition
number myself. I think it'd be good just to have an example page
demonstrating how to get the condition number(s) and some references.

Stata doesn't report it IIRC. They have essentially np.linalg.cond in
their matrix programming language Mata. They will drop a variable if
near-perfect multicollinearity is detected. Chuck had a good response
to this in a thread a while back.

> And the other question: is it useful to have a set of measures for
> multicollinearity based on correlations, eigenvalues, and condition
> numbers for scaled or z-scored explanatory variables?

I dunno. I wouldn't think so for day-to-day use, but maybe as a suite
of tools with some references in case there are problems.

Skipper

josef...@gmail.com

Feb 8, 2012, 9:01:03 AM
to pystat...@googlegroups.com

Good. I got annoyed by the warning that the current summary() prints,
because it looks to me like a false alarm most of the time.

When I suspect numerical problems I usually just check the size of the
smallest eigenvalue.
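
Something like this (sketch, with exog again standing in for the design matrix):

import numpy as np
# eigvalsh returns eigenvalues in ascending order, so [0] is the smallest
smallest_eigval = np.linalg.eigvalsh(np.dot(exog.T, exog))[0]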

>
>> My question right now is whether there are one (or two) measure(s)
>> for the summary() that users would actually look at and that are
>> approximately reliable as indicators.
>
> I'm never *that* concerned about it one way or the other unless I run
> into real numerical problems, and then I calculate the condition
> number myself. I think it'd be good just to have an example page
> demonstrating how to get the condition number(s) and some references.
>
> Stata doesn't report it IIRC. They have essentially np.linalg.cond in
> their matrix programming language Mata. They will drop a variable if
> near-perfect multicollinearity is detected. Chuck had a good response
> to this in a thread a while back.

http://www.nd.edu/~rwilliam/stats2/l11.pdf
has a good discussion, with a user function.

>
>> And the other question: is it useful to have a set of measures for
>> multicollinearity based on correlations, eigenvalues, and condition
>> numbers for scaled or z-scored explanatory variables?
>
> I dunno. I wouldn't think so for day-to-day use, but maybe as a suite
> of tools with some references in case there are problems.

OK, I'll drop the condition number from the summary(), or at least the
warning and recommendation. I would still like to have something that
warns of a singular matrix, e.g. when too many dummy variables are
included.
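
A cheap check along these lines might already be enough (sketch, exog the
design matrix):

import numpy as np
# warn if the design matrix is numerically rank deficient, e.g. because a
# full set of dummies was included together with a constant
if np.linalg.matrix_rank(exog) < exog.shape[1]:
    print('design matrix is singular or nearly singular')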

I have the variance inflation factor as a multicollinearity measure, which
has a clearer interpretation. I might put off the rest of a
multicollinearity suite until we figure out what's really
relevant and how to interpret these measures.

Thanks,

Josef

>
> Skipper

Nathaniel Smith

Feb 8, 2012, 9:58:07 AM
to pystat...@googlegroups.com
On Wed, Feb 8, 2012 at 1:43 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Tue, Feb 7, 2012 at 10:48 PM,  <josef...@gmail.com> wrote:
>> The unscaled versions with a threshold of 30 look way too sensitive to me.
>> For the parameter estimation in the linear model we use either pinv or
>> QR, and I doubt there are numerical problems in this range, but this
>> is only based on some examples I experimented with.
>>
>
> I agree. I don't think I've ever seen a dataset with a condition
> number smaller than 30 in applied work.

I also agree; I Am Not A Numerical Analyst, but as far as I can tell
the threshold of 30 just seems absurd. I would love to see better
references on this, but here's my current understanding:

In double-precision, our data is stored with 53 binary digits of
precision. A condition number of 30 means something like, a perfect
matrix inversion algorithm will leave garbage in the last log2(30) ~ 5
digits. Of course our algorithms aren't perfect (but they're pretty
good), and it's not pure matrix inversion that we care about exactly,
and there may be accumulation of error across multiple steps (though
in OLS neither of these seem very likely)... but who the heck is
starting out with 53 - 5 = 48 actually meaningful digits of precision
in their statistical data, or anywhere close? I work with reaction
times (~1/1000 accuracy, so ~10 bits) and EEG data (recorded with a 12
bit ADC), and even so the last few digits are pretty meaningless due
to statistical noise.

So say I want a nice safety factor, so I require my algorithms to
preserve 20 good binary digits. Then it seems like I should only worry
if my condition number is >2^(53-20) = ~8.6 x 10^9. I think
realistically the only way this could happen is in those degenerate cases
where you only have like 3 data points, or accidentally added
height-in-feet and height-in-meters as separate regressors.
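
Quick made-up illustration of that second degenerate case:

import numpy as np
np.random.seed(0)
meters = np.random.uniform(1.5, 2.0, size=100)
feet = meters * 3.2808               # exactly proportional to the meters column
X = np.column_stack([np.ones(100), meters, feet])
print(np.linalg.cond(X))             # something on the order of 1e16 or more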

If you want to work in single-precision, then you start out with only
23 bits, and then I'd worry a lot more about checking condition
numbers. Of course this is why working in single-precision is a bad
idea, but with big data people do it anyway to save memory... it's
possible that you *should* report the condition number when given
lower-precision ndarrays. Or put some threshold on the difference
between log2(cond(A)) and np.finfo(A.dtype).nmant, and display a
warning based on that.
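
Rough sketch of what I mean (min_good_bits is just a made-up knob):

import numpy as np

def warn_if_ill_conditioned(A, min_good_bits=20):
    nmant = np.finfo(A.dtype).nmant        # 52 for float64, 23 for float32
    lost_bits = np.log2(np.linalg.cond(A))
    if nmant - lost_bits < min_good_bits:
        print('condition number %g leaves too few good bits for dtype %s'
              % (np.linalg.cond(A), A.dtype))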

I am suspicious that the magic threshold of "30" may have been taken
from some context decades ago where people were doing like, physical
simulations with single precision and naive algorithms, and then has
been endlessly copied forward in the statistical literature where our
situation is utterly different.

>> My question right now is whether there are one (or two) measure(s)
>> for the summary() that users would actually look at and that are
>> approximately reliable as indicators.

I recommend VIF to people, since it's actually interpretable by
non-experts. VIF = 3 means that you need to gather 3x as much data as
you would if you were using an orthogonal design. (Of course, it's
only defined for OLS...) My favorite reference on collinearity issues
is chapter 13 of:

@book{Fox2008:regression,
address = {Thousand Oaks, California},
edition = {2nd},
title = {Applied regression analysis and generalized linear models},
isbn = {9780761930426},
publisher = {{SAGE}},
author = {John Fox},
year = {2008}
}
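
For reference, the computation is just the R^2 of an auxiliary regression;
a rough numpy sketch (assuming exog contains a constant column and j is not
the constant):

import numpy as np

def vif(exog, j):
    # regress column j on all the other columns and turn its R^2 into a VIF
    y = exog[:, j]
    others = np.delete(exog, j, axis=1)
    beta = np.linalg.lstsq(others, y)[0]
    resid = y - np.dot(others, beta)
    r_squared = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r_squared)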

The other problem, though, is that -- at least in my field --
absolutely no practitioners have an accurate understanding of how
collinearity works, so just displaying some numbers doesn't help.
Making statistical software that helps non-experts interpret their
data correctly takes more than just displaying the correct numbers
:-). I'm actually not sure whether VIF is even that useful for a
standard 'summary' method -- it should of course be available for when
you want it, but by the time you're running a regression, I guess it's
only the standard errors that you actually care about? They already
summarize the uncertainty from noise, the uncertainty from
collinearity, etc., all together. (VIF is most meaningful *before* you
run your experiment when you can still do something about it...)

-- Nathaniel

josef...@gmail.com

Feb 8, 2012, 11:01:20 PM
to pystat...@googlegroups.com

Thanks, that's pretty much the impression I have, but I didn't know any
technical details.

>
>>> My question right now is whether there are one (or two) measure(s)
>>> for the summary() that users would actually look at and that are
>>> approximately reliable as indicators.
>
> I recommend VIF to people, since it's actually interpretable by
> non-experts. VIF = 3 means that you need to gather 3x as much data as
> you would if you were using an orthogonal design. (Of course, it's
> only defined for OLS...) My favorite reference on collinearity issues
> is chapter 13 of:
>
>  @book{Fox2008:regression,
>         address = {Thousand Oaks, California},
>         edition = {2nd},
>         title = {Applied regression analysis and generalized linear models},
>         isbn = {9780761930426},
>         publisher = {{SAGE}},
>         author = {John Fox},
>         year = {2008}
> }

I was looking at the freely available appendix, but that didn't seem
to be very explicit about various collinearity measures.

>
> The other problem, though, is that -- at least in my field --
> absolutely no practitioners have an accurate understanding of how
> collinearity works, so just displaying some numbers doesn't help.
> Making statistical software that helps non-experts interpret their
> data correctly takes more than just displaying the correct numbers
> :-). I'm actually not sure whether VIF is even that useful for a
> standard 'summary' method -- it should of course be available for when
> you want it, but by the time you're running a regression, I guess it's
> only the standard errors that you actually care about? They already
> summarize the uncertainty from noise, the uncertainty from
> collinearity, etc., all together. (VIF is most meaningful *before* you
> run your experiment when you can still do something about it...)

I didn't plan VIF for the summary table, since it's not just one
simple number. Currently, I still have it together with influence and
outlier diagnostics. It's available as a standalone function so it can
be run before or after the estimation.

I spent large parts of the day writing tests for the diagnostic tests,
comparing against Gretl. In some cases Gretl is nice because it attaches a
comment to measures that don't have an easily interpreted p-value. SAS
also has thresholds for the influence and outlier measures.
Gretl also automatically adds the results of several diagnostic tests
to the regression summary (at least in the GUI version).

My goal is to also provide a comment string in the model results
summary, or in the summary for groups of diagnostic tests, as a warning
that there might be problems with the specification of the model.
Hopefully that is helpful for less experienced users who don't know which
battery of tests they should run and how to interpret them.

But I still have some work to do before we get there.

Thanks,

Josef

>
> -- Nathaniel
