Re: [pystatsmodels] Regression with Singular Exogenous Variable Matrix


Skipper Seabold

Aug 6, 2012, 1:21:28 PM
to pystat...@googlegroups.com
On Mon, Aug 6, 2012 at 12:58 PM, Damien Moore <damien...@gmail.com> wrote:
> Hi,
>
> I'm new to statsmodels. Have been using scipy/numpy for quite a while and
> finally had the need to do some regression modeling. I haven't used
> statsmodels extensively, but so far my experience with it has been very
> nice. Nonetheless, I have a question about the behavior of statsmodels
> regression fit routines when the exogenous variable matrix is singular. This
> topic has come up in the past (specifically
> https://groups.google.com/d/topic/pystatsmodels/AZXNfPcq_HI/discussion).
>
> In statsmodels, I think one of two things happens when a singular matrix is
> passed as the exogenous variables to an ols/glm/discretemodel regression
> fit:
>
> a) if the parameters are estimated with 'pinv', we get a solution that is
> hard to interpret, and no warning about singular X

There is actually a warning in the regression models.

from statsmodels.formula.api import glm, ols
import statsmodels.api as sm  # needed for sm.families.Binomial() in the GLM example below
import pandas

x = pandas.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
x["gpa2"] = 3*x.gpa + 2*x.gre

mod = ols("admit ~ gpa2 + gre + gpa + C(rank)", x).fit()

print mod.summary()

But not in GLM.

mod = glm("admit ~ gpa2 + gre + gpa + C(rank)", x,
family=sm.families.Binomial()).fit()

print mod.summary()


> b) the solver fails spitting out a traceback related to the singular matrix
>
> In contrast, R and many other packages will effectively drop columns until
> the system is no longer singular (and report n/a for the parameter on the column that
> was dropped). I recall that R uses a nice QR decomp trick to do this
> efficiently. See below for an example session with output. Another package
> I've used, but can't recall which actually spits out the restriction
> equation that relates the unidentified parameters to the identified
> parameters as part of the output -- I think this also drops out of the QR
> decomp (or perhaps some other decomp).
>
> So my question is, wouldn't it be better, as the default behavior, to do
> something more like R does, by replacing redundant (unidentified) parameters
> with, say, "nan" plus issue a warning? (Even nicer if this was consistent
> across all of the fit routines.)
>

I think this is likely to be a better solution, though sometimes we'll
want to allow singular designs. I'll look into it, though pull
requests are welcome. Probably can be added to this existing ticket.

https://github.com/statsmodels/statsmodels/issues/146


> Regards,
> Damien
>
> Here's the sample R session:
>
>> mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
>> mydata$rank <- factor(mydata$rank)
>> mydata$gpa2 <- 3*mydata$gpa + 2*mydata$gre
>> mylogit <- glm(admit ~ gpa2 + gre + gpa + rank, data = mydata, family =
>> "binomial")
>> summary(mylogit)
>
> Call:
> glm(formula = admit ~ gpa2 + gre + gpa + rank, family = "binomial",
> data = mydata)
>
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -1.6268  -0.8662  -0.6388   1.1490   2.0790
>
> Coefficients: (1 not defined because of singularities)
>              Estimate Std. Error z value Pr(>|z|)
> (Intercept)   -3.9900     1.1400  -3.500 0.000465 ***
> gpa2           0.2680     0.1106   2.423 0.015388 *
> gre           -0.5338     0.2216  -2.409 0.016006 *
> gpa                NA         NA      NA       NA
> rank2         -0.6754     0.3165  -2.134 0.032829 *
> rank3         -1.3402     0.3453  -3.881 0.000104 ***
> rank4         -1.5515     0.4178  -3.713 0.000205 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> (Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 499.98 on 399 degrees of freedom
> Residual deviance: 458.52 on 394 degrees of freedom
> AIC: 470.52
>
> Number of Fisher Scoring iterations: 4
>
>
>

Nathaniel Smith

Aug 6, 2012, 1:21:54 PM
to pystat...@googlegroups.com

IIRC, R uses a rank-revealing QR as the core algorithm here, and this produces practical difficulties because LAPACK does not implement this - you have to ship your own. Theirs is a fortran implementation from LINPACK, again IIRC.

I also like the behavior you're talking about, and I think it's pretty straightforward to implement by using the SVD directly and checking for and handling null singular values explicitly. (So computationally this is similar to using pinv, but with tweaked internals.)
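(For concreteness, a minimal numpy sketch of that idea -- the tolerance choice and the returned null-space rows are my own assumptions, not anything statsmodels currently does:)

import numpy as np

def svd_lstsq(X, y, rcond=1e-12):
    # thin SVD of the design matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # singular values at or below the tolerance indicate a rank-deficient design
    tol = rcond * s[0]
    nonzero = s > tol
    # invert only the "good" singular values (this is what pinv does internally)
    s_inv = np.zeros_like(s)
    s_inv[nonzero] = 1.0 / s[nonzero]
    params = np.dot(Vt.T, s_inv * np.dot(U.T, y))
    rank = int(nonzero.sum())
    # rows of Vt for the null singular values describe the redundant directions
    null_space = Vt[~nonzero]
    return params, rank, null_space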

-n


Skipper Seabold

Aug 6, 2012, 1:24:24 PM
to pystat...@googlegroups.com
On Mon, Aug 6, 2012 at 1:21 PM, Nathaniel Smith <n...@pobox.com> wrote:
> IIRC, R uses a rank-revealing QR as the core algorithm here, and this
> produces practical difficulties because LAPACK does not implement this - you
> have to ship your own. Theirs is a fortran implementation from LINPACK,
> again IIRC.
>

I'm not sure, but can't you do this with scipy.linalg.qr and pivoting?
Only in newer scipy.
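(I think something along these lines would work -- a rough sketch with scipy.linalg.qr(..., pivoting=True); the tolerance here is just a guess at a sensible default, not what R uses:)

import numpy as np
from scipy import linalg

def rank_revealing_qr(X):
    # column-pivoted QR: X[:, piv] = Q R, with the diagonal of R roughly decreasing
    Q, R, piv = linalg.qr(X, mode='economic', pivoting=True)
    d = np.abs(np.diag(R))
    # scale-relative cutoff; the "right" constant is debatable
    tol = d[0] * max(X.shape) * np.finfo(float).eps
    rank = int((d > tol).sum())
    keep = piv[:rank]   # columns considered identified
    drop = piv[rank:]   # redundant columns (linear combinations of the kept ones)
    return rank, keep, drop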

Nathaniel Smith

Aug 6, 2012, 2:23:48 PM
to pystat...@googlegroups.com
On Mon, Aug 6, 2012 at 6:24 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Mon, Aug 6, 2012 at 1:21 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> IIRC, R uses a rank-revealing QR as the core algorithm here, and this
>> produces practical difficulties because LAPACK does not implement this - you
>> have to ship your own. Theirs is a fortran implementation from LINPACK,
>> again IIRC.
>>
>
> I'm not sure, but can't you do this with scipy.linalg.qr and pivoting?
> Only in newer scipy.

Huh, looks like my IIRC was perhaps not so C.

-n

josef...@gmail.com

Aug 6, 2012, 2:24:17 PM
to pystat...@googlegroups.com

I'm not very much in favor of having an "automatic" solution to the multicollinearity/singularity problem, except as additional options.

Some issues:

When is a matrix rank deficient? Multicollinearity is a problem much earlier than when the matrix is "obviously" singular:
http://old.nabble.com/Matrix-rank-default-tolerance---is-it-too-low--tt34015756.html#a34015756
http://jpktd.blogspot.fr/2012/03/numerical-accuracy-in-linear-least.html

Whether a singular matrix matters depends on the usage.
Matthews was arguing that we can use (estimable) contrast matrices for testing that take care of the singularity with pinv-estimated parameters.
I have cases where I only care about the fit, not about the interpretation of individual parameters, and it was nice that I didn't have to special-case the problematic design matrices.

Is it clear which params should be "NaNed" if there are some perfectly correlated variables? In my opinion it should be the user's decision what to do about (near-)singularities, and we should only warn.
It is also the user's responsibility to use or not to use singular coding for categorical variables, and to decide what to do about generally very high multicollinearity that messes up the parameter estimates.

As in https://github.com/statsmodels/statsmodels/issues/146  I think we need more warnings, checking, options, and we need a streamlined linear algebra that can take care of this without duplicate checks and costly calculations.

Also, in some cases we just get exceptions raised somewhere in the calculations.

Josef


josef...@gmail.com

Aug 6, 2012, 3:25:50 PM
to pystat...@googlegroups.com
On Mon, Aug 6, 2012 at 1:21 PM, Nathaniel Smith <n...@pobox.com> wrote:

IIRC, R uses a rank-revealing QR as the core algorithm here, and this produces practical difficulties because LAPACK does not implement this - you have to ship your own. Theirs is a fortran implementation from LINPACK, again IIRC.

I also like the behavior you're talking about, and I think it's pretty straightforward to implement by using the SVD directly and checking for and handling null singular values explicitly. (So computationally this is similar to using pinv, but with tweaked internals.)

Can you really use SVD for this? It will just give you the eigenvectors (analogous, since I'm better with eigenvalue decomposition) associated with the singular values. How do you select which variable(s) to drop or set to NaN?

 (I had started to look at replacing pinv and working with SVD directly, but didn't get very far.)

Josef

Damien Moore

Aug 6, 2012, 3:45:03 PM
to pystat...@googlegroups.com
Just on the math I was under the impression it was something like this...

(pseudocode/handwaving)
Q,R = qr(X)
Z = rows of R that are indistinguishable from zero (based on some threshold)
Ri = R[~Z,~Z]
Qi = Q[:,~Z]
Ru = R[~Z,Z]

solve for the identified coefficients using Ri and Qi in place of R,Q
solve the relationships between unidentified and identified by solving for v in: Ri v = -Ru  (v will be a matrix of coefficients on the identified parameters)
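(A rough numpy rendering of the above, just to make it concrete -- it assumes the near-zero diagonal entries of R really line up with the redundant columns, which plain qr without pivoting may not guarantee:)

import numpy as np

def qr_fit_sketch(X, y, tol=1e-10):
    Q, R = np.linalg.qr(X)
    # Z marks diagonal entries of R that are indistinguishable from zero
    Z = np.abs(np.diag(R)) < tol
    Ri = R[np.ix_(~Z, ~Z)]
    Qi = Q[:, ~Z]
    Ru = R[np.ix_(~Z, Z)]
    # identified coefficients
    params = np.linalg.solve(Ri, np.dot(Qi.T, y))
    # v relates the unidentified params to the identified ones: Ri v = -Ru
    v = np.linalg.solve(Ri, -Ru)
    return params, v, Z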

Would still argue for making this the default behavior with a warning (and option to switch it off) as an assist to novice/lazy users.


>I have cases where I only care about the fit, not about the interpretation of individual parameters,  and it was nice that I didn't have to special case the problematic design matrices.

Agree that this is useful behavior. (Suspect that there is a straightforward enough way to recover the pinv parameters from the NaNed parameters, or vice versa...)

josef...@gmail.com

Aug 7, 2012, 2:54:49 AM
to pystat...@googlegroups.com

To elaborate some points in my view:

I would love to get singular design matrix detection as a by-product of the calculations, so we can offer options for what to do in this case without having to do a costly check by default.

This would be relatively easy with the linear model, and with RLM and GLM since they reuse WLS. But I don't know how the version proposed here would work; svd would only detect the singular cases, AFAISee.
http://jpktd.blogspot.fr/2012/06/qr-and-sequential-regression-with-fixed.html

For discrete models or models that just use non-linear optimization like maximum likelihood estimators, I don't see a possibility for detection as a by-product. The time when it shows up is when the Hessian is not invertible, and there might also be other numerical problems that result in a non-invertible Hessian.


Josef



josef...@gmail.com

Aug 7, 2012, 2:56:54 AM
to pystat...@googlegroups.com
On Mon, Aug 6, 2012 at 3:45 PM, Damien Moore <damien...@gmail.com> wrote:
Just on the math I was under the impression it was something like this...

(pseudocode/handwaving)
Q,R = qr(X)
Z = rows of R that are indistinguishable from zero (based on some threshold)
Ri = R[~Z,~Z]
Qi = Q[:,~Z]
Ru = R[~Z,Z]

solve for the identified coefficients using Ri and Qi in place of R,Q
solve the relationships between unidentified and identified by solving for v in: Ri v = -Ru  (v will be a matrix of coefficients on the identified parameters)

It didn't work for me in some brief experiments I did before with the singular case and QR. My impression is that the components after the redundant variable are not correct for this.

If anyone gets this to work, it would be a nice addition.

Josef

 

Damien Moore

Aug 7, 2012, 8:58:46 PM
to pystat...@googlegroups.com


On Tuesday, August 7, 2012 2:56:54 AM UTC-4, josefpktd wrote:
It didn't work for me in some brief experiments I did before with the singular case and QR. My impression is that the components after the redundant variable are not correct for this.

If anyone gets this to work, it would be a nice addition.

Josef



I put some code up on GitHub that makes a start. First time using GitHub so bear with me if something went wrong. Here's the git link: git://github.com/spillz/sci-comp.git

The script qr_fit.py solves a simple OLS regression with 3 of the 6 exog variables perfectly correlated. You will see that the approach of dropping rows/columns of the zero diagonal entries of R from the full QR results in a fit that doesn't quite match the parameters you get if you manually drop the columns before doing the QR. I guess this is because the zero diagonal entries are not exactly zero, resulting in approximation errors, but I might just have the procedure completely screwed up :)

I think I have this roughly right, though I must confess that I haven't actually worked through the math, mostly used trial and error. The code can obviously be made a lot more efficient (and general purpose) but that can wait until it is actually correct.


josef...@gmail.com

Aug 8, 2012, 6:29:20 AM
to pystat...@googlegroups.com
On Tue, Aug 7, 2012 at 8:58 PM, Damien Moore <damien...@gmail.com> wrote:


On Tuesday, August 7, 2012 2:56:54 AM UTC-4, josefpktd wrote:
It didn't work for me in some brief experiments I did before with the singular case and QR. My impression is that the components after the redundant variable are not correct for this.

If anyone gets this to work, it would be a nice addition.

Josef



I put some code up on GitHub that makes a start. First time using GitHub so bear with me if something went wrong. Here's the git link: git://github.com/spillz/sci-comp.git

The script qr_fit.py solves a simple OLS regression with 3 of the 6 exog variables perfectly correlated. You will see that the approach of dropping rows/columns of the zero diagonal entries of R from the full QR results in a fit that doesn't quite match the parameters you get if you manually drop the columns before doing the QR. I guess this is because the zero diagonal entries are not exactly zero, resulting in approximation errors, but I might just have the procedure completely screwed up :)

I get the same results with my original script. In my case the percentage errors in the params were worse, because I had some true coefficients equal to zero (after the singular column).
My impression was that the column for the singular variable is just noise, which affects the later columns.

My conclusion was also that if we don't have a qr algorithm that automatically drops a variable that is a linear combination of the earlier ones, then running a second qr on the reduced X matrix is the only way to go, conditional on the almost zero entries in the r diagonal.

One question is how small should the threshold for the r diagonal be. In my example
>>> np.min(np.abs(np.diag(r)))
8.9373761050510275e-16
>>> np.min(np.abs(np.diag(r**2)))
7.987669164313708e-31

Should this be relative to the scale of X ?
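(One common convention, for comparison -- not necessarily the right answer here -- is a cutoff relative to the largest diagonal entry and the matrix dimensions, something like:)

r_abs = np.abs(np.diag(r))
tol = r_abs.max() * max(X.shape) * np.finfo(float).eps
nonzero = r_abs > tol   # diagonal entries treated as nonzero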
 

I think I have this roughly right, though I must confess that I haven't actually worked through the math, mostly used trial and error. The code can obviously be made a lot more efficient (and general purpose) but that can wait until it is actually correct.


The way you calculate the singular vectors (linear combinations) is nice; I never thought about that.

It pretty much agrees with my understanding, which is mostly trial and error and some rough intuition (and taking hints from Charles Harris.)
https://github.com/josef-pkt/statsmodels/commits/sequential_ols
is my main experimental branch to a more efficient linear algebra for some use cases.


Some thoughts on where to go (warning: I'm talking too much on mailing lists):

method='qr' needs fixing for singular exog

>>> OLS(y, x).fit(method='qr').params
array([  1.11209641e+15,  -1.11209641e+15,  -8.29023032e-02,
        -5.91013477e-02])
>>> OLS(y, x).fit(method='pinv').params
array([ 1.02385541,  1.02385541, -0.097463  , -0.0719577 ])

option: raise informative exception:
(without qr that can automatically drop variables)
checking the diagonal of r for almost zeros: If any, then raise exception with information about the singular linear combinations.

that sounds like an easy short term solution

option: drop variables automatically

The main implementation problem of OLS (linear model) is the nan propagation in the results.
If we replace unidentified params with nan, then later calculations will break because they are not designed to handle nans: f_test, t_test, ...?
If we just reduce the parameter space, then we could also just return a new regression instance with the "offending" columns dropped. The problem is how we get the information to the user about which variables are still included (should be ok if the data came from a pandas dataframe, or has full column_names).
A third possibility would be to just return limited results, and let it break whenever a nan is involved in any results calculations. (my guess is that f_test and t_test still work if the NaNed parameter is not in the restriction matrix)


Using qr in GLM and RLM by default:
nan propagation sounds like it would require a lot of refactoring, with special cases that are irrelevant for the non-singular case (and possibly costly).
Just automatically reducing exog would be possible with the same effect as in linear model.
(edit: add WLS fit "method" as an option for GLM, RLM https://github.com/dieterv77/statsmodels/commit/4e025a174ce539e744a4a04036958d07a7e9e187 )

If nan propagation is too difficult to implement (which I guess it is), then I don't see any big advantage in doing the dropping of exog columns automatically, instead of letting the user decide what to do about the singular design matrix.
In this case, raising an exception and telling the users which variables to drop sounds more useful.


Separate question: what to use as default, pinv or a variation on qr?

I don't have a very strong opinion:
main disadvantage is that it's a big break in backwards compatibility and we and existing users would have to check where they rely on the pinv behavior (and add method='pinv' everywhere it matters).
It might create some inconsistencies with respect to sensitivity to statistical problems. I don't like costly checks by default, so currently we require users to check their results (use regression diagnostics). If the calculations produce the checks as a by-product, then we can increase the number of early warnings or exceptions.

My favorite solution would still be a global option, mode='paranoid' or mode='do the work and don't raise' (or error level),
plus context managers (?) for specific use cases.


aside: My recent case with singular design matrices and no solution yet:
for maximum trimmed likelihood (dropping outliers) I needed to run a Poisson regression on a small subsample of the data (a few hundred times). The exog was mostly categorical variables, so most subsamples were singular. The default fit() just raises a LinalgError somewhere in the code. What's the solution?
(my short term solution was to use another example and ignore the problem.)

Josef
SMEP: Singular Design Matrices


Damien Moore

Aug 9, 2012, 10:39:09 AM
to pystat...@googlegroups.com

I get the same results with my original script. In my case the percentage errors in the params were worse, because I had some true coefficients equal to zero (after the singular column).
My impression was that the column for the singular variable is just noise, which affects the later columns.

My conclusion was also that if we don't have a qr algorithm that automatically drops a variable that is a linear combination of the earlier ones, then running a second qr on the reduced X matrix is the only way to go, conditional on the almost zero entries in the r diagonal.


Perhaps worth investigating an alternative QR. Was hoping there might be an easy way to create the reduced QR from the full one.
 
One question is how small should the threshold for the r diagonal be. In my example
>>> np.min(np.abs(np.diag(r)))
8.9373761050510275e-16
>>> np.min(np.abs(np.diag(r**2)))
7.987669164313708e-31

Should this be relative to the scale of X ?


Not sure. At a minimum I guess you want the threshold to be sufficiently high to prevent singular matrix issues.


The way you calculate the singular vectors (linear combinations) is nice; I never thought about that.

It is useful, especially with large systems.
 

It pretty much agrees with my understanding, which is mostly trial and error and some rough intuition (and taking hints from Charles Harris.)
https://github.com/josef-pkt/statsmodels/commits/sequential_ols
is my main experimental branch to a more efficient linear algebra for some use cases.


I will take a look.

 
Some thoughts on where to go (warning: I'm talking too much on mailing lists):

method='qr' needs fixing for singular exog

>>> OLS(y, x).fit(method='qr').params
array([  1.11209641e+15,  -1.11209641e+15,  -8.29023032e-02,
        -5.91013477e-02])
>>> OLS(y, x).fit(method='pinv').params
array([ 1.02385541,  1.02385541, -0.097463  , -0.0719577 ])

option: raise informative exception:
(without qr that can automatically drop variables)
checking the diagonal of r for almost zeros: If any, then raise exception with information about the singular linear combinations.

Would be nice to have an automated way to get the reduced model as well.

 

that sounds like an easy short term solution

option: drop variables automatically

The main implementation problem of OLS (linear model) is the nan propagation in the results.
If we replace unidentified params with nan, then later calculations will break because they are not designed to handle nans: f_test, t_test, ...?

Yes, in retrospect filling the param vector with nan probably isn't helpful. I was mostly thinking in terms of summary output where it would be nice to show the full model and display n/a (or the equation) for removed parameters. That assumes you choose to go that way, of course.


Using qr in GLM and RLM by default:
nan propagation sounds like it would require a lot of refactoring, with special cases that are irrelevant for the non-singular case (and possibly costly).

Perhaps instead of nan, the internals need to pass around an internal exog and params that will contain the full model if non-singular, or a reduced model.

Separate question: what to use as default, pinv or a variation on qr?

I don't have a very strong opinion:
main disadvantage is that it's a big break in backwards compatibility and we and existing users would have to check where they rely on the pinv behavior (and add method='pinv' everywhere it matters).

I don't have a strong opinion. Would make sense that the default is whatever works best for most people/problems.

My favorite solution would still be a global option, mode='paranoid' or mode='do the work and don't raise' (or error level),
plus context managers (?) for specific use cases.

I like that.
 

aside: My recent case with singular design matrices and no solution yet:
for maximum trimmed likelihood (dropping outliers) I needed to run a Poisson regression on a small subsample of the data (a few hundred times). The exog was mostly categorical variables, so most subsamples were singular. The default fit() just raises a LinalgError somewhere in the code. What's the solution?

Maybe something like this:

After the first LinAlgError, check the QR of exog; if a singular matrix is found:
(1) paranoid: raise an exception containing data about the needed restrictions/suggested columns to drop
(2) auto: drop columns, try again.
Otherwise, reraise the error.
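(A very rough sketch of that control flow -- find_redundant_columns is a hypothetical helper, e.g. the pivoted-QR check discussed earlier in the thread:)

import numpy as np

def fit_with_singular_check(model_class, y, X, mode='paranoid', **fit_kw):
    try:
        return model_class(y, X).fit(**fit_kw)
    except np.linalg.LinAlgError:
        # hypothetical helper: returns indices of (near-)redundant columns of X
        drop = find_redundant_columns(X)
        if len(drop) == 0:
            raise  # not a singular-design problem, reraise as-is
        if mode == 'paranoid':
            raise np.linalg.LinAlgError(
                "singular design; consider dropping columns %r" % (list(drop),))
        # mode == 'auto': drop the offending columns and try again
        keep = [j for j in range(X.shape[1]) if j not in set(drop)]
        return model_class(y, X[:, keep]).fit(**fit_kw)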
