IIRC, R uses a rank-revealing QR as the core algorithm here, and this produces practical difficulties because LAPACK does not implement this - you have to ship your own. Theirs is a fortran implementation from LINPACK, again IIRC.
I also like the behavior you're talking about, and I think it's pretty straightforward to implement by using the SVD directly and checking for and handling null singular values explicitly. (So computationally this is similar to using pinv, but with tweaked internals.)
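A minimal NumPy sketch of that idea could be (the tolerance rule here is an assumption, patterned on numpy.linalg.matrix_rank's default):

import numpy as np

def svd_fit(X, y):
    # least squares via SVD, with null singular values handled explicitly
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(float).eps   # assumed tolerance rule
    keep = s > tol                     # identified directions
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]        # invert only the non-null singular values
    params = Vt.T.dot(s_inv * U.T.dot(y))
    return params, keep.sum()          # minimum-norm solution and numerical rank

The result is the same pinv-style minimum-norm solution, but the null directions are detected explicitly, so the caller can warn, drop, or nan-out the corresponding parameters instead of silently returning something hard to interpret.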
-n
Hi,
I'm new to statsmodels. Have been using scipy/numpy for quite a while and finally had the need to do some regression modeling. I haven't used statsmodels extensively, but so far my experience with it has been very nice. Nonetheless, I have a question about the behavior of statsmodels regression fit routines when the exogenous variable matrix is singular. This topic has come up in the past (specifically https://groups.google.com/d/topic/pystatsmodels/AZXNfPcq_HI/discussion).
In statsmodels, I think one of two things happens when a singular matrix is passed as the exogenous variables to an ols/glm/discretemodel regression fit:
a) if the parameters are estimated with 'pinv', we get a solution that is hard to interpret, and no warning about the singular X
b) the solver fails, spitting out a traceback related to the singular matrix
In contrast, R and many other packages will effectively drop columns until the system is no longer singular (and report n/a for the parameter on each column that was dropped). I recall that R uses a nice QR decomp trick to do this efficiently. See below for an example session with output. Another package I've used (I can't recall which) actually spits out the restriction equation that relates the unidentified parameters to the identified parameters as part of the output -- I think this also drops out of the QR decomp (or perhaps some other decomp).
So my question is, wouldn't it be better, as the default behavior, to do something more like R does, replacing redundant (unidentified) parameters with, say, "nan" and issuing a warning? (Even nicer if this were consistent across all of the fit routines.)
Just on the math I was under the impression it was something like this...
(pseudocode/handwaving)
Q,R = qr(X)
Z = indices where the diagonal of R is indistinguishable from zero (based on some threshold)
Ri = R[~Z,~Z]
Qi = Q[:,~Z]
Ru = R[~Z,Z]
solve for the identified coefficients using Ri and Qi in place of R,Q
solve the relationships between unidentified and identified by solving for v in: Ri v = -Ru (v will be a matrix of coefficients on the identified parameters)
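A literal NumPy translation of that sketch might look like this (hypothetical code, not from the thread; as the replies below note, without column pivoting it doesn't quite recover the reduced-model estimates):

import numpy as np

def qr_fit_sketch(X, y, tol=1e-12):
    # qr of the full (possibly singular) design matrix
    Q, R = np.linalg.qr(X)
    d = np.abs(np.diag(R))
    drop = d < tol * d.max()               # "zero" diagonal -> unidentified columns
    keep = ~drop

    Ri = R[np.ix_(keep, keep)]
    Qi = Q[:, keep]
    Ru = R[np.ix_(keep, drop)]

    # identified coefficients: Ri b = Qi' y
    b = np.linalg.solve(Ri, Qi.T.dot(y))

    # relation between unidentified and identified columns: Ri v = -Ru
    v = np.linalg.solve(Ri, -Ru)

    params = np.full(X.shape[1], np.nan)   # unidentified parameters reported as nan
    params[keep] = b
    return params, v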
It didn't work for me in some brief experiments I did before with the singular case and QR. My impression is that the components after the redundant variable are not correct for this.
If anyone gets this to work, it would be a nice addition.
Josef
I put some code up on GitHub that makes a start. First time using GitHub so bear with me if something went wrong. Here's the git link: git://github.com/spillz/sci-comp.git
The script qr_fit.py solves a simple OLS regression with 3 of the 6 exog variables perfectly correlated. You will see that the approach of dropping the rows/columns corresponding to zero diagonal entries of R from the full QR results in a fit that doesn't quite match the parameters you get if you manually drop the redundant columns before doing the QR. I guess this is because the zero diagonal entries are not exactly zero, resulting in approximation errors, but I might just have the procedure completely screwed up :)
I think I have this roughly right, though I must confess that I haven't actually worked through the math, mostly used trial and error. The code can obviously be made a lot more efficient (and general purpose) but that can wait until it is actually correct.
I get the same results with my original script. In my case the percentage errors in the params were worse, because I had some true coefficients equal to zero (after the singular column).
My impression was that the component for the singular column is just noise, which affects the later columns.
My conclusion was also that if we don't have a qr algorithm that automatically drops a variable that is a linear combination of the earlier ones, then running a second qr on the reduced X matrix is the only way to go, conditional on the almost-zero entries in the r diagonal.
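A sketch of that two-pass idea (the tolerance and the nan fill-in are illustrative choices, not current statsmodels behavior):

import numpy as np

def ols_qr_two_pass(X, y, tol=1e-12):
    # first pass: flag columns whose diagonal entry of r is (almost) zero
    r = np.linalg.qr(X, mode='r')
    d = np.abs(np.diag(r))
    keep = d > tol * d.max()

    # second pass: qr on the reduced design matrix, standard OLS solve
    q2, r2 = np.linalg.qr(X[:, keep])
    beta = np.linalg.solve(r2, q2.T.dot(y))

    params = np.full(X.shape[1], np.nan)   # dropped (unidentified) params as nan
    params[keep] = beta
    return params

This assumes the near-zero diagonal entries line up with the redundant columns, which holds when each redundant column is a linear combination of earlier ones.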
One question is how small should the threshold for the r diagonal be. In my example
>>> np.min(np.abs(np.diag(r)))
8.9373761050510275e-16
>>> np.min(np.abs(np.diag(r**2)))
7.987669164313708e-31
Should this be relative to the scale of X?
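One scale-relative rule would be to mimic the default tolerance that numpy.linalg.matrix_rank uses for singular values (just a guess that the same pattern carries over to the r diagonal):

import numpy as np

def r_diag_tolerance(r):
    # relative cutoff for "zero" diagonal entries of r
    d = np.abs(np.diag(r))
    return d.max() * max(r.shape) * np.finfo(float).eps

An absolute cutoff like 1e-15 wouldn't survive rescaling a column of X, so something relative to the largest diagonal entry (or to the column norms of X) seems safer.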
The way you calculate the singular vectors (linear combinations) is nice, I never thought about that.
It pretty much agrees with my understanding, which is mostly trial and error and some rough intuition (and taking hints from Charles Harris.)
https://github.com/josef-pkt/statsmodels/commits/sequential_ols
is my main experimental branch for more efficient linear algebra in some use cases.
Some thoughts on where to go (warning: I'm talking too much on mailing lists):
method='qr' needs fixing for singular exog
>>> OLS(y, x).fit(method='qr').params
array([ 1.11209641e+15, -1.11209641e+15, -8.29023032e-02,
-5.91013477e-02])
>>> OLS(y, x).fit(method='pinv').params
array([ 1.02385541, 1.02385541, -0.097463 , -0.0719577 ])
option: raise an informative exception
(without a qr that can automatically drop variables)
Check the diagonal of r for almost-zeros: if there are any, raise an exception with information about the singular linear combinations.
That sounds like an easy short-term solution.
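A minimal sketch of that short-term option (the function name and the error message are made up, not existing statsmodels API):

import numpy as np

def check_singular_exog(X, tol=1e-12):
    # raise an informative error instead of returning uninterpretable params
    r = np.linalg.qr(X, mode='r')
    d = np.abs(np.diag(r))
    bad = np.flatnonzero(d < tol * d.max())
    if bad.size:
        raise np.linalg.LinAlgError(
            "exog appears singular; columns %s look like linear "
            "combinations of earlier columns" % bad.tolist())

The actual linear combinations could be reported as well, using the Ri v = -Ru relation from the pseudocode above.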
option: drop variables automatically
The main implementation problem in OLS (the linear model) is the nan propagation in the results.
If we replace unidentified params with nan, then later calculations will break because they are not designed to handle nans: f_test, t_test, ...?
Using qr in GLM and RLM by default:
nan propagation sounds like it would require a lot of refactoring, with special cases that are irrelevant for the non-singular case (and possibly costly).
Separate question: what to use as default, pinv or a variation on qr?
I don't have a very strong opinion:
the main disadvantage of switching the default is that it's a big break in backwards compatibility, and we and existing users would have to check where existing code relies on the pinv behavior (and add method='pinv' everywhere it matters).
My favorite solution would still be a global option, mode='paranoid' or mode='do the work and don't raise' (or an error level).
plus context managers (?) for specific use cases.
aside: My recent case with singular design matrices and no solution yet:
For maximum trimmed likelihood (dropping outliers) I needed to run a Poisson regression on small subsamples of the data (a few hundred times). The exog was mostly categorical variables, so most subsamples were singular. The default fit() just raises a LinAlgError somewhere in the code. What's the solution?