OLS failed with the error 'SVD did not converge'. Cannot understand why.


Greg Dybalski

unread,
Oct 7, 2015, 4:44:52 PM10/7/15
to pystatsmodels

The following statsmodels command


Model2 = smf.ols( formula='R_G ~  bat_9innings + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))',  data=P).fit()


results in the error 'SVD did not converge'.  As I understand it, this error occurs when the matrix of independent variables, X, is not of full rank.  But the X matrix does have full rank.  The independent variable bat_9innings is a continuous variable (no missing values), and the team and year variables are categorical.  Team has 30 unique values, and Year has 27 unique values.  Because the two categorical variables are crossed, both the main-effect dummies and the crossed (interaction) terms enter the regression.  Consequently, the number of regression coefficients is about 780.  A constant term is implicit in the model.


What makes this confusing is that a second model command performs essentially the same regression without error. 


Model2 = smf.ols( formula='R_G ~ +0+ bat_9innings + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P).fit()


The second model command drops the intercept (a 0 is included in the formula), and the categorical variables contribute an additional dummy variable to the regression.


To add to the confusion, when these commands were executed on a second computer, both ran without error.  The estimated models are essentially the same.


My dataframe P has the following contents.


MultiIndex: 3702 entries, (1988, ATL, alvaj001) to (2014, WSN, zimmj003)

Data columns (total 20 columns):

Batters         3702 non-null float64

Others          3702 non-null float64

Pitchers        3702 non-null float64

playerID        3702 non-null object

IPOuts          3702 non-null float64

BFP             3702 non-null int64

BB              3702 non-null int64

HR              3702 non-null int64

SO              3702 non-null int64

HBP             3702 non-null float64

R               3702 non-null int64

IBB             3702 non-null int64

Innings         3702 non-null float64

G_9Innings      3702 non-null float64

FIP             3702 non-null float64

R_G             3702 non-null float64

BallsInPlay     3702 non-null int64

pit_9innings    3702 non-null float64

bat_9innings    3702 non-null float64

oth_9innings    3702 non-null float64

 

josef...@gmail.com

unread,
Oct 7, 2015, 5:10:31 PM10/7/15
to pystatsmodels
Most of the time, full rank is not a requirement for SVD; SVD is designed to handle singular matrices.
We get the 'SVD did not converge' message if there are NaNs in the array, but when you use the formula interface the default is to remove rows with NaNs. So, that shouldn't be a problem.
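
For illustration, a toy sketch of the NaN case (not your data; as noted below, the exact behavior can differ across numpy/LAPACK builds):

import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)
A[0, 0] = np.nan          # a single NaN in the design matrix
try:
    np.linalg.svd(A)
except np.linalg.LinAlgError as err:
    print(err)            # on many builds: "SVD did not converge"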

Also, linear algebra problems in corner cases depend on the underlying linear algebra libraries, and the behavior can or will differ across library versions and computers.

So, right now I don't have a guess what could be wrong.

You can check the design matrix that was used by the model. You are using a large number of dummy variables, and with a large number of crossed effects it becomes possible that some combination has no observations. In that case, the corresponding column would be all zeros.

try
np.ptp(Model2.model.exog)
np.mean(Model2.model.exog)

and check how many zeros you have.


I have written some functions and methods to check and remove collinear variables and columns, but those are not merged yet.
But, as I said, I would be surprised if it's only collinearity that causes the 'SVD did not converge' error.

Josef

Greg Dybalski

unread,
Oct 8, 2015, 10:18:56 AM10/8/15
to pystatsmodels
Thank you for the assistance.  I ran the two suggested commands.

np.ptp(Model2.model.exog)
15.538461538461538

np.mean(Model2.model.exog)
0.015767416513506379

You mentioned that I should count the number of zeros, but the two commands do not show any.

Does this help?

Greg

josef...@gmail.com

unread,
Oct 8, 2015, 10:31:17 AM10/8/15
to pystatsmodels
Sorry my mistake, I meant adding axis=0 to both

josef...@gmail.com

unread,
Oct 8, 2015, 10:32:41 AM10/8/15
to pystatsmodels
On Thu, Oct 8, 2015 at 10:31 AM, <josef...@gmail.com> wrote:
Sorry my mistake, I meant adding axis=0 to both

and now I hit the wrong shortcut key
np.ptp(Model2.model.exog, axis=0)

There should be only one zero if the constant is included explicitly
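
For example, a minimal sketch (assuming Model2 has been fit and numpy is imported as np) that counts the zero-range columns, i.e. columns that are constant over all rows:

col_range = np.ptp(Model2.model.exog, axis=0)   # max minus min for each column
print((col_range == 0).sum())                   # expect 1 for the explicit constant; more suggests empty dummy columns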

Josef

Greg Dybalski

unread,
Oct 9, 2015, 9:00:19 AM10/9/15
to pystatsmodels
After examining the data I discovered that, because of the crossing of the team and Year variables, there were columns containing all zeros.   I assume that this causes the SVD failure.

The team variable contains the names of the baseball teams, and some teams did not come into existence until 1998 or later.  The Year variable runs from 1988 to 2014.  Consequently, crossing team and Year for 1988 results in columns containing all zeros for the teams that did not yet exist.

But, I'm still a little confused.  Of the two models, one including an intercept and one excluding it, why would the model excluding the intercept execute?  The two regressions estimate the same model but differ slightly in how they are set up.  Both would include the columns containing all zeros.  So why does one fail and the other succeed?

I appreciate you taking your time to offer assistance.

Greg




josef...@gmail.com

unread,
Oct 9, 2015, 9:37:16 AM10/9/15
to pystatsmodels
On Fri, Oct 9, 2015 at 9:00 AM, Greg Dybalski <dyba...@gmail.com> wrote:
After examining the data I discovered that, because of the crossing of the team and Year variables, there were columns containing all zeros.   I assume that this causes the SVD failure.

The team variable contains the names of the baseball teams, and some teams did not come into existence until 1998 or later.  The Year variable runs from 1988 to 2014.  Consequently, crossing team and Year for 1988 results in columns containing all zeros for the teams that did not yet exist.

But, I'm still a little confused.  Of the two models, one including an intercept and one excluding it, why would the model excluding the intercept execute?  The two regressions estimate the same model but differ slightly in how they are set up.  Both would include the columns containing all zeros.  So why does one fail and the other succeed?

Being theoretically the same doesn't imply that it's the same at floating point precision.

My best guess is that it's a numerical idiosyncrasy of the linear algebra libraries BLAS/LAPACK that are used.
I guess that numerical noise accumulates in different ways, which allows it to succeed in one case but not the other.

A similar numerical problem was recently raised on Stack Overflow; see https://github.com/statsmodels/statsmodels/issues/2628
There, a factor is also encoded so the exog is singular, and when I run the example, about a third of the time I get a very inaccurate full-rank solution. It's different from the current case because it always succeeds but doesn't always produce "good numbers".


I tried out some examples with zero columns because of missing cells in cross effects a while ago. I always got good results (coefficient=0, IIRC, or equalized across related coefficients), but I only used small, made-up data, which, I think now, doesn't run into the problem of noise accumulation.

I haven't finished my code to add options for the handling of singular and near singular design matrices and automatic removal or penalization for it.

I think the best way at the moment to get a numerically stable solution is to remove the zero columns: for example, use patsy.dmatrices to create a pandas Series/DataFrame and then use the non-formula interface.
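
A rough sketch of that workflow (the simplified formula and the zero-column handling here are only illustrative, not tested against your data):

from patsy import dmatrices
import statsmodels.api as sm

y, X = dmatrices('R_G ~ bat_9innings + C(team) * C(Year)', data=P, return_type='dataframe')
zero_cols = X.columns[(X == 0).all()]   # dummies for empty team/Year cells
X = X.drop(zero_cols, axis=1)
res = sm.OLS(y, X).fit()                # non-formula interface with the cleaned design matrix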


Josef

Greg Dybalski

unread,
Oct 9, 2015, 10:24:01 AM10/9/15
to pystatsmodels
I thank you again for the assistance.  I plan to examine patsy's dmatrices.

Because I am so new to pandas and statsmodels, I had also estimated the model in R.  The results from R and SAS matched the results from statsmodels, so I was confident about the statsmodels results.

Greg


josef...@gmail.com

unread,
Oct 9, 2015, 10:47:38 AM10/9/15
to pystatsmodels
On Fri, Oct 9, 2015 at 10:24 AM, Greg Dybalski <dyba...@gmail.com> wrote:
I thank you again for the assistance.  I plan to examine patsy's dmatrices.

Because I am so new to pandas and statsmodels, I had also estimated the model in R.  The results from R and SAS matched the results from statsmodels, so I was confident about the statsmodels results.

We have good test coverage that verifies the results for reasonably nice cases. 

However, some edge cases can produce "weird" results. The recent example shows that we cannot fully trust a singular case even with OLS and the generalized inverse. The warning that summary() prints is appropriate in these cases. We won't be able to do anything about the statistical problem (a non-unique solution), but until recently I thought the singular case worked fine numerically.


Improving the behavior in edge cases requires that we learn about them, and then figure out a solution and implement it, or add appropriate warnings.

Corner cases with unexpected behavior can be reported on the issue tracker.

Josef

Greg Dybalski

unread,
Oct 14, 2015, 12:46:05 PM10/14/15
to pystatsmodels
Hi,

The SVD error still occurs, but only under a particular arrangement of the columns of the design matrix.  Previously, I did have zero columns; for my regression I identified the columns containing all zeros and removed these meaningless columns. But I continue to receive a 'LinAlgError: SVD did not converge' depending on how the design matrix is arranged. This error should not depend on the order of the columns.


The design matrix is set up and modified using these commands:

  1. Design_mat = dmatrix( 'C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))',data=P, return_type="dataframe")

  2. zeroCols = Design_mat.columns[(Design_mat == 0).all()]

  3. DMatrix = Design_mat.drop(zeroCols, axis=1)

  4. Three additional columns of continuous variables are inserted into DMatrix. Only the following order causes the SVD error; adding these columns in any other order does not cause an error. The regression is run using DMatrix.

    1. DMatrix['oth_9innings'] = P['oth_9innings']

    2. DMatrix['bat_9innings'] = P['bat_9innings']

    3. DMatrix['pit_9innings'] = P['pit_9innings']


In step 1 the design matrix, Design_mat, is created from the crossing of the two categorical variables. In step 2 the all-zero columns are identified, and they are removed in step 3. In step 4 the three continuous variables, 'oth_9innings', 'bat_9innings', and 'pit_9innings', are inserted into the data frame. A consolidated sketch of these steps is shown below.
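
Put together, the sequence above looks roughly like this (a sketch assuming patsy, statsmodels.api as sm, and the dataframe P are already loaded):

from patsy import dmatrix
import statsmodels.api as sm

Design_mat = dmatrix('C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))',
                     data=P, return_type='dataframe')
zeroCols = Design_mat.columns[(Design_mat == 0).all()]   # step 2: find the all-zero dummy columns
DMatrix = Design_mat.drop(zeroCols, axis=1)              # step 3: drop them
DMatrix['oth_9innings'] = P['oth_9innings']              # step 4: append the continuous variables
DMatrix['bat_9innings'] = P['bat_9innings']
DMatrix['pit_9innings'] = P['pit_9innings']
result = sm.OLS(P.R_G, DMatrix).fit()                    # fails with 'SVD did not converge' for this ordering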




josef...@gmail.com

unread,
Oct 14, 2015, 1:22:21 PM10/14/15
to pystatsmodels
On Wed, Oct 14, 2015 at 12:46 PM, Greg Dybalski <dyba...@gmail.com> wrote:
Hi,

The SVD error still occurs, but only under a particular arrangement of the columns of the design matrix.  Previously, I did have zero columns; for my regression I identified the columns containing all zeros and removed these meaningless columns. But I continue to receive a 'LinAlgError: SVD did not converge' depending on how the design matrix is arranged. This error should not depend on the order of the columns.


The design matrix is set up and modified using these commands:

  1. Design_mat = dmatrix( 'C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))',data=P, return_type="dataframe")

  2. zeroCols = Design_mat.columns[(Design_mat == 0).all()]

Equality comparison of a float with 0 might not be enough, because there will be numerical noise.

Assuming you don't have columns with tiny values, you could try

zeroCols = Design_mat.columns[(abs(Design_mat) < 1e-12).all()]

I don't remember if pandas has an `abs` method; if not, use np.abs.

If that doesn't help, then I might have to look at the data or similar data that replicates the problem.
(or I need to speed up finishing and merging Ridge Regression.)


Two more checks for multicollinearity or reduced rank, once you have the design matrix DMatrix for the regression available:

np.linalg.matrix_rank(DMatrix)    # uses some threshold to identify rank; compare with the number of columns

from scipy import linalg
linalg.svdvals(DMatrix) 

The first might fail because it also uses SVD, IIRC.
Using scipy.linalg might call a different linear algebra function or use different defaults.

Maybe also something like
np.diag(np.linalg.qr(DMatrix)[1])

If svdvals or the last expression has any values smaller in absolute value than about 1e-12, then there is still a serious problem with multicollinearity.
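
For example, a compact sketch of those checks (assuming the DMatrix dataframe from above; the 1e-12 cutoff is only a rough threshold):

import numpy as np
from scipy import linalg

X = np.asarray(DMatrix)
print(np.linalg.matrix_rank(X), X.shape[1])   # rank versus number of columns
sv = linalg.svdvals(X)
print(sv.min(), sv.max() / sv.min())          # smallest singular value and condition number
r_diag = np.diag(np.linalg.qr(X)[1])
print(np.abs(r_diag).min())                   # values near zero would indicate collinearity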

Josef


Greg Dybalski

unread,
Oct 14, 2015, 5:04:40 PM10/14/15
to pystatsmodels
Josef,

I've been thinking about this problem. The problem occurs when the matrix of exogenous variables is arranged in a certain order.  Would the SVD routines be sensitive to the order of the columns?  The last three columns can be arranged 6 different ways, but only 1 arrangement causes the SVD failure; the other 5 arrangements do not result in an SVD failure.  Are we looking in the wrong place?

We have been looking for a problem in the data; could there be a problem with the installed programs instead?  Can we test the installation?

To answer your questions I used DMatrix. This design matrix had the columns containing all zeros dropped, leaving 780 dummy-variable columns. The 3 additional columns of continuous variables were inserted in the order shown in the previous posting. DMatrix has 783 columns and failed.  I did examine removing columns whose values are all near zero, and the result was identical to removing the columns whose values all equal zero.


I ran the suggested commands. Here are the results.


# Test for columns consisting entirely of values near zero.

zeroColsf = DMatrix.columns[(abs(DMatrix) < 1e-12).all()]

len(zeroColsf)

0


# The rank of the matrix

np.linalg.matrix_rank(DMatrix)

783


# Alternative SVD methods

from scipy import linalg

testSVD = pd.DataFrame( linalg.svdvals(DMatrix) )

testSVD.describe()


count   783.000000
mean      4.680002
std      49.173778
min       0.077035
25%       2.000000
50%       2.236068
75%       2.236068
max    1368.883552



factor = pd.DataFrame( np.diag(np.linalg.qr(DMatrix)[1]) )

factor.describe()

count  780.000000
mean     0.473117
std      4.020911
min    -60.844063
25%     -1.847557
50%      1.718187
75%      2.047259
max     11.668308


Greg



josef...@gmail.com

unread,
Oct 15, 2015, 3:15:00 PM10/15/15
to pystatsmodels
On Wed, Oct 14, 2015 at 5:04 PM, Greg Dybalski <dyba...@gmail.com> wrote:
Josef,

I've been thinking about this problem. The problem occurs when the matrix of exogenous variables is arranged in a certain order.  Would the SVD routines be sensitive to the order of the columns?  The last three columns can be arranged 6 different ways, but only 1 arrangement causes the SVD failure; the other 5 arrangements do not result in an SVD failure.  Are we looking in the wrong place?

We have been looking for a problem in the data; could there be a problem with the installed programs instead?  Can we test the installation?

One last check on the data: is `DMatrix.dtype` float64? (Check np.asarray(DMatrix).dtype if necessary.)

Given your svdvals below, there is nothing wrong with the data. The smallest singular value is pretty large (not close to zero), and the condition number is elevated, but that is nothing float64 shouldn't be able to handle without problems.

You could see what running the numpy test suite reports; it's just np.test().

The data was my prime suspect, because I haven't heard of (m)any serious problems with the linalg libraries; the usual cases are when the problems are at the edge of numerical precision.

np.test() might not show anything if it's subtle and only shows up in very specific cases. The questions then are: which operating system, numpy and python version, which BLAS/LAPACK libraries. 
And most of that is out of my area, where I don't know any details.


As a workaround that should work as long as you don't have singular design matrices, use `method='qr'` in the call to OLS.fit.
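
For example, a one-line sketch (assuming the regression call is sm.OLS(P.R_G, DMatrix)):

result = sm.OLS(P.R_G, DMatrix).fit(method='qr')   # QR-based fit, avoids the SVD-based pinv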

There could still be a problem, because we check the svdvals to determine the degrees of freedom, i.e. the matrix rank of exog.  AFAICS, you never copied the full traceback, so I don't know whether it fails in svdvals or in pinv.

If using qr still fails in SVD, then we need to cheat for now and maybe add a try except protection in the code.
The factor.describe() output doesn't show how close to zero those diagonal values are; I had forgotten that QR doesn't fix the sign on the diagonal.
But svdvals above already shows that there is no near-perfect collinearity.

Josef

Greg Dybalski

unread,
Oct 16, 2015, 1:44:33 PM10/16/15
to pystatsmodels

Josef,


Before performing any of your suggestions I re-installed Anaconda 64-bit. In addition, I made sure that the latest version of numpy was installed. The regression continues to fail when the DMatrix has a specific column ordering.


Here is the requested information.


 Is `DMatrix.dtype` float64? 

DMatrix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3702 entries, 0 to 3701
Columns: 783 entries, Intercept to pit_9innings
dtypes: float64(783)
memory usage: 22.1 MB


=========================================================================

np.test()

The test did run but ended with some errors. Some of the output from the test is shown below.


Running unit tests for numpy
NumPy version 1.10.1
NumPy relaxed strides checking option: True
NumPy is installed in C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy
Python version 3.4.3 |Anaconda 2.3.0 (64-bit)| (default, Mar  6 2015, 12:06:10) [MSC v.1600 64 bit (AMD64)]
nose version 1.3.7


ERROR: test_log2_special (test_umath.TestLog2)
ERROR: test_compile1 (test_system_info.TestSystemInfoReading)
ERROR: test_rfft (test_fft.TestFFT1D)
ERROR: test_scripts.test_f2py
FAIL: test_default (test_numeric.TestSeterr)
Ran 5569 tests in 81.127s

FAILED (KNOWNFAIL=8, SKIP=11, errors=4, failures=1)
Out[15]:
<nose.result.TextTestResult run=5569 errors=4 failures=1>


=========================================================================

Adding `method='qr'` as a workaround did succeed; the regression worked.


=========================================================================

The full traceback from the failed regression.

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-109-20fe8eea5c89> in <module>()
----> 1 Case5 = sm.OLS( P.R_G, DMatrix).fit()
      2 Case5.summary()

C:\Users\Daddio1949\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py in fit(self, method, cov_type, cov_kwds, use_t, **kwargs)
    172                 (not hasattr(self, 'rank'))):
    173 
--> 174                 self.pinv_wexog, singular_values = pinv_extended(self.wexog)
    175                 self.normalized_cov_params = np.dot(self.pinv_wexog,
    176                                         np.transpose(self.pinv_wexog))

C:\Users\Daddio1949\Anaconda3\lib\site-packages\statsmodels\tools\tools.py in pinv_extended(X, rcond)
    390     X = np.asarray(X)
    391     X = X.conjugate()
--> 392     u, s, vt = np.linalg.svd(X, 0)
    393     s_orig = np.copy(s)
    394     m = u.shape[0]

C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in svd(a, full_matrices, compute_uv)
   1357 
   1358         signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1359         u, s, vt = gufunc(a, signature=signature, extobj=extobj)
   1360         u = u.astype(result_t, copy=False)
   1361         s = s.astype(_realType(result_t), copy=False)

C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     97 
     98 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 99     raise LinAlgError("SVD did not converge")
    100 
    101 def get_linalg_error_extobj(callback):

LinAlgError: SVD did not converge



Greg




josef...@gmail.com

unread,
Oct 16, 2015, 2:29:41 PM10/16/15
to pystatsmodels
On Fri, Oct 16, 2015 at 1:44 PM, Greg Dybalski <dyba...@gmail.com> wrote:

Josef,


Before performing any of your suggestions I re-installed Anaconda 64-bit. In addition, I made sure that the latest version of numpy was installed. The regression continues to fail when the DMatrix has a specific column ordering.


Here is the requested information.


 Is `DMatrix.dtype` float64? 

DMatrix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3702 entries, 0 to 3701
Columns: 783 entries, Intercept to pit_9innings
dtypes: float64(783)
memory usage: 22.1 MB



good, as expected
 

=========================================================================

np.test()

The test did run but ended with some errors. Some of the output from the test is shown below.


Running unit tests for numpy
NumPy version 1.10.1
NumPy relaxed strides checking option: True
NumPy is installed in C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy
Python version 3.4.3 |Anaconda 2.3.0 (64-bit)| (default, Mar  6 2015, 12:06:10) [MSC v.1600 64 bit (AMD64)]
nose version 1.3.7


ERROR: test_log2_special (test_umath.TestLog2)
ERROR: test_compile1 (test_system_info.TestSystemInfoReading)
ERROR: test_rfft (test_fft.TestFFT1D)
ERROR: test_scripts.test_f2py
FAIL: test_default (test_numeric.TestSeterr)
Ran 5569 tests in 81.127s

FAILED (KNOWNFAIL=8, SKIP=11, errors=4, failures=1)
Out[15]:
<nose.result.TextTestResult run=5569 errors=4 failures=1>



Those are not directly linalg-related, but AFAIK they shouldn't fail.
My guess is that there are some problems either with how numpy is compiled or with the underlying linalg libraries.
 

=========================================================================

Adding `method='qr'` as a workaround did succeed; the regression worked.



That should work then as a solution on your system. 
The main difference between QR and the default pinv version is in how they handle singular design matrices. 
R uses QR by default, but they have a version that moves singular variables to the end (pivoting), which wasn't available in scipy until a recent version.
OK, that failure is in the calculation of the parameters, so it looks like you don't have problems in the rank determination. That part is not used when using method='qr'.
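
For illustration, a sketch of pivoted QR with scipy (assuming a recent scipy version; pivoting moves the dependent columns to the end):

import numpy as np
from scipy import linalg

Q, R, perm = linalg.qr(np.asarray(DMatrix), mode='economic', pivoting=True)
rank = (np.abs(np.diag(R)) > 1e-12).sum()   # columns of perm beyond `rank` are (near) linearly dependent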

At this stage, I think we cannot do much from the statsmodels side. Over time we should add additional methods for fit, so more options are available if one or several won't work.

(I was thinking about using Ridge regression to numerically stabilize, but a numerically efficient implementation would use SVD which wouldn't help in this case.)

Josef

josef...@gmail.com

unread,
Oct 24, 2015, 11:29:13 AM10/24/15
to pystatsmodels
Related: Maybe BLAS/LAPACK problems should be higher again on the list of "suspects"

Two recent scipy issues involved a broken installation or a LAPACK bug in a corner case.


Josef