The following statsmodels command
Model2 = smf.ols(formula='R_G ~ bat_9innings + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P).fit()
results in the error 'SVD did not converge'. As I understand it, this error occurs when the matrix of independent variables, X, is not of full rank. But the X matrix does have full rank. The independent variable bat_9innings is continuous (no missing values), and team and Year are categorical. team has 30 unique values and Year has 27 unique values. Because the two categorical variables are crossed, both main-effect and interaction dummy terms enter the regression; consequently, the number of regression coefficients is about 780. A constant term is implicit in the model.
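As a direct check of that rank claim, here is a minimal sketch (the check itself is mine; exog is the design matrix statsmodels builds from the formula):

import numpy as np
import statsmodels.formula.api as smf

model = smf.ols(formula='R_G ~ bat_9innings + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P)
X = model.exog
# Full rank would mean the rank equals the number of columns.
print(X.shape[1], np.linalg.matrix_rank(X))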
What makes this confusing is that a second model command performs essentially the same regression without error.
Model2 = smf.ols( formula='R_G ~ +0+ bat_9innings + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P).fit()
The second model command drops the intercept (the 0 in the formula), and in its place one of the categorical variables is coded with an additional dummy column.
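One way to see the difference in coding is to build the two design matrices with patsy directly (a sketch; the exact column counts follow from patsy's redundancy rules, so the widths should match, with the Intercept column traded for an extra dummy):

from patsy import dmatrix

X_with = dmatrix('C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P, return_type='dataframe')
X_no_int = dmatrix('0 + C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P, return_type='dataframe')
# Compare widths and check which matrix carries an explicit Intercept column.
print(X_with.shape, X_no_int.shape)
print('Intercept' in X_with.columns, 'Intercept' in X_no_int.columns)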
Now, to add to the confusion, these commands were also executed on a second computer, where both ran without error, and the estimated models are essentially the same.
My dataframe P has the following contents.
MultiIndex: 3702 entries, (1988, ATL, alvaj001) to (2014, WSN, zimmj003)
Data columns (total 20 columns):
Batters 3702 non-null float64
Others 3702 non-null float64
Pitchers 3702 non-null float64
playerID 3702 non-null object
IPOuts 3702 non-null float64
BFP 3702 non-null int64
BB 3702 non-null int64
HR 3702 non-null int64
SO 3702 non-null int64
HBP 3702 non-null float64
R 3702 non-null int64
IBB 3702 non-null int64
Innings 3702 non-null float64
G_9Innings 3702 non-null float64
FIP 3702 non-null float64
R_G 3702 non-null float64
BallsInPlay 3702 non-null int64
pit_9innings 3702 non-null float64
bat_9innings 3702 non-null float64
oth_9innings 3702 non-null float64
15.538461538461538

np.mean(Model2.model.exog)
0.015767416513506379

You mentioned that I should count the number of zeros; the two commands do not show any. Does this help?

Greg
Sorry, my mistake; I meant adding axis=0 to both commands.
After examining the data, I discovered that the crossing of the team and Year variables produced columns containing all zeros. I assume that this causes the SVD failure. The team variable contains the names of baseball teams, and some teams did not come into existence until 1998 or later, while the Year variable runs from 1988 to 2014. Consequently, crossing one of these newer teams with 1988 yields a column containing all zeros.

But I'm still a little confused. Of the two models, one including an intercept and one excluding it, why would only the model excluding the intercept execute? The two regressions estimate the same model but differ slightly in how they are set up, and both would include the columns containing all zeros. So why does one fail and the other succeed?
I thank you again for the assistance. I plan to examine patsy's dmatrices. Because I am so new to pandas and statsmodels, I had also estimated the model in R; the results in R and SAS matched the results from statsmodels, so I was confident about the statsmodels results.
The design matrix is set up and modified using these commands:
from patsy import dmatrix

# Steps 1-3: build the crossed dummy design matrix, find the all-zero columns, and drop them.
Design_mat = dmatrix('C(team, Treatment(reference=-1)) * C(Year, Treatment(reference=-1))', data=P, return_type="dataframe")
zeroCols = Design_mat.columns[(Design_mat == 0).all()]
DMatrix = Design_mat.drop(zeroCols, axis=1)
Three additional columns of continuous variables are then inserted into DMatrix, and the regression is run using DMatrix. Only the following insertion order causes the SVD error; any other order does not.
DMatrix['oth_9innings'] = P['oth_9innings']
DMatrix['bat_9innings'] = P['bat_9innings']
DMatrix['pit_9innings'] = P['pit_9innings']
In step 1 the design matrix, Design_mat, was created from the crossing of the two categorical variables. In step 2 the columns of zeros are identified, and in step 3 they are removed. In step 4 the three continuous variables, oth_9innings, bat_9innings, and pit_9innings, are inserted into the data frame.
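To make the order dependence concrete, here is a minimal sketch (the loop and the try/except wrapper are mine; the names are as above) that tries all six insertion orders:

from itertools import permutations
import numpy as np
import statsmodels.api as sm

cols = ['oth_9innings', 'bat_9innings', 'pit_9innings']
for order in permutations(cols):
    X = Design_mat.drop(zeroCols, axis=1)   # fresh copy of the dummy columns
    for c in order:
        X[c] = P[c]                         # append the continuous columns in this order
    try:
        sm.OLS(P['R_G'], X).fit()
        print(order, 'OK')
    except np.linalg.LinAlgError as e:
        print(order, 'failed:', e)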
Hi,

The SVD error still occurs, but only under a particular arrangement of the columns of the design matrix. Previously I did have zero columns; for my regression I identified the columns containing all zeros and removed these meaningless columns. But I continue to receive 'LinAlgError: SVD did not converge' depending on how the design matrix is arranged. This error should not depend on the order of the columns.
To answer your questions, I used DMatrix. This design matrix had the columns containing all zeros dropped, leaving 780 dummy-variable columns, and the 3 additional continuous columns were inserted in the order shown in the previous posting, so DMatrix has 783 columns and failed. I also examined removing columns whose values are merely near zero, and the result was identical to removing the columns whose values equal zero exactly.
I ran the suggested commands. Here are the results.
# Test for columns consisting entirely of values near zero.
zeroColsf = DMatrix.columns[(abs(DMatrix) < 1e-12).all()]
len(zeroColsf)
0
# The rank of the matrix
np.linalg.matrix_rank(DMatrix)
783
# Alternative SVD methods
from scipy import linalg
testSVD = pd.DataFrame( linalg.svdvals(DMatrix) )
testSVD.describe()
count     783.000000
mean        4.680002
std        49.173778
min         0.077035
25%         2.000000
50%         2.236068
75%         2.236068
max      1368.883552
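A quick follow-up that can be computed from those same singular values (this check is mine): the condition number max/min is about 1.8e4, which is large but not pathological, so extreme ill-conditioning alone does not explain the failure.

svals = linalg.svdvals(DMatrix)
# Ratio of largest to smallest singular value, roughly 1368.88 / 0.077 ~ 1.8e4.
print(svals.max() / svals.min())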
factor = pd.DataFrame( np.diag(np.linalg.qr(DMatrix)[1]) )
factor.describe()
count    780.000000
mean       0.473117
std        4.020911
min      -60.844063
25%       -1.847557
50%        1.718187
75%        2.047259
max       11.668308
Josef,

I've been thinking about this problem. The problem occurs when the matrix of exogenous variables is arranged in a certain order. Would the SVD routines be sensitive to the order of the columns? The last three columns can be arranged in 6 different ways, but only 1 arrangement causes the SVD failure; the other 5 arrangements do not. Are we looking in the wrong place? We have been looking for a problem in the data, but could there be a problem with the installed programs? Can we test the installation?
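For testing the installation, two standard numpy utilities can be run (a sketch; the np.test() output below follows this route):

import numpy as np

np.show_config()   # report which BLAS/LAPACK numpy is linked against
np.test()          # run numpy's own test suite (uses nose on this numpy version)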
Josef,
Before performing any of your suggestions, I re-installed Anaconda 64-bit and made sure that the latest version of numpy was installed. The regression continues to fail when DMatrix has a specific column ordering.
Here is the requested information.
Is `DMatrix.dtype` float64?
DMatrix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3702 entries, 0 to 3701
Columns: 783 entries, Intercept to pit_9innings
dtypes: float64(783)
memory usage: 22.1 MB
=========================================================================
np.test()
The test did run but ended with some errors. Some of the output from the test is shown below.
Running unit tests for numpy
NumPy version 1.10.1
NumPy relaxed strides checking option: True
NumPy is installed in C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy
Python version 3.4.3 |Anaconda 2.3.0 (64-bit)| (default, Mar 6 2015, 12:06:10) [MSC v.1600 64 bit (AMD64)]
nose version 1.3.7
ERROR: test_log2_special (test_umath.TestLog2)
ERROR: test_compile1 (test_system_info.TestSystemInfoReading)
ERROR: test_rfft (test_fft.TestFFT1D)
ERROR: test_scripts.test_f2py
FAIL: test_default (test_numeric.TestSeterr)

Ran 5569 tests in 81.127s
FAILED (KNOWNFAIL=8, SKIP=11, errors=4, failures=1)
Out[15]: <nose.result.TextTestResult run=5569 errors=4 failures=1>
=========================================================================
Adding `method='qr'` as a workaround did succeed; the regression worked.
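Concretely, the workaround uses the documented method argument of OLS.fit (a sketch, with the same names as in the traceback below):

import statsmodels.api as sm

# Fit with the QR-based solver instead of the default pinv/SVD path.
Case5 = sm.OLS(P['R_G'], DMatrix).fit(method='qr')
print(Case5.summary())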
=========================================================================
The full traceback from the failed regression.
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-109-20fe8eea5c89> in <module>()
----> 1 Case5 = sm.OLS( P.R_G, DMatrix).fit()
2 Case5.summary()
C:\Users\Daddio1949\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py in fit(self, method, cov_type, cov_kwds, use_t, **kwargs)
172 (not hasattr(self, 'rank'))):
173
--> 174 self.pinv_wexog, singular_values = pinv_extended(self.wexog)
175 self.normalized_cov_params = np.dot(self.pinv_wexog,
176 np.transpose(self.pinv_wexog))
C:\Users\Daddio1949\Anaconda3\lib\site-packages\statsmodels\tools\tools.py in pinv_extended(X, rcond)
390 X = np.asarray(X)
391 X = X.conjugate()
--> 392 u, s, vt = np.linalg.svd(X, 0)
393 s_orig = np.copy(s)
394 m = u.shape[0]
C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in svd(a, full_matrices, compute_uv)
1357
1358 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1359 u, s, vt = gufunc(a, signature=signature, extobj=extobj)
1360 u = u.astype(result_t, copy=False)
1361 s = s.astype(_realType(result_t), copy=False)
C:\Users\Daddio1949\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
97
98 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 99 raise LinAlgError("SVD did not converge")
100
101 def get_linalg_error_extobj(callback):

LinAlgError: SVD did not converge
Greg