Memory Error for Robust Standard Errors Using GLM

Charles Martineau
Sep 8, 2016, 12:02:34 AM
to pystatsmodels
I have been trying to implement some of Josef's suggestions here on how to compute robust standard errors for GLM (binomial logit, to be exact), but I am running into MemoryError messages.

For example, I do the following:

mod1 = sm.GLM(df['ca'], df[int_col_s], family=sm.families.Binomial())  # int_col_s is a list of the x variable names
res1 = mod1.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=np.array(df.Ticker)))
# or
res1 = mod1.fit(disp=False, cov_type='HC1', cov_kwds={})

and I get this error:

MemoryError                               Traceback (most recent call last)
<ipython-input-207-98c3158770e6> in <module>()
      1 ca_['CA_t'] = (ca_['CA']+1)/2
      2 mod1 = sm.GLM(ca_['CA_t'], ca_[int_col_s], family=sm.families.Binomial())
----> 3 mod1.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=np.array(df.Ticker))) 
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, **kwargs)
    727                                  self.scale,
    728                                  cov_type=cov_type, cov_kwds=cov_kwds,
--> 729                                  use_t=use_t)
    730 
    731         history['iteration'] = iteration + 1
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in __init__(self, model, params, normalized_cov_params, scale, cov_type, cov_kwds, use_t)
    920                 cov_kwds = {}
    921             get_robustcov_results(self, cov_type=cov_type, use_self=True,
--> 922                                        use_t=use_t, **cov_kwds)
    923 
    924     @cache_readonly
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\base\covtype.py in get_robustcov_results(self, cov_type, use_t, **kwds)
    193                 self.n_groups = n_groups = len(np.unique(groups))
    194             res.cov_params_default = sw.cov_cluster(self, groups,
--> 195                                              use_correction=use_correction)
    196 
    197         elif groups.ndim == 2:
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\stats\sandwich_covariance.pyc in cov_cluster(results, group, use_correction)
    528     '''
    529     #TODO: currently used version of groupsums requires 2d resid
--> 530     xu, hessian_inv = _get_sandwich_arrays(results)
    531 
    532     if not hasattr(group, 'dtype') or group.dtype != np.dtype('int'):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\stats\sandwich_covariance.pyc in _get_sandwich_arrays(results)
    239         elif hasattr(results.model, 'score_obs'):
    240             xu = results.model.score_obs(results.params)
--> 241             hessian_inv = np.linalg.inv(results.model.hessian(results.params))
    242         else:
    243             xu = results.model.wexog * results.wresid[:, None]
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in hessian(self, params, scale, observed)
    418         """
    419 
--> 420         factor = self.hessian_factor(params, scale=scale, observed=observed)
    421         hess = -np.dot(self.exog.T * factor, self.exog)
    422         return hess
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in hessian_factor(self, params, scale, observed)
    378             raise RuntimeError('something wrong')
    379 
--> 380         tmp = self.family.variance(mu) * self.family.link.deriv2(mu)
    381         tmp += self.family.variance.deriv(mu) * self.family.link.deriv(mu)
    382 
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\families\links.pyc in deriv2(self, p)
     69         from statsmodels.tools.numdiff import approx_fprime_cs
     70         # TODO: workaround proplem with numdiff for 1d
---> 71         return np.diag(approx_fprime_cs(p, self.deriv))
     72 
     73     def inverse_deriv(self, z):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\tools\numdiff.pyc in approx_fprime_cs(x, f, epsilon, args, kwargs)
    182     n = len(x)
    183     epsilon = _get_epsilon(x, 1, epsilon, n)
--> 184     increments = np.identity(n) * 1j * epsilon
    185     #TODO: see if this can be vectorized, but usually dim is small
    186     partials = [f(x+ih, *args, **kwargs).imag / epsilon[i]
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\numeric.pyc in identity(n, dtype)
   2306     """
   2307     from numpy import eye
-> 2308     return eye(n, dtype=dtype)
   2309 
   2310 def allclose(a, b, rtol=1.e-5, atol=1.e-8, equal_nan=False):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\lib\twodim_base.pyc in eye(N, M, k, dtype)
    231     if M is None:
    232         M = N
--> 233     m = zeros((N, M), dtype=dtype)
    234     if k >= M:
    235         return m
MemoryError: 
What am I doing wrong?
Thanks!

josef...@gmail.com
Sep 8, 2016, 12:51:57 AM
to pystatsmodels
What's your sample size (nobs), your number of exog, and how many
unique values are in your groups?
And which version of statsmodels?

It breaks in the hessian_factor calculation, which is slow, but I
think it shouldn't use a lot of memory.
I'm not sure why it is using the numerical derivative of the link at
the end; I thought this had been changed to analytical derivatives.

I don't see anything wrong in your usage of the model, so it's the
size of your data and/or some inefficiencies in the implementation.

Also, did you try discrete.Logit for this case, which uses a more
specific Hessian calculation? AFAIR, there are some redundancies in
the GLM Hessian with a canonical link like the binomial logit.
Using Logit would just be an alternative; if there are problems in
GLM, they still need to be fixed.

Josef

Charles Martineau
Sep 8, 2016, 1:57:05 AM
to pystatsmodels
Hi Josef,

thanks for your response. I have 84,000 observations in my regression and 10 indep variables. My dep variable is between 0 and 1: 20% of my data equals 0 and 20% equals 1. Maybe it is because of the distribution of the data? The histogram of the data looks like a U-shape. I have no problem running the GLM without the robust standard errors.

My version of statsmodels is 0.6.1

Can you expand on discrete.Logit ... ? How can I call this?

Thanks Josef,

josef...@gmail.com
Sep 8, 2016, 8:35:48 AM
to pystatsmodels
On Thu, Sep 8, 2016 at 1:57 AM, Charles Martineau
<martinea...@gmail.com> wrote:
> Hi Josef,
>
> thanks for your response. I have 84,000 observations in my regression and
> 10 indep variables. My dep variable is between 0 and 1: 20% of my data
> equals 0 and 20% equals 1. Maybe it is because of the distribution of the
> data? The histogram of the data looks like a U-shape. I have no problem
> running the GLM without the robust standard errors.

nobs=84000 is much larger than our test cases, but still small.
There shouldn't be any problems up to at least a few million observations.

However, the numerical Hessian calculation for the link requires the
wrong vectorization for our numdiff functions and builds up
(nobs, nobs) arrays, AFAICS and AFAIR.
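
To put a number on that, a quick back-of-the-envelope sketch (the
nobs figure is from this thread; float64 entries are an assumption):

nobs = 84000
# the numdiff path calls np.identity(nobs), i.e. one dense
# (nobs, nobs) float64 array of nobs * nobs * 8 bytes
print(nobs * nobs * 8 / 1024.**3)  # ~52.6 GiB -> MemoryError

That single identity matrix already needs about 52.6 GiB, which
matches the traceback failing inside np.eye before any statistics
are computed.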

The Logit link has been fixed, but CDFLink, including the Probit
link, still has that problem. (I will open an issue for CDFLink, but
it will not be fixed soon.)

>
> My version of statsmodels is 0.6.1

If you don't want to work with Logit, then you need to update statsmodels.

>
> Can you expand on discrete.Logit ... ? How can I call this?

`sm.Logit` is defined in discrete/discrete_model:
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html
There are also examples/notebooks in the documentation.
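
For example, a minimal sketch using the same (hypothetical) names as
your first message, assuming `df['ca']` is a 0/1 outcome:

import numpy as np
import statsmodels.api as sm

# same call pattern as the GLM version, but through discrete.Logit;
# cov_type and cov_kwds are passed through fit() in the same way
logit_mod = sm.Logit(df['ca'], df[int_col_s])
logit_res = logit_mod.fit(disp=False, cov_type='cluster',
                          cov_kwds={'groups': np.asarray(df['Ticker'])})
print(logit_res.summary())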

The results should be identical up to convergence precision, but GLM
and the `discrete` models have some non-overlapping post-estimation
features.
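
As a rough consistency check (continuing the sketch above), the two
parameter estimates can be compared directly:

glm_res = sm.GLM(df['ca'], df[int_col_s],
                 family=sm.families.Binomial()).fit()
logit_res = sm.Logit(df['ca'], df[int_col_s]).fit(disp=False)
# should be near zero, up to the optimizers' convergence tolerance
print(np.max(np.abs(glm_res.params - logit_res.params)))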

Josef