Memory Error for Robust Standard Errors Using GLM

Charles Martineau
Sep 8, 2016, 12:02:34 AM
to pystatsmodels
I have been trying to implement some of Josef's suggestions here on how to compute robust standard errors for GLM (binomial logit, to be exact), but I am running into MemoryError messages.

For example, I do the following:

mod1 = sm.GLM(df['ca'], df[int_col_s], family=sm.families.Binomial())  # int_col_s is a list of the x variable names
res1 = mod1.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=np.array(df.Ticker)))
# or
res1 = mod1.fit(disp=False, cov_type='HC1', cov_kwds={})

and I get this error:

MemoryError                               Traceback (most recent call last)
<ipython-input-207-98c3158770e6> in <module>()
      1 ca_['CA_t'] = (ca_['CA']+1)/2
      2 mod1 = sm.GLM(ca_['CA_t'], ca_[int_col_s], family=sm.families.Binomial())
----> 3 mod1.fit(disp=False, cov_type='cluster', cov_kwds=dict(groups=np.array(df.Ticker))) 
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, **kwargs)
    727                                  self.scale,
    728                                  cov_type=cov_type, cov_kwds=cov_kwds,
--> 729                                  use_t=use_t)
    730 
    731         history['iteration'] = iteration + 1
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in __init__(self, model, params, normalized_cov_params, scale, cov_type, cov_kwds, use_t)
    920                 cov_kwds = {}
    921             get_robustcov_results(self, cov_type=cov_type, use_self=True,
--> 922                                        use_t=use_t, **cov_kwds)
    923 
    924     @cache_readonly
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\base\covtype.py in get_robustcov_results(self, cov_type, use_t, **kwds)
    193                 self.n_groups = n_groups = len(np.unique(groups))
    194             res.cov_params_default = sw.cov_cluster(self, groups,
--> 195                                              use_correction=use_correction)
    196 
    197         elif groups.ndim == 2:
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\stats\sandwich_covariance.pyc in cov_cluster(results, group, use_correction)
    528     '''
    529     #TODO: currently used version of groupsums requires 2d resid
--> 530     xu, hessian_inv = _get_sandwich_arrays(results)
    531 
    532     if not hasattr(group, 'dtype') or group.dtype != np.dtype('int'):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\stats\sandwich_covariance.pyc in _get_sandwich_arrays(results)
    239         elif hasattr(results.model, 'score_obs'):
    240             xu = results.model.score_obs(results.params)
--> 241             hessian_inv = np.linalg.inv(results.model.hessian(results.params))
    242         else:
    243             xu = results.model.wexog * results.wresid[:, None]
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in hessian(self, params, scale, observed)
    418         """
    419 
--> 420         factor = self.hessian_factor(params, scale=scale, observed=observed)
    421         hess = -np.dot(self.exog.T * factor, self.exog)
    422         return hess
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\generalized_linear_model.pyc in hessian_factor(self, params, scale, observed)
    378             raise RuntimeError('something wrong')
    379 
--> 380         tmp = self.family.variance(mu) * self.family.link.deriv2(mu)
    381         tmp += self.family.variance.deriv(mu) * self.family.link.deriv(mu)
    382 
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\genmod\families\links.pyc in deriv2(self, p)
     69         from statsmodels.tools.numdiff import approx_fprime_cs
     70         # TODO: workaround proplem with numdiff for 1d
---> 71         return np.diag(approx_fprime_cs(p, self.deriv))
     72 
     73     def inverse_deriv(self, z):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\tools\numdiff.pyc in approx_fprime_cs(x, f, epsilon, args, kwargs)
    182     n = len(x)
    183     epsilon = _get_epsilon(x, 1, epsilon, n)
--> 184     increments = np.identity(n) * 1j * epsilon
    185     #TODO: see if this can be vectorized, but usually dim is small
    186     partials = [f(x+ih, *args, **kwargs).imag / epsilon[i]
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\numeric.pyc in identity(n, dtype)
   2306     """
   2307     from numpy import eye
-> 2308     return eye(n, dtype=dtype)
   2309 
   2310 def allclose(a, b, rtol=1.e-5, atol=1.e-8, equal_nan=False):
C:\Users\chamar.stu\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\lib\twodim_base.pyc in eye(N, M, k, dtype)
    231     if M is None:
    232         M = N
--> 233     m = zeros((N, M), dtype=dtype)
    234     if k >= M:
    235         return m
MemoryError: 
What am I doing wrong?
Thanks!

josef...@gmail.com
Sep 8, 2016, 12:51:57 AM
to pystatsmodels
What's your sample size (nobs), your number of exog, and how many
unique values are in your groups?
And which version of statsmodels?

It breaks in the hessian_factor calculation, which is slow, but I
think it shouldn't use a lot of memory.
I'm not sure why it is using the numerical derivative of the link at
the end; I thought this had been changed to analytical derivatives.

I don't see anything wrong in your usage of the model, so it's the
size of your data and/or some inefficiencies in the implementation.

Also, did you try discrete.Logit for this case, which uses a more
specific Hessian calculation? AFAIR, there are some redundancies in
the GLM Hessian with a canonical link like the binomial logit.
Using Logit would just be an alternative; if there are problems in
GLM, they still need to be fixed.

Josef

Charles Martineau
Sep 8, 2016, 1:57:05 AM
to pystatsmodels
Hi Josef,

thanks for your response. I have 84,000 observations in my regression and 10 indep variables. My dep variable is between 0 and 1: 20% of my data equals 0 and 20% equals 1. Maybe it is because of the distribution of the data? The histogram of the data looks like a U-shape. I have no problem running the GLM without the robust standard errors.

My version of statsmodels is 0.6.1

Can you expand on discrete.Logit ... ? How can I call this?

Thanks Josef,

josef...@gmail.com
Sep 8, 2016, 8:35:48 AM
to pystatsmodels
On Thu, Sep 8, 2016 at 1:57 AM, Charles Martineau
<martinea...@gmail.com> wrote:
> Hi Josef,
>
> thanks for your response. I have 84,000 observations in my regression and
> 10 indep variables. My dep variable is between 0 and 1: 20% of my data
> equals 0 and 20% equals 1. Maybe it is because of the distribution of the
> data? The histogram of the data looks like a U-shape. I have no problem
> running the GLM without the robust standard errors.

nobs=84000 is much larger than our test cases, but still small.
There shouldn't be any problems up to at least a few million observations.

However, the numerical Hessian calculation for the link requires the
wrong vectorization for our numdiff functions and builds up
(nobs, nobs) arrays, AFAICS and AFAIR.
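
To put a number on that, a quick back-of-the-envelope sketch (the
nobs figure is from this thread; float64 entries are an assumption):

nobs = 84000
# the numdiff path calls np.identity(nobs), i.e. one dense
# (nobs, nobs) float64 array of nobs * nobs * 8 bytes
print(nobs * nobs * 8 / 1024.**3)  # ~52.6 GiB -> MemoryError

That single identity matrix already needs about 52.6 GiB, which
matches the traceback failing inside np.eye before any statistics
are computed.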

The Logit link has been fixed, but CDFLink, including the Probit
link, still has that problem. (I will open an issue for CDFLink, but
it will not be fixed soon.)

>
> My version of statsmodels is 0.6.1

If you don't want to work with Logit, then you need to update statsmodels.

>
> Can you expand on discrete.Logit ... ? How can I call this?

`sm.Logit` is defined in discrete/discrete_model:
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html
There are also examples/notebooks in the documentation.
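
For example, a minimal sketch using the same (hypothetical) names as
your first message, assuming `df['ca']` is a 0/1 outcome:

import numpy as np
import statsmodels.api as sm

# same call pattern as the GLM version, but through discrete.Logit;
# cov_type and cov_kwds are passed through fit() in the same way
logit_mod = sm.Logit(df['ca'], df[int_col_s])
logit_res = logit_mod.fit(disp=False, cov_type='cluster',
                          cov_kwds={'groups': np.asarray(df['Ticker'])})
print(logit_res.summary())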

The results should be identical up to convergence precision, but GLM
and the `discrete` models have some non-overlapping post-estimation
features.
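
As a rough consistency check (continuing the sketch above), the two
parameter estimates can be compared directly:

glm_res = sm.GLM(df['ca'], df[int_col_s],
                 family=sm.families.Binomial()).fit()
logit_res = sm.Logit(df['ca'], df[int_col_s]).fit(disp=False)
# should be near zero, up to the optimizers' convergence tolerance
print(np.max(np.abs(glm_res.params - logit_res.params)))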

Josef