clustered standard errors


John B

Dec 8, 2013, 6:26:08 PM
to pystat...@googlegroups.com
Hi All,

I'm trying to estimate clustered standard errors for a negative binomial model. The code I'm trying to use is located in a gist at https://gist.github.com/johnb30/7865106 along with a sample of the data. The traceback I get when I run that code is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-75-878c59a444a8> in <module>()
----> 1 temp = cov_cluster(res, dat['dyadid'])

/Library/Python/2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in cov_cluster(results, group, use_correction)
    501     n_groups = len(np.unique(group)) #replace with stored group attributes if available
    502
--> 503     cov_c = _HCCM2(results, scale)
    504
    505     if use_correction:

/Library/Python/2.7/site-packages/statsmodels/stats/sandwich_covariance.pyc in _HCCM2(results, scale)
    270
    271     xxi = results.normalized_cov_params
--> 272     H = np.dot(np.dot(xxi, scale), xxi.T)
    273     return H
    274

ValueError: matrices are not aligned

Let me know if more information is needed to figure out what's going on.

josef...@gmail.com

Dec 8, 2013, 7:09:12 PM
to pystatsmodels
This is not working yet; the sandwich covariances are currently only supported after OLS.

In the meantime, maybe Damian's version helps; see the linked issue and code:
https://github.com/statsmodels/statsmodels/issues/863

https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/clustered_se.py


My plan is to connect the sandwich covariances with the discrete
models in the next few weeks.
A new OLS method is already in master that establishes the pattern.


The sandwich_covariance functions currently assume that the score is
exog * resid. From the source of NegativeBinomial I can't tell right
now whether resid has the right form.

However, your traceback suggests that NegativeBinomial will be more
difficult to combine, because we have the extra parameter, which I
assume causes the `matrices are not aligned` error.
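To illustrate the kind of mismatch (a sketch with made-up dimensions, not the actual statsmodels internals): the normalized_cov_params of NegativeBinomial includes the extra dispersion parameter, while a middle matrix built only from exog * resid does not, so the dot products in _HCCM2 cannot align:

```python
import numpy as np

k_vars, k_params = 5, 6   # NegativeBinomial: k_params = k_vars + 1 (alpha)
xxi = np.eye(k_params)    # stand-in for normalized_cov_params
scale = np.eye(k_vars)    # stand-in for the exog * resid middle term

try:
    np.dot(np.dot(xxi, scale), xxi.T)
except ValueError as err:
    print('mismatch:', err)
```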

So maybe we also need the general score version for NegativeBinomial.
(I doubt that the covariance for NegativeBinomial is block-diagonal,
but I don't know.)
That case needs some generalization of the sandwich covariances, which
I had planned to make soon.

I think it might take just a few lines to get this to work for your
case (without unit tests!), but I would like to get a general solution
so we don't have to almost duplicate the code several times
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1307
and so we only need to add new sandwiches in one place.

Josef

josef...@gmail.com

Dec 8, 2013, 7:48:38 PM
to pystatsmodels
I'm trying to run your example, but I run into problems inverting the Hessian
after dropping the missing rows. I use 'bfgs' because 'newton' fails for me:

>>> print res0.summary()
                     NegativeBinomial Regression Results
==============================================================================
Dep. Variable:           terrorCounts   No. Observations:                  210
Model:               NegativeBinomial   Df Residuals:                      205
Method:                           MLE   Df Model:                            4
Date:                Sun, 08 Dec 2013   Pseudo R-squ.:                 0.02561
Time:                        19:36:33   Log-Likelihood:                -19.335
converged:                      False   LL-Null:                       -19.843
                                        LLR p-value:                    0.9073
===============================================================================
                  coef    std err          z      P>|z|     [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept      -6.0774      4.600     -1.321      0.186       -15.094     2.939
rivalry         0.4251      1.330      0.320      0.749        -2.182     3.032
jointDem1      -1.4072      1.367     -1.029      0.303        -4.087     1.273
logcapratio     1.0382      1.778      0.584      0.559        -2.446     4.523
contiguity      2.7169      3.194      0.851      0.395        -3.542     8.976
alpha           0.2473        nan        nan        nan           nan       nan
===============================================================================

Do you get correct numbers for the `bse` of alpha?

Josef

josef...@gmail.com

Dec 8, 2013, 8:37:48 PM
to pystatsmodels
A few more observations from trying this out:

The functions in sandwich_covariance might not work with pandas data
structures. I'm not sure yet why I get some exceptions.

Missing-value handling is now built into patsy (0.2.x) and is also an
option for the models (missing='drop'), but there is no automatic
handling of the extra arrays required by the sandwiches: groups, time, ...

NegativeBinomial doesn't have `jac`, the score for each observation,
yet, so there is also a bit more work left.

Poisson already seems to work for cov_cluster, because its score has
the exog * resid structure.
Logit also has the same structure.
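As a quick sanity check of that structure (pure numpy, simulated data, not the thread's dataset): the per-observation Poisson score is x_i * (y_i - exp(x_i'b)), which is exactly the exog * resid form, and it can be verified against numerical derivatives of the log-likelihood terms:

```python
import numpy as np

# per-observation Poisson score in the exog * resid form: x_i * (y_i - exp(x_i'b))
def score_obs(params, y, x):
    mu = np.exp(x.dot(params))
    return x * (y - mu)[:, None]

# verify against central finite differences of the log-likelihood terms
# l_i = y_i * x_i'b - exp(x_i'b)   (the log y! term doesn't depend on b)
rng = np.random.RandomState(0)
x = np.column_stack([np.ones(50), rng.randn(50)])
y = rng.poisson(np.exp(0.2 + 0.4 * x[:, 1]))
b = np.array([0.1, 0.3])

eps = 1e-6
num = np.empty((50, 2))
for j in range(2):
    bp, bm = b.copy(), b.copy()
    bp[j] += eps
    bm[j] -= eps
    lp = y * x.dot(bp) - np.exp(x.dot(bp))
    lm = y * x.dot(bm) - np.exp(x.dot(bm))
    num[:, j] = (lp - lm) / (2 * eps)

print(np.allclose(score_obs(b, y, x), num, atol=1e-6))  # True
```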

Josef


josef...@gmail.com

Dec 8, 2013, 10:12:57 PM
to pystatsmodels
It has `score_obs`, which should be `jac`.

Here's a first try:

https://gist.github.com/josef-pkt/7866696

The first version creates a fake instance that holds the attributes
that sandwich_covariance currently expects.
The second version uses the pieces from sandwich_covariance but puts
them together with the right attributes.

This version is not verified, obviously. The OLS version is verified
against Stata or R. The small sample correction is identical to the
Stata default.

Josef

John Beieler

Dec 8, 2013, 10:18:27 PM
to pystat...@googlegroups.com, josef...@gmail.com
This is great! I’ve got the model from Stata as:

. nbreg terrorCounts rivalry jointDem1 logcapratio contiguity, nolog cluster(dyadid) 

Negative binomial regression                      Number of obs   =      65538
Dispersion           = mean                       Wald chi2(4)    =      90.22
Log pseudolikelihood = -4031.0544                 Prob > chi2     =     0.0000

                              (Std. Err. adjusted for 2402 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
terrorCounts |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     rivalry |   2.346538   .4670086     5.02   0.000     1.431218    3.261858
   jointDem1 |   1.323547   .2777433     4.77   0.000     .7791798    1.867913
 logcapratio |   -.152246   .0691311    -2.20   0.028    -.2877405   -.0167514
  contiguity |    1.01714    .255236     3.99   0.000     .5168868    1.517394
       _cons |  -5.350969   .1810288   -29.56   0.000    -5.705779   -4.996159
-------------+----------------------------------------------------------------
    /lnalpha |   4.171082   .1462081                      3.884519    4.457644
-------------+----------------------------------------------------------------
       alpha |   64.78549   9.472164                      48.64354    86.28401
------------------------------------------------------------------------------

So it looks like things match up nicely. I don't have any R results to compare to, since there doesn't seem to be an easy, straightforward implementation of clustered standard errors in R. I can link to an implementation that I've been using, but it doesn't match the Stata output as well as what you have in the gist.

Thanks for this!

Best,

John
-- 
John Beieler
johnbeieler.org

josef...@gmail.com

Dec 8, 2013, 11:27:38 PM
to pystatsmodels
I just realized that I didn't reply to the mailing list. 
The message below includes the calculations, converted to a new function, cov_cluster_mle.


Thanks for checking with Stata.

Here it is rewritten as a function that takes jac, which can be calculated with the discrete models (except that it's currently called score_obs in NegativeBinomial), and hessian_inv, which is the normalized_cov_params attached to the results instance.

In the direct version in the gist, I had used k_vars = exog.shape[1], which ignores the extra parameter. I think it should use the number of parameters, k_params = k_vars + 1 = jac.shape[1].

In your case it doesn't make much difference because the number of explanatory variables is small, and the number of observations and groups is large.

BTW: the version that I implemented in cov_cluster uses np.bincount so it should be reasonably fast for a large number of groups, but it has to loop over the number of explanatory variables which is small in your case.
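For reference, the np.bincount trick works like this: one bincount call per column sums the scores within each group without a Python loop over groups (toy numbers for illustration; it assumes integer group labels 0..n_groups-1):

```python
import numpy as np

# toy scores: 4 observations, 2 "explanatory variables", 2 groups
scores = np.array([[1., 2.],
                   [3., 4.],
                   [5., 6.],
                   [7., 8.]])
groups = np.array([0, 0, 1, 1])   # integer labels 0..n_groups-1

# loop over the (few) columns; bincount sums over the (many) rows per group
group_sums = np.column_stack([np.bincount(groups, weights=scores[:, j])
                              for j in range(scores.shape[1])])
print(group_sums.tolist())  # [[4.0, 6.0], [12.0, 14.0]]
```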


The function below is pretty much how I plan to change sandwich_covariance.py plus make them directly available from the model and results instances.


import numpy as np
import statsmodels.stats.sandwich_covariance as sc
from statsmodels.stats.sandwich_covariance import S_crosssection, cov_cluster


def cov_cluster_mle(jac, hessian_inv, groups, use_correction=True):
    S = S_crosssection(jac, groups)

    # the next part is _HCCM, which shouldn't require results
    xxi = hessian_inv
    cov = np.dot(np.dot(xxi, S), xxi.T)

    if use_correction:
        # add small sample correction
        nobs, k_params = jac.shape
        n_groups = len(np.unique(groups))
        corr_fact = n_groups / (n_groups - 1.) * ((nobs - 1.) / float(nobs - k_params))
        cov *= corr_fact

    return cov


# with jac from NegativeBinomial and group gr
cov = cov_cluster_mle(jac, res.normalized_cov_params, gr, use_correction=True)
print sc.se_cov(cov)


jac_poi = res_poi.model.jac(res_poi.params)
cov_poi = cov_cluster_mle(jac_poi, res_poi.normalized_cov_params, gr, use_correction=True)
clustered_poi = cov_cluster(res_poi, gr)

print '\nbse poisson non-robust'
print np.asarray(res_poi.bse)
print '\nbse poisson cluster-robust, current and new cov_cluster'
print sc.se_cov(clustered_poi)
print sc.se_cov(cov_poi)


bse poisson non-robust
[ 0.05387085  0.06832878  0.06318283  0.02102654  0.06584392]

bse poisson cluster-robust, current and new cov_cluster
[ 0.23681477  0.56355749  0.2755732   0.12775415  0.56583234]
[ 0.23681477  0.56355749  0.2755732   0.12775415  0.56583234]


Josef

josef...@gmail.com

Dec 9, 2013, 10:47:16 PM
to pystatsmodels
I'm trying to find and write a test case against Stata.

In a very small sample, with 34 observations in total and 5 groups, using Poisson, my results differ from Stata's by a proportionality factor: about a 1.5% difference in the bse.
So it looks like Stata is using a different small sample or degrees of freedom correction after Poisson than after a linear model.

reverse engineering:

cov_stata == cov / ((nobs - 1.) / float(nobs - k_vars))

>>> sc.se_cov(cov_clu_poi / ((nobs-1.) / float(nobs - k_vars))) - bse_stata
array([  1.55431223e-15,  -1.13797860e-15])
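That reverse-engineered factor is consistent with the reported ~1.5% difference in the bse. With nobs = 34 and, hypothetically, k_vars = 2 (two parameters, matching the length of the array above):

```python
nobs, k_vars = 34, 2                        # small sample, two parameters
factor_cov = (nobs - 1.) / (nobs - k_vars)  # extra OLS-style factor on the covariance
factor_bse = factor_cov ** 0.5              # corresponding factor on the bse
print(round(factor_bse, 4))  # 1.0155 -> about 1.5% in the bse
```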


Also the robust Wald test is a bit off compared to Stata, another difference in small sample correction, I guess.

Josef

Damien Moore

Dec 12, 2013, 9:57:21 PM
to pystat...@googlegroups.com
Just FYI, I updated my clustered_se module today: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/clustered_se.py

The biggest change is to use the fit result as the function argument, in place of the (model, params) pair.

The only reason I mention it is that it would be good to take over the multi-way clustering logic I have there, which handles arbitrarily many clustering variables. (This is discussed in the Cameron et al. paper and on Mitch Petersen's website.)

josef...@gmail.com

Dec 12, 2013, 10:25:08 PM
to pystatsmodels
On Thu, Dec 12, 2013 at 9:57 PM, Damien Moore <damien...@gmail.com> wrote:

FYI: I also changed sandwich_covariance to use the scores/jac and the Hessian from the models. This should work for all discrete models, but I only have unit tests for Poisson and NegativeBinomial so far, and not yet for the other sandwiches, in case anyone wants to use Newey-West HAC with count data :).
 

> The biggest change is to use the fit result as the function argument in
> place of the model, params pair
>
> The only reason I mention is that it would be good to take the multi-way
> clustering logic I have there. This handles arbitrarily many clustering
> variables. (This is discussed in the Cameron et al paper and on Mitch
> Petersen's website.)

Your multiway version could easily be integrated as an extension of my two-way version.

(I initially had a nice multiway version with sparse boolean addition, IIRC, but I thought it was too cute, so I stuck with the simple two-way clustering that everyone else also has.)

Actually, I would be interested in a timing comparison to see whether your explicit loop over groups or the bincount in my version is faster. Using np.bincount has the disadvantage that I need the group labels as integers.
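For concreteness, here is a sketch of how the Cameron-Gelbach-Miller two-way combination looks on top of a one-way cluster sandwich. The function and variable names are made up for illustration (not the statsmodels API), and it assumes nonnegative integer group labels:

```python
import numpy as np

def cluster_cov(score_obs, hess_inv, groups):
    # one-way cluster sandwich: H^{-1} (sum_g s_g s_g') H^{-1}
    sums = np.array([score_obs[groups == g].sum(axis=0)
                     for g in np.unique(groups)])
    return hess_inv.dot(sums.T.dot(sums)).dot(hess_inv.T)

def two_way_cluster_cov(score_obs, hess_inv, g1, g2):
    # Cameron-Gelbach-Miller: cov(g1) + cov(g2) - cov(intersection of g1, g2)
    inter = g1 * (np.max(g2) + 1) + g2   # unique label per (g1, g2) pair
    return (cluster_cov(score_obs, hess_inv, g1)
            + cluster_cov(score_obs, hess_inv, g2)
            - cluster_cov(score_obs, hess_inv, inter))

rng = np.random.RandomState(0)
scores = rng.randn(30, 3)
g1 = np.repeat(np.arange(5), 6)   # 5 clusters along one dimension
g2 = np.tile(np.arange(6), 5)     # 6 clusters along the other
cov2 = two_way_cluster_cov(scores, np.eye(3), g1, g2)
print(cov2.shape)  # (3, 3)
```

When g1 and g2 coincide, the intersection term cancels one of the two identical one-way terms, so this reduces to ordinary one-way clustering.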

Josef

mishate...@gmail.com

Oct 23, 2014, 2:38:06 PM
to pystat...@googlegroups.com
Hi Josef,
I'm not sure where to ask this question, so maybe this is a good place.
I'm trying to use clustered standard errors for a paired t-test, exactly the use-case described here: http://www.stata.com/statalist/archive/2010-05/msg00663.html
I did the following:
diffs = list of differences before/after treatment for my samples (which come from a number of groups)

I regressed 
diffs ~ const

using OLS. I then used 
get_robustcov_results(cov_type='cluster', df_correction=True, groups=group_IDs)

to get the p-value of the one and only parameter in that model. Is that the right approach? 
Thanks!
Misha

josef...@gmail.com

Oct 23, 2014, 2:58:36 PM
to pystatsmodels
On Thu, Oct 23, 2014 at 2:38 PM, <mishate...@gmail.com> wrote:
> Hi Josef,
> I'm not sure where to ask this question, so maybe this is a good place.
> I'm trying to use clustered standard errors for a paired t-test, exactly the
> use-case described here:
> http://www.stata.com/statalist/archive/2010-05/msg00663.html
> I did the following:
>>
>> diffs = list of differences before/after treatment for my samples (which
>> come from a number of groups)
>
>
> I regressed
>>
>> diffs ~ const
>
>
> using OLS. I then used
>>
>> get_robustcov_results(cov_type='cluster', df_correction=True,
>> groups=group_IDs)
>
>
> to get the p-value of the one and only parameter in that model. Is that the
> right approach?

Yes, that sounds right to me, and it should give you the same numbers as Stata.
IIRC, the defaults with df_correction=True match Stata in the OLS case.
In other models the small sample corrections can be slightly different
from Stata (`nobs` versus `nobs - k_vars` or similar).

Two comments on usage:

If you use master or the 0.6 release candidate, you can specify
cov_type as an argument in model.fit() with the extra keyword arguments
in the cov_kwds dictionary: cov_kwds=dict(df_correction=True,
groups=group_IDs), if I remember the details correctly.

Also, if you want to switch between normal and t distribution you can
use the `use_t` keyword in fit().

Josef