This is great! I’ve got the model from Stata as:

. nbreg terrorCounts rivalry jointDem1 logcapratio contiguity, nolog cluster(dyadid)

Negative binomial regression                    Number of obs   =      65538
Dispersion           = mean                     Wald chi2(4)    =      90.22
Log pseudolikelihood = -4031.0544               Prob > chi2     =     0.0000

                             (Std. Err. adjusted for 2402 clusters in dyadid)
------------------------------------------------------------------------------
             |               Robust
terrorCounts |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     rivalry |   2.346538   .4670086     5.02   0.000     1.431218    3.261858
   jointDem1 |   1.323547   .2777433     4.77   0.000     .7791798    1.867913
 logcapratio |   -.152246   .0691311    -2.20   0.028    -.2877405   -.0167514
  contiguity |    1.01714    .255236     3.99   0.000     .5168868    1.517394
       _cons |  -5.350969   .1810288   -29.56   0.000    -5.705779   -4.996159
-------------+----------------------------------------------------------------
    /lnalpha |   4.171082   .1462081                      3.884519    4.457644
-------------+----------------------------------------------------------------
       alpha |   64.78549   9.472164                      48.64354    86.28401
------------------------------------------------------------------------------

So it looks like things match up nicely. I don’t have any R results to compare to, since there doesn’t seem to be an easy, straightforward implementation of clustered standard errors in R. I can link to an implementation that I’ve been using, but it doesn’t match up to the Stata output as well as what you have in the gist.
import numpy as np
from statsmodels.stats.sandwich_covariance import S_crosssection

def cov_cluster_mle(jac, hessian_inv, groups, use_correction=True):
    S = S_crosssection(jac, groups)
    # the next part is _HCCM, which shouldn't require results
    xxi = hessian_inv
    cov = np.dot(np.dot(xxi, S), xxi.T)
    if use_correction:
        # small-sample correction, as in Stata's cluster()
        nobs, k_params = jac.shape
        n_groups = len(np.unique(groups))
        corr_fact = n_groups / (n_groups - 1.) * ((nobs - 1.) / float(nobs - k_params))
        cov *= corr_fact
    return cov
import statsmodels.stats.sandwich_covariance as sc
from statsmodels.stats.sandwich_covariance import cov_cluster

# with jac from the NegativeBinomial model and cluster variable gr
cov = cov_cluster_mle(jac, res.normalized_cov_params, gr, use_correction=True)
print(sc.se_cov(cov))

jac_poi = res_poi.model.jac(res_poi.params)
cov_poi = cov_cluster_mle(jac_poi, res_poi.normalized_cov_params, gr, use_correction=True)
clustered_poi = cov_cluster(res_poi, gr)

print('\nbse poisson non-robust')
print(np.asarray(res_poi.bse))
print('\nbse poisson cluster-robust, current and new cov_cluster')
print(sc.se_cov(clustered_poi))
print(sc.se_cov(cov_poi))
bse poisson non-robust
[ 0.05387085  0.06832878  0.06318283  0.02102654  0.06584392]

bse poisson cluster-robust, current and new cov_cluster
[ 0.23681477  0.56355749  0.2755732   0.12775415  0.56583234]
[ 0.23681477  0.56355749  0.2755732   0.12775415  0.56583234]
Just FYI, I updated my clustered_se module today: https://github.com/spillz/sci-comp/blob/master/statsmodels-fun/clustered_se.py
The biggest change is to use the fit result as the function argument in place of the (model, params) pair.
The only reason I mention it is that it would be good to take over the multi-way clustering logic I have there, which handles arbitrarily many clustering variables. (This is discussed in the Cameron et al. paper and on Mitch Petersen's website.)
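To illustrate the multi-way idea (my own minimal sketch for OLS, not the code in the linked module): for two clustering variables the Cameron/Gelbach/Miller estimator is V = V1 + V2 - V12, where each term is a one-way cluster-robust covariance and V12 clusters on the intersection of the two variables.

```python
import numpy as np

def ols_cluster_cov(X, resid, groups):
    """One-way cluster-robust covariance for OLS:
    (X'X)^-1 (sum_g u_g u_g') (X'X)^-1, with u_g = X_g' e_g."""
    xxi = np.linalg.inv(X.T @ X)
    scores = X * resid[:, None]
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        u_g = scores[groups == g].sum(axis=0)
        meat += np.outer(u_g, u_g)
    return xxi @ meat @ xxi

def ols_twoway_cluster_cov(X, resid, g1, g2):
    """Two-way clustering a la Cameron/Gelbach/Miller: V1 + V2 - V12,
    where V12 clusters on the intersection of g1 and g2."""
    g12 = np.array([f"{a}|{b}" for a, b in zip(g1, g2)])
    return (ols_cluster_cov(X, resid, g1)
            + ols_cluster_cov(X, resid, g2)
            - ols_cluster_cov(X, resid, g12))

# toy demo: OLS residuals, then the two-way covariance
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
g1 = rng.integers(0, 10, size=n)
g2 = rng.integers(0, 8, size=n)
print(np.diag(ols_twoway_cluster_cov(X, resid, g1, g2)))
```

Note that the two-way covariance is not guaranteed positive semi-definite in small samples, which is one of the practical wrinkles discussed in the paper.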
My use case: diffs is a list of before/after-treatment differences for my samples (which come from a number of groups). The model is just diffs ~ const, and I get the clustered standard errors with
get_robustcov_results(cov_type='cluster', df_correction=True, groups=group_IDs)