OLS hac-panel + cluster standard errors

Giacomo Marangoni

Apr 14, 2016, 9:16:52 AM

to pystatsmodels

Dear all,
is it possible to specify both 'hac-panel' and an index to cluster the standard errors for robust OLS covariance calculations?

Thanks,

Giacomo M

josef...@gmail.com

Apr 14, 2016, 10:01:44 AM

to pystatsmodels

On Thu, Apr 14, 2016 at 9:01 AM, Giacomo Marangoni <jack...@gmail.com> wrote:

Dear all,
is it possible to specify both 'hac-panel' and an index to cluster the standard errors for robust OLS covariance calculations?

I'm not sure I understand what you mean.

`hac-panel` (the keyword is actually `nw-panel`) calculates the HAC kernel sum for each time series defined by groups, and then aggregates, if I read the code and remember correctly.

Based on the code in regression:

The group_idx is calculated internally from the time index, under the assumption that we have equally spaced time periods with no missing values in the interior (time series for individual panel units can differ in length, as in an unbalanced panel, but only by truncation at the beginning or end).

It looks like the time index is only used to calculate where panel units begin in the array. The time index or period labels themselves are not used.
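As a rough illustration of the point above, here is a minimal sketch of how unit boundaries could be recovered from the time index alone. It is not the statsmodels code: `panel_slices` is a made-up name, and the sketch assumes the data are stacked unit by unit and sorted by time within each unit.

```python
def panel_slices(time):
    """Return (start, stop) index pairs, one per panel unit.

    Hypothetical helper: a new unit is assumed to start wherever the
    time index drops compared with the previous row.
    """
    starts = [i for i in range(1, len(time)) if time[i] < time[i - 1]]
    bounds = [0] + starts + [len(time)]
    return list(zip(bounds[:-1], bounds[1:]))


# Three units stacked on top of each other, truncated at different ends.
time = [1, 2, 3, 2, 3, 4, 1, 2]
slices = panel_slices(time)
print(slices)
```

Only the positions of the drops matter in this sketch. The period labels are never compared across units, so an interior gap such as 1, 2, 4 would go unnoticed and the three rows would be treated as consecutive periods.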

`nw-groupsum` (Driscoll-Kraay) uses time periods as labels to sum over all observations with the same time label, and then calculates the HAC kernel over the sums for each period, assuming that the array of cross-section sums is a time series with equally spaced periods.
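The two steps just described can be sketched in plain Python for a single regressor. This is an illustration under assumptions, not the library implementation: the function names are invented, the scores stand for regressor times residual, and Bartlett weights are assumed as the kernel.

```python
from collections import defaultdict


def bartlett_weights(maxlags):
    """Bartlett (Newey-West) weights for lags 0..maxlags."""
    return [1.0 - lag / (maxlags + 1.0) for lag in range(maxlags + 1)]


def groupsum_hac_meat(scores, time, maxlags):
    """Scalar 'meat' of a group-sum HAC estimate (illustrative only)."""
    # Step 1: sum the scores of all observations sharing a time label.
    sums = defaultdict(float)
    for score, period in zip(scores, time):
        sums[period] += score
    # Sorted labels are treated as consecutive, equally spaced periods.
    series = [sums[period] for period in sorted(sums)]

    # Step 2: kernel-weighted sum of autocovariances of the period sums.
    weights = bartlett_weights(maxlags)
    meat = weights[0] * sum(value * value for value in series)
    for lag in range(1, maxlags + 1):
        cross = sum(series[i] * series[i - lag] for i in range(lag, len(series)))
        meat += 2.0 * weights[lag] * cross
    return meat


scores = [1.0, 1.0, 2.0, 0.0, 1.0, 1.0]  # e.g. x * residual per observation
time = [1, 2, 3, 1, 2, 3]                # two units, three periods each
meat = groupsum_hac_meat(scores, time, maxlags=1)
print(meat)
```

Because the sums are keyed by label, a unit that is missing in some period simply contributes nothing to that period's sum. The series of sums itself is still treated as equally spaced.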

cluster_2groups: this just aggregates according to the labels of the two groups.
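For two-way clustering, the usual construction (as in Petersen and in Cameron, Gelbach and Miller) adds the two one-way cluster terms and subtracts the term for their intersection. Below is a minimal scalar sketch of that aggregation. The function names are invented for illustration and small-sample corrections are left out.

```python
from collections import defaultdict


def cluster_meat(scores, labels):
    """Sum of squared within-cluster score sums for one set of labels."""
    sums = defaultdict(float)
    for score, label in zip(scores, labels):
        sums[label] += score
    return sum(value * value for value in sums.values())


def two_way_cluster_meat(scores, groups, time):
    """One-way term for each label set minus the intersection term."""
    intersection = list(zip(groups, time))  # one label per (group, time) cell
    return (cluster_meat(scores, groups)
            + cluster_meat(scores, time)
            - cluster_meat(scores, intersection))


scores = [1.0, 2.0, 3.0, 4.0]   # e.g. x * residual per observation
groups = ["a", "a", "b", "b"]
time = [1, 2, 1, 2]
two_way = two_way_cluster_meat(scores, groups, time)
print(two_way)
```

Only the labels enter here. Nothing in the calculation uses the ordering or spacing of the time periods.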

Not implemented:

Unequally spaced HAC plus groups:

An *obvious* extension would have been to allow for kernels as in Newey-West or similar for arbitrary distance measures based on time periods interpreted in continuous time (or points in space, or any other distance measure), and to allow for groups in another direction.

This would interpret the "time" index as the actual location for calculating the distance between two observations, and `groups` as an index for a discrete 0-1 distance.

I gave up on implementing this because I didn't find a reference and it got a bit messy to implement. IIRC I stopped halfway through implementing this generic kernel covariance.

(Now that I think about it again, this might be a similar application to the product kernels for mixed continuous and discrete variables in KDE and kernel regression.)
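To make the idea concrete, here is a small sketch of what such a product weight could look like: a continuous kernel in the time distance multiplied by a 0-1 indicator for group membership. This is purely hypothetical and not part of statsmodels. The names, the Bartlett-shaped kernel and the bandwidth parameter are assumptions chosen for illustration.

```python
def bartlett_kernel(distance, bandwidth):
    """Triangular kernel: 1 at distance 0, falling to 0 at the bandwidth."""
    return max(0.0, 1.0 - abs(distance) / bandwidth)


def pair_weight(t_i, t_j, g_i, g_j, bandwidth):
    """Product of a continuous time kernel and a discrete 0-1 group kernel."""
    same_group = 1.0 if g_i == g_j else 0.0
    return same_group * bartlett_kernel(t_i - t_j, bandwidth)


def kernel_meat(scores, time, groups, bandwidth):
    """Scalar 'meat': weighted sum over all pairs of observations."""
    n = len(scores)
    return sum(
        pair_weight(time[i], time[j], groups[i], groups[j], bandwidth)
        * scores[i] * scores[j]
        for i in range(n)
        for j in range(n)
    )


scores = [1.0, 2.0, 3.0]
time = [0.0, 0.5, 3.0]          # unequally spaced observation times
groups = ["a", "a", "b"]
meat = kernel_meat(scores, time, groups, bandwidth=2.0)
print(meat)
```

Whether a weighting like this yields a positive semidefinite estimate in general is not something this sketch checks.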

Does this help, or can you clarify your question?

Josef

Giacomo Marangoni

Apr 14, 2016, 6:19:14 PM

to pystatsmodels

Thanks a lot Josef. It definitely helps. Just a few more questions: if I'm fitting an OLS object where my variables are defined on both "individual" and "time" indices, and I have individual fixed effects and potential serial correlation over time, I could use .fit(cov_type='nw-panel', cov_kwds={'groups':'individual', 'time':'time'}), correct? If I have both individual and time fixed effects, should I use cluster_2groups? In this case, how do I specify two groups in cov_kwds?
Fortunately I have equally spaced time series, although not always complete, so I'll have to interpolate.

Thanks again for your help, much appreciated,

Giacomo

josef...@gmail.com

Apr 14, 2016, 7:04:53 PM

to pystatsmodels

On Thu, Apr 14, 2016 at 6:19 PM, Giacomo Marangoni <jack...@gmail.com> wrote:

Thanks a lot Josef. It definitely helps. Just a few more questions: if I'm fitting an OLS object where my variables are defined on both "individual" and "time" indices, and I have individual fixed effects and potential serial correlation over time, I could use .fit(cov_type='nw-panel', cov_kwds={'groups':'individual', 'time':'time'}), correct?

No, in this case `groups` will be silently ignored.

Silently, because there is no check of which keywords apply in each case.

It's recommended to check that a keyword actually has an effect on the standard errors by running with and without the keyword.
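A with-and-without comparison of this kind can be wrapped in a small helper. The sketch below is generic and does not call statsmodels: `keyword_has_effect` and its `get_bse` callable are invented for illustration, and `fake_fit` is only a stand-in for a real fit so that the example runs on its own.

```python
def keyword_has_effect(get_bse, cov_kwds, key, tol=1e-12):
    """Compare standard errors with and without one cov_kwds entry.

    get_bse is any callable that takes a cov_kwds dict and returns a
    sequence of standard errors (hypothetical interface).
    """
    with_key = list(get_bse(dict(cov_kwds)))
    reduced = {k: v for k, v in cov_kwds.items() if k != key}
    without_key = list(get_bse(reduced))
    return any(abs(a - b) > tol for a, b in zip(with_key, without_key))


def fake_fit(cov_kwds):
    # Stand-in for a real fit: it reads 'maxlags' but never looks at 'groups'.
    return [0.1 * (1 + cov_kwds.get("maxlags", 0)), 0.2]


kwds = {"groups": [0, 0, 1, 1], "maxlags": 2}
groups_matters = keyword_has_effect(fake_fit, kwds, "groups")
maxlags_matters = keyword_has_effect(fake_fit, kwds, "maxlags")
print(groups_matters, maxlags_matters)
```

With a real model, `get_bse` would wrap the `.fit(...)` call and return the `bse` attribute of the results. A keyword that the chosen `cov_type` requires cannot be tested this way, since dropping it would raise an error instead.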

If I have both individual and time fixed effects, should I use cluster_2groups? In this case, how do I specify two groups in cov_kwds?
Fortunately I have equally spaced time series, although not always complete, so I'll have to interpolate.

`nw-groupsum` (Driscoll-Kraay) can handle gaps in individual time series, but I doubt I wrote a unit test for that.

For two-way clustering, `groups` should be either a 2d array of shape (nobs, 2) (*) or a tuple or similar with two 1d arrays,

e.g. groups=(self.groups, self.time)

(If `groups` doesn't have a `shape` attribute, then we do `np.asarray(groups).T` to get the two arrays into columns. This has the same effect as column_stack.)

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/tests/test_robustcov.py#L627

cov_type='cluster', cov_kwds={'groups': groups}

should do it

I often have to check the unit tests to see what is actually used. (unit tests are a bit spread out, but the above is the main one for linear regression models.)

(*) I'm not sure whether groups can be a DataFrame.

Josef

josef...@gmail.com

Apr 14, 2016, 7:21:49 PM

to pystatsmodels

related but different story:

I was recently reading up on unbalanced panel data where the missing observations (the unbalancedness) are not exogenous, i.e. not missing completely at random. One approach in statistics is to use something similar to inverse propensity score weighting if we have explanatory variables for the missingness probability. So it's related to the work on getting weights into all models outside of tsa.

The idea that could be used here is to complete the panel to have no gaps, but set the weight of the filled rows to zero. I haven't tried it yet but it should work for the main results. One problem with this is that the degrees of freedom are wrong, which would have to be fixed up.
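A minimal sketch of the completion step, assuming integer period labels and a plain dict as input. `complete_panel` and its row layout are invented for illustration, and as noted above the approach itself is untried.

```python
def complete_panel(observed, fill_value=0.0):
    """Fill interior and edge gaps so every unit has every period.

    observed: dict mapping (unit, period) -> value, with integer periods.
    Returns rows (unit, period, value, weight); filled rows get weight 0.
    """
    units = sorted({unit for unit, _ in observed})
    periods = [period for _, period in observed]
    full_range = range(min(periods), max(periods) + 1)
    rows = []
    for unit in units:
        for period in full_range:
            if (unit, period) in observed:
                rows.append((unit, period, observed[(unit, period)], 1.0))
            else:
                rows.append((unit, period, fill_value, 0.0))
    return rows


observed = {("a", 1): 2.0, ("a", 3): 4.0,
            ("b", 1): 1.0, ("b", 2): 1.5, ("b", 3): 3.0}
rows = complete_panel(observed)
n_rows = len(rows)
n_effective = sum(1 for row in rows if row[3] > 0)
print(n_rows, n_effective)
```

The difference between `n_rows` and `n_effective` is the degrees-of-freedom problem: a weighted fit on the completed panel would count the filled rows as observations unless that is corrected.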

Josef

josef...@gmail.com

Apr 15, 2016, 5:34:10 PM

to pystatsmodels

I haven't advertised it in this thread yet:

http://nbviewer.jupyter.org/github/vgreg/python-se/blob/master/Standard%20errors%20in%20Python.ipynb

which I find easier to read or skim than the blog

http://www.vincentgregoire.com/standard-errors-in-python/#OLS-Coefficients-and-White-Standard-Errors

Some of this is largely the same as the unit tests, because I was also using Petersen as the main reference for panel-robust standard errors.