Hansen Hodrick standard errors


esben.h...@me.com

Sep 8, 2015, 3:40:16 PM
to pystatsmodels
How can I use statsmodels OLS to calculate Hansen-Hodrick standard errors?

Suppose I use overlapping data on the left-hand-side, e.g. yearly returns at a monthly frequency.

I want to change the kernel from the default Bartlett kernel used in Newey-West to a constant weight of 1, so I want to write something like
OLS(y, x).fit(cov_type='HAC', cov_kwds={'maxlags': 11, 'kernel': xxx})


Thanks,
Esben

josef...@gmail.com

Sep 8, 2015, 3:58:39 PM
to pystatsmodels
I don't know what Hansen-Hodrick standard errors are.

If you just need a uniform truncated (flat-top) kernel, then that is available, but I don't remember how it is wired up.
In the underlying functions, kernels are just callables or arrays.

I can try to find an example tonight.

Josef

 



josef...@gmail.com

Sep 8, 2015, 7:12:07 PM
to pystatsmodels
kernel is not wired up for HAC, and internally it's called weights_func (which is available in nw-panel, but incorrectly documented).

Something like this **might** work correctly

>>> import statsmodels.stats.sandwich_covariance as sw

>>> res_flatb = mod.fit(cov_type='nw-panel', cov_kwds=dict(time=np.arange(mod.endog.shape[0]), maxlags=4, weights_func=sw.weights_bartlett))
>>> res_flatb.bse
array([ 0.33125645,  0.29582348,  1.18593381])

>>> res_flat = mod.fit(cov_type='nw-panel', cov_kwds=dict(time=np.arange(mod.endog.shape[0]), maxlags=4, weights_func=sw.weights_uniform))
>>> res_flat.bse
array([ 0.33784466,  0.28640521,  1.15945044])
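For intuition, the two `weights_func` callables above differ only in the taper. Here is a minimal numpy sketch of what, as far as I can tell, `sw.weights_bartlett` and `sw.weights_uniform` compute for a given `nlags` (standalone re-implementations for illustration, not the statsmodels source):

```python
import numpy as np

def weights_bartlett(nlags):
    # Bartlett/triangular taper: weight 1 at lag 0, declining linearly
    # toward 0 at lag nlags; this is what Newey-West uses.
    return 1 - np.arange(nlags + 1) / (nlags + 1.0)

def weights_uniform(nlags):
    # Truncated uniform ("flat") kernel: every lag up to nlags gets
    # weight 1, which is what Hansen-Hodrick standard errors call for.
    return np.ones(nlags + 1)

bartlett = weights_bartlett(4)  # 1, 0.8, 0.6, 0.4, 0.2
uniform = weights_uniform(4)    # 1, 1, 1, 1, 1
```

Either function (or any callable with this signature) should be usable as the `weights_func` argument.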


There won't be any unit tests for this yet. In my last round I wrote unit tests and options mostly against Stata, but Stata's `newey` doesn't have a kernel option (and I focused more on panel and cluster robust standard errors).

Wiring up the kernel/weights_func is essentially just a copy-paste of two lines, but I need to see whether we can get unit tests for this fix quickly.

As background: do you know what the standard reference for robust standard errors for overlapping data is?
I haven't come across them before.

Josef

 




josef...@gmail.com

Sep 8, 2015, 7:24:23 PM
to pystatsmodels
There is the comment
"If I'm not mistaken, Hansen-Hodrick SEs are the same as using the
truncated kernel and assuming homoskedasticity."

We don't actually have AC without H; using a truncated uniform or flat-top kernel would be HAC, not AC.
Adding AC is mentioned in https://github.com/statsmodels/statsmodels/issues/1158 but since I have never seen an application of AC standard errors, it never rose high in my priorities.

Josef

esben.h...@me.com

Sep 9, 2015, 10:26:16 AM
to pystatsmodels
Hi Josef,

Thanks a lot, I'll try your suggestion.

The canonical reference is Hansen and Hodrick (1980), "Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis," Journal of Political Economy, Volume 88, Number 5.

The need for Hansen-Hodrick standard errors shows up a lot when working with overlapping data in finance. Suppose you want to predict annual returns on the stock market, but you sample the data monthly. On the left-hand side you now have annual returns, and adjacent observations have 11 months of data in common.

The spectral density matrix S is, as usual, the sum over all cross second moments of g_t = x_t u_t:

S = \sum_{j=-\infty}^{\infty} E(g_t g_{t+j}' )

Newey-West standard errors are good when we don't know the correlation structure. They essentially down-weight the estimates of E(g_t g_{t+j}' ) as j grows.

In the above example, we know the structure of the overlap, and we need to include 11 lags exactly (with equal weight). There's no guarantee that the resulting matrix will be positive definite, but it's the right thing to do.
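The truncated estimator described above can be sketched in a few lines of numpy. This is a sketch of the formula on made-up data, not statsmodels' implementation: the only difference between the Hansen-Hodrick and Newey-West versions is the weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 2
x = np.column_stack([np.ones(T), rng.standard_normal(T)])
u = rng.standard_normal(T)
g = x * u[:, None]  # moment contributions g_t = x_t u_t, shape (T, k)

def long_run_cov(g, maxlags, weights):
    # S = Gamma_0 + sum_{j=1}^{maxlags} w_j (Gamma_j + Gamma_j'),
    # where Gamma_j = (1/T) sum_t g_t g_{t+j}'.
    T = g.shape[0]
    S = g.T @ g / T
    for j in range(1, maxlags + 1):
        gamma_j = g[:-j].T @ g[j:] / T
        S += weights[j] * (gamma_j + gamma_j.T)
    return S

maxlags = 11
w_uniform = np.ones(maxlags + 1)                          # Hansen-Hodrick
w_bartlett = 1 - np.arange(maxlags + 1) / (maxlags + 1.0) # Newey-West
S_hh = long_run_cov(g, maxlags, w_uniform)
S_nw = long_run_cov(g, maxlags, w_bartlett)
```

With the uniform weights, S_hh is symmetric by construction but, as noted, not guaranteed to be positive definite; the Bartlett weights guarantee positive semi-definiteness.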

The comment
"If I'm not mistaken, Hansen-Hodrick SEs are the same as using the truncated kernel and assuming homoskedasticity."
is correct. Hansen and Hodrick wrote their paper before GMM was developed, so they focused on the homoscedastic case. The name stuck, so people now say Hansen-Hodrick standard errors when they use GMM standard errors with a truncated, equal-weight kernel.

Thanks for all your work,
Esben

josef...@gmail.com

Sep 9, 2015, 11:05:13 AM
to pystatsmodels
On Wed, Sep 9, 2015 at 10:26 AM, <esben.h...@me.com> wrote:
> In the above example, we know the structure of the overlap, and we need to include 11 lags exactly (with equal weight). There's no guarantee that the resulting matrix will be positive definite, but it's the right thing to do.

One of the main problems with flat-top, truncated uniform kernels is that the covariance estimate might not be positive definite. I have never tried an example (outside the panel and cluster-robust cases), but this was the original motivation for Newey-West.

I'm not sure whether we should or have to add in any safeguards if we include truncated uniform kernels for time series.

 

> The comment "If I'm not mistaken, Hansen-Hodrick SEs are the same as using the truncated kernel and assuming homoskedasticity." is correct. Hansen and Hodrick wrote their paper before GMM was developed, so they focused on the homoscedastic case.


I found a short section in Cochrane (via Google Books) that mentions that we can just as well add heteroscedasticity robustness when calculating this (roughly paraphrased).

A related question:

If we run a k-step-ahead regression, then the residuals are autocorrelated.
When I was reading ivreg2-related documentation and articles, I saw autocorrelation tests that start only after some given number of lags. At the time I didn't see a use case, but this might be the same application.

Do you have applications also for that?

Josef

esben.h...@me.com

Sep 9, 2015, 2:00:45 PM
to pystatsmodels
Cochrane is a good source for this stuff.

Your suggestion gives the correct answer:
model.fit(cov_type='nw-panel', cov_kwds={'time': np.arange(model.endog.shape[0]), 'maxlags': n, 'weights_func': sw.weights_uniform})
Of course, longer term it would be nice to be able to do this for a time-series without calling the panel-data functionality.

I don't have an explicit application for the autocorrelation example you mention. In financial applications we usually know the overlap, so there's no need to test anything.

Completely unrelated to this: Is there any integration with pandas for panel data? Suppose you have two pandas DataFrames for x and y in which the index is time and the columns are entities. You want to run a panel regression with various options like cluster by time or entity, time FE and so on. Is there an easy way to do this, avoiding stacking the data and setting up the time-index yourself (the stacked dataframe has a multi-level-index, which is annoying in this application). Like writing OLS(y,x).fit('cluster', groups=y.index, ...) or OLS(y,x).fit('cluster', groups=y.columns, fixed_effects=y.time, ...) and then the code figures out the stacking and df corrections?

Esben

josef...@gmail.com

Sep 9, 2015, 2:41:02 PM
to pystatsmodels
On Wed, Sep 9, 2015 at 2:00 PM, <esben.h...@me.com> wrote:
> Your suggestion gives the correct answer. Of course, longer term it would be nice to be able to do this for a time-series without calling the panel-data functionality.

Good. As I mentioned, we just need to copy the lines from nw-panel to HAC to make this work, so it could be done quickly; writing the unit tests is the slow part.

 

> I don't have an explicit application for the autocorrelation example you mention. In financial applications we usually know the overlap, so there's no need to test anything.

I was thinking of tests like the efficient market hypothesis: we know there is correlation in the overlap by construction, but we would like to test whether there is correlation beyond that.

 

> Completely unrelated to this: Is there any integration with pandas for panel data? ... Is there an easy way to do this, avoiding stacking the data and setting up the time-index yourself?

Not yet.

Most plans for panel data require the stacked arrays because estimation with unbalanced panels needs the stacked long panel format. IIRC the panel PR assumes and uses the multi-index.

If you have the stacked data, then it would be easy to create FE dummies with patsy and formulas.

Another issue is that nobody has started migrating the models in pandas.stats to statsmodels.

For the balanced panel data case, there are some models with multivariate y in the (stalled) sysreg PR.

If the main task is to change from wide to long format, then I think it would be possible to add a `from_wide` class method for constructing the model.


However, I don't fully understand your case:
If exog x has several explanatory variables with entity-specific values, then x would have to be 3d. Or do you regress all y columns on the same x matrix?

Also, do you have a balanced panel or missing values in the wide format?


Hopefully we get some of these panel methods merged later this year.

Josef
 


esben.h...@me.com

Sep 9, 2015, 3:39:56 PM
to pystatsmodels
> However, I don't fully understand your case:
> If exog x has several explanatory variables with entity-specific values, then x would have to be 3d. Or do you regress all y columns on the same x matrix?

Right, I wasn't being specific enough. In general, I have a y DataFrame with index=time and columns=entity and several x DataFrames with the same layout. There are missing values, so in general the panel is unbalanced.

In my company, we're transitioning from pandas ols to statsmodels, and people are having trouble organizing the data the right way. statsmodels has the functionality we need, so I don't think we want to wrap that and hide it from the researcher. Instead, what I think we need is a pre-formatting step that takes the y and (multiple) x DataFrames, stacks them correctly, and returns a DataFrame with the stacked data along with two series containing the time and entity labels. This would allow us to do clustering easily.
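A pre-formatting step like that might look as follows. `stack_panel` is a hypothetical helper name (not a statsmodels or pandas API), and the data here is made up for illustration:

```python
import numpy as np
import pandas as pd

def stack_panel(y_wide, **x_wides):
    # Hypothetical helper: stack wide (time x entity) frames into one long
    # frame, aligning y and every regressor on (time, entity), dropping rows
    # with missing values (unbalanced panels), and exposing plain 'time' and
    # 'entity' columns that can be passed as cluster groups.
    cols = {"y": y_wide.stack()}
    cols.update({name: xw.stack() for name, xw in x_wides.items()})
    long = pd.DataFrame(cols).dropna().reset_index()
    long.columns = ["time", "entity", "y"] + list(x_wides)
    return long

# Made-up example: index = time, columns = entities, same layout for y and x1.
rng = np.random.default_rng(0)
entities = ["A", "B", "C"]
y_wide = pd.DataFrame(rng.standard_normal((4, 3)), columns=entities)
x1_wide = pd.DataFrame(rng.standard_normal((4, 3)), columns=entities)

panel = stack_panel(y_wide, x1=x1_wide)
# panel["time"] / panel["entity"] can then be used for clustering.
```

From here, `panel` is the long format that the cluster-robust cov_types expect, without hand-building a multi-level index.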

For fixed effects, I'm not sure if one should demean before calling statsmodels and then correct for the df, or if one should use a design matrix and let statsmodels handle it. The former seems more error-prone, and it would be best to have statsmodels handle everything.

I'll try to look at formulas and patsy (haven't used these yet). Let me know if you have any links to examples along these lines.

Thanks!




