Choice of Covariance Estimator in OLS


Kevin Sheppard

unread,
Oct 27, 2013, 6:29:35 PM10/27/13
to pystat...@googlegroups.com
I stumbled across


which discussed using other covariance estimators in place of the homoskedastic estimator. I dug through RegressionResults and OLSResults but couldn't see any method to set the "default" covariance estimator.

Thanks,
Kevin

josef...@gmail.com

unread,
Oct 27, 2013, 7:55:40 PM10/27/13
to pystatsmodels
Hi Kevin,

That's unfortunately still not possible, mainly because we couldn't
decide where or how to set it. Reusing the same results instance is
messy. Also: should we switch to using the normal distribution for the
asymptotic variance, or keep the t distribution after OLS or WLS/GLS?

In my Generalized Method of Moments cleanup, I now set all the
covariance options in the fit method.
I also want to see whether we can cleanly clone a results instance
that can then be reset with a different default covariance, so we
don't need to re-estimate when switching sandwiches.

Josef


Kevin Sheppard

unread,
Oct 28, 2013, 4:45:40 AM10/28/13
to pystat...@googlegroups.com
I thought about 3 locations for it:

1. Model, as an option. This seems a bit strange since it isn't really part of the model, but it is similar to some statistical software packages.
2. fit(), which I think is the correct place. fit() has the same meaning as an estimate() method would have, and since the covariances are estimated, it makes sense that the method most similar to estimate() should do the job.
3. OLSResults (or RegressionResults) as a setter method, which would allow the covariance to be set and everything else to be updated.

I had a go at generating my own results by calling OLSResults directly, but the choice to use normalized_cov_params makes this difficult, since a HAC estimator needs to be re-scaled to produce correct output in, e.g., summary().

As for t/normal, I probably have a weak preference for the normal, since the t is only justified in a very small number of cases (small = 1). Certainly, if using a HAC covariance estimator, the normal is the correct choice. That said, it makes little difference. Ideally, this would also be settable for OLS/WLS (but not other models, since those do not have the justification for Student's t).
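For a sense of how small that difference is at typical sample sizes, here is a quick check with scipy (an illustrative sketch, not statsmodels API; the t statistic 2.157 and 199 residual degrees of freedom are taken from the OLS example later in the thread):

```python
from scipy import stats

def pvalue(tstat, df=None):
    """Two-sided p-value: Student's t when df is given, else normal."""
    if df is not None:
        return 2 * stats.t.sf(abs(tstat), df)
    return 2 * stats.norm.sf(abs(tstat))

# t statistic 2.157 with 199 residual degrees of freedom:
# the two reference distributions barely differ.
p_t = pvalue(2.157, df=199)     # about 0.032
p_norm = pvalue(2.157)          # about 0.031
```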

Kevin

Kevin Sheppard

unread,
Oct 28, 2013, 7:26:19 AM10/28/13
to pystat...@googlegroups.com
A bit more digging in RegressionModel really indicates that fit() is the place, at least if the docstring is to be honored:

"The results include an estimate of covariance matrix, (whitened) residuals and an estimate of scale."

Putting it here shouldn't cause any issues for WLS or OLS, since the same covariance estimators can be used for both.

josef...@gmail.com

unread,
Oct 28, 2013, 8:58:10 AM10/28/13
to pystatsmodels
My scepticism about adding it to `fit` itself was because it is not a
`fit` option (except in GMM).
https://github.com/statsmodels/statsmodels/issues/276
Estimating the parameters doesn't care about which covariance will be
used, and covariance options would add arguments to fit that are not
relevant for fitting.
We can also get the results for different robust covariances without
re-estimating the model.

However, fit is also where the results instance is constructed, and
adding it as fit option would be an obvious place for users.
Theoretically, all we would need is a method to create a result
instance with different defaults.
(Currently only `fit` knows how to create a results instance.)

I looked a bit at where covariances are used in the models and results:
We cannot just replace normalized_cov_params without additional
refactoring, since it is also used by the robust covariances
themselves. However, as far as I can see, everything that depends on
the covariance of the parameter estimates goes through `cov_params`,
so we would just need the option there, before the cached attributes
are created.
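Josef's point can be illustrated with a plain-numpy sketch (simulated data and textbook formulas, not statsmodels internals): the nonrobust covariance is `scale * normalized_cov_params`, while a sandwich like HC0 reuses `normalized_cov_params` as the bread, which is why the attribute cannot simply be swapped out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

xtx_inv = np.linalg.inv(X.T @ X)            # normalized_cov_params
params = xtx_inv @ X.T @ y
resid = y - X @ params
scale = resid @ resid / (n - k)

cov_nonrobust = scale * xtx_inv             # what cov_params returns by default
meat = X.T @ (resid[:, None] ** 2 * X)      # sum_i u_i^2 x_i x_i'
cov_hc0 = xtx_inv @ meat @ xtx_inv          # HC0 sandwich: same bread, new meat
```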


Josef

josef...@gmail.com

unread,
Oct 28, 2013, 1:51:06 PM10/28/13
to pystatsmodels
On Mon, Oct 28, 2013 at 4:45 AM, Kevin Sheppard
<kevin.k....@gmail.com> wrote:
> I thought about 3 locations for it:
>
> 1. Model, an an option. This seems a bit strange since it isn't really
> part of the model, but is similar to some statistical software packages
> 2. fit() which I think is the correct place. fit() has the same meaning as
> an estimate() method would have, and since the covariances are estimated, it
> makes sense to that the most similar to estimate() would do the same
> 3. OLSResults (or RegressionResult) as a setter method which would allow the
> covariance to be set and everything else to be updated.
>
> I had a go at generating my own results by calling OLSResult directly but
> the choice to use normalized_covariance makes this difficult since a HAC
> estimator needs to be be re-scaled to produce correct output in, e.g.
> summary().
>
> As for t/normal, I probably have a weak preference for the normal since the
> t is only justified in a very small number of cases (small = 1). Certainly
> if using a HAC covariance estimator then the normal is the correct choice.
> This said, it makes little difference. Ideally, this would also be settable
> for OLS/WLS (but not other models since these do not have the justification
> for the Students's t)

That's another refactoring that is required.
Currently the distribution, t or normal, is hardcoded into the model class.
We had a `use_t` option that we dropped at some point, and that we
need to add again to be able to change the default distribution.

Related:
In various applications and simulation studies I have seen it
mentioned that the t or F distribution has better small-sample
properties than the normal or chi-square, even if we only have the
theory for asymptotic normality.
Similarly for degrees of freedom corrections in small samples: Stata's
estimators vary quite a bit in whether they include them or not.
https://github.com/statsmodels/statsmodels/issues/285


One more issue to adjust in the results for robust standard errors:
fvalue assumes homoscedasticity and needs to be replaced if we have a robust cov.
https://github.com/statsmodels/statsmodels/issues/1104


I'm going to look into this.

In Stata:
`regress` only has HC-robust vce; HAC requires a separate estimation
command (`newey`).
Stata still reports t-distribution based inference after `newey`.
`ivreg2` has more options for vce, but uses the normal and no degrees
of freedom correction by default.
With the `small` option, `ivreg2` replicates `newey` (dof correction
plus t distribution).

Too many options and decisions for defaults? And often I don't know
what we should choose. :)

Example: In my current GMM rewrite I just kept adding options until I
was able to match most cases in Stata
https://github.com/statsmodels/statsmodels/pull/1105/files#diff-9e83f785185b6b0f63b31975ac76bb2dR382

We have unit tests for robust cov in various places, but only for bse
and cov_params.
HAC for panel data and cluster robust covariances require one or
several additional function arguments.

Starting at the end:
It would be a localized and relatively easy change to create a
parameter table or the full summary() for robust covariances.

Sorry if my replies are too long.
I recently opened several new tickets to try to get some structure
into these robust covariances, and I would appreciate any second
opinions on this.

(For example, only recently did I see some articles that spell out the
properties of robust covariance matrices after feasible GLS with
estimated heteroscedasticity, instead of just hints at it.
Currently we have robust covariances only after OLS.)

Josef

josef...@gmail.com

unread,
Oct 28, 2013, 11:39:20 PM10/28/13
to pystatsmodels
I did a bit of work to get the inside wiring set up (creating a new results instance).



So far I'm matching the `small` option of ivreg2 (or `regress, robust` and `newey`).
Lots of things are still missing
(no option to use the normal distribution, HAC has only the Bartlett kernel, no automatic lag selection, ...).
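For reference, the Bartlett-kernel HAC can be sketched in plain numpy (the textbook Newey-West formula, not the statsmodels implementation, and without the small-sample correction):

```python
import numpy as np

def newey_west_cov(X, resid, maxlags):
    # Sandwich: bread = (X'X/n)^{-1}; meat = long-run variance of the
    # scores x_i * u_i with Bartlett weights w_l = 1 - l/(maxlags+1).
    n = X.shape[0]
    u = resid[:, None] * X
    S = u.T @ u / n
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)
        gamma = u[lag:].T @ u[:-lag] / n
        S += w * (gamma + gamma.T)
    bread = np.linalg.inv(X.T @ X / n)
    return bread @ S @ bread / n
```

With `maxlags=0` this collapses to the HC0 sandwich, which is one way to sanity-check the implementation.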

Josef

current example:

----------------
import numpy as np
from statsmodels.regression.linear_model import OLS, GLSAR
from statsmodels.tools.tools import add_constant
from statsmodels.datasets import macrodata
import statsmodels.regression.tests.results.results_macro_ols_robust as res


d2 = macrodata.load().data
g_gdp = 400 * np.diff(np.log(d2['realgdp']))
g_inv = 400 * np.diff(np.log(d2['realinv']))
exogg = add_constant(np.c_[g_gdp, d2['realint'][:-1]], prepend=False)
res_olsg = OLS(g_inv, exogg).fit()


print(res_olsg.summary())
res_hc1 = res_olsg.get_robustcov_results('HC1')
print('\n\n')
print(res_hc1.summary())
print('\n\n')
res_hac4 = res_olsg.get_robustcov_results('HAC', maxlags=4, use_correction=True)
print(res_hac4.summary())
---------------


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     208.5
Date:                Mon, 28 Oct 2013   Prob (F-statistic):           1.47e-49
Time:                        23:07:06   Log-Likelihood:                -763.98
No. Observations:                 202   AIC:                             1534.
Df Residuals:                     199   BIC:                             1544.
Df Model:                           2                                        
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             4.3742      0.215     20.374      0.000         3.951     4.798
x2            -0.6140      0.285     -2.157      0.032        -1.175    -0.053
const         -9.4817      1.068     -8.874      0.000       -11.589    -7.375
==============================================================================
Omnibus:                       33.257   Durbin-Watson:                   2.214
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.512
Skew:                          -0.913   Prob(JB):                     6.52e-12
Kurtosis:                       4.668   Cond. No.                         7.02
==============================================================================



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     92.95
Date:                Mon, 28 Oct 2013   Prob (F-statistic):           3.13e-29
Time:                        23:07:06   Log-Likelihood:                -763.98
No. Observations:                 202   AIC:                             1534.
Df Residuals:                     199   BIC:                             1544.
Df Model:                           2                                        
Covariance Type:                  HC1                                        
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             4.3742      0.324     13.519      0.000         3.736     5.012
x2            -0.6140      0.328     -1.873      0.062        -1.260     0.032
const         -9.4817      1.369     -6.926      0.000       -12.181    -6.782
==============================================================================
Omnibus:                       33.257   Durbin-Watson:                   2.214
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.512
Skew:                          -0.913   Prob(JB):                     6.52e-12
Kurtosis:                       4.668   Cond. No.                         7.02
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC1)



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.674
Method:                 Least Squares   F-statistic:                     89.45
Date:                Mon, 28 Oct 2013   Prob (F-statistic):           1.93e-28
Time:                        23:07:06   Log-Likelihood:                -763.98
No. Observations:                 202   AIC:                             1534.
Df Residuals:                     199   BIC:                             1544.
Df Model:                           2                                        
Covariance Type:                  HAC                                        
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             4.3742      0.331     13.205      0.000         3.721     5.027
x2            -0.6140      0.296     -2.076      0.039        -1.197    -0.031
const         -9.4817      1.186     -7.995      0.000       -11.820    -7.143
==============================================================================
Omnibus:                       33.257   Durbin-Watson:                   2.214
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.512
Skew:                          -0.913   Prob(JB):                     6.52e-12
Kurtosis:                       4.668   Cond. No.                         7.02
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 4 lags and with small sample correction

============================

Stata

. ivreg2 g_realinv g_realgdp L.realint, small

OLS estimation
--------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =      202
                                                      F(  2,   199) =   208.53
                                                      Prob > F      =   0.0000
Total (centered) SS     =  70582.33535                Centered R2   =   0.6770
Total (uncentered) SS   =   72725.6805                Uncentered R2 =   0.6865
Residual SS             =  22799.67822                Root MSE      =     10.7

------------------------------------------------------------------------------
   g_realinv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   g_realgdp |   4.374222   .2146941    20.37   0.000     3.950854    4.797589
             |
     realint |
         L1. |   -.613997   .2846681    -2.16   0.032     -1.17535   -.0526439
             |
       _cons |  -9.481673   1.068496    -8.87   0.000     -11.5887   -7.374645
------------------------------------------------------------------------------
Included instruments: g_realgdp L.realint
------------------------------------------------------------------------------

. ivreg2 g_realinv g_realgdp L.realint, robust small

OLS estimation
--------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

                                                      Number of obs =      202
                                                      F(  2,   199) =    92.95
                                                      Prob > F      =   0.0000
Total (centered) SS     =  70582.33535                Centered R2   =   0.6770
Total (uncentered) SS   =   72725.6805                Uncentered R2 =   0.6865
Residual SS             =  22799.67822                Root MSE      =     10.7

------------------------------------------------------------------------------
             |               Robust
   g_realinv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   g_realgdp |   4.374222   .3235545    13.52   0.000     3.736186    5.012257
             |
     realint |
         L1. |   -.613997   .3277284    -1.87   0.062    -1.260263    .0322692
             |
       _cons |  -9.481673   1.369059    -6.93   0.000     -12.1814   -6.781947
------------------------------------------------------------------------------
Included instruments: g_realgdp L.realint
------------------------------------------------------------------------------

. ivreg2 g_realinv g_realgdp L.realint,  robust bw(5) small

OLS estimation
--------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and autocorrelation
  kernel=Bartlett; bandwidth=     5
  time variable (t):  qu

                                                      Number of obs =      202
                                                      F(  2,   199) =    89.45
                                                      Prob > F      =   0.0000
Total (centered) SS     =  70582.33535                Centered R2   =   0.6770
Total (uncentered) SS   =   72725.6805                Uncentered R2 =   0.6865
Residual SS             =  22799.67822                Root MSE      =     10.7

------------------------------------------------------------------------------
             |               Robust
   g_realinv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   g_realgdp |   4.374222   .3312564    13.20   0.000     3.720998    5.027445
             |
     realint |
         L1. |   -.613997   .2958235    -2.08   0.039    -1.197348   -.0306459
             |
       _cons |  -9.481673   1.185934    -8.00   0.000    -11.82028   -7.143063
------------------------------------------------------------------------------
Included instruments: g_realgdp L.realint
------------------------------------------------------------------------------

.



Kevin Sheppard

unread,
Oct 29, 2013, 8:00:11 AM10/29/13
to pystat...@googlegroups.com
I can see the logic of this method, and it avoids shoving a bunch of args with questionable relevance into fit(). OTOH, it requires two lines or chained calls to get robust results, where many might prefer a single call.

I suppose get_robustcov_results should pass any kwargs through to the covariance estimator; I can imagine a few scenarios for this:

* Bandwidth parameters for long-run variance estimators (Bartlett/NW, DenHaan-Levin, Andrews PW HAC), which may take different configurations.
* Variables to use when computing clustered standard errors

I suppose changing MLE to QMLE standard errors in ML models could be implemented similarly.


Skipper Seabold

unread,
Oct 29, 2013, 8:49:59 AM10/29/13
to pystat...@googlegroups.com
Yes, this was a recent use case I had / will have with some upcoming PRs, and I thought the solutions should be roughly the same.

Skipper

josef...@gmail.com

unread,
Oct 29, 2013, 9:27:33 AM10/29/13
to pystatsmodels
On Tue, Oct 29, 2013 at 8:49 AM, Skipper Seabold <jsse...@gmail.com> wrote:
On Tue, Oct 29, 2013 at 12:00 PM, Kevin Sheppard <kevin.k....@gmail.com> wrote:
I can see the logic of this method, and it avoids shoving a bunch of args with questionable relevance into fit().  OTOH, it requires 2 lines or chained commands to get robust results, which many might prefer.

That's still open for change. At least, it would be more logical to set the default covariance type in the __init__ of the results.
 

I suppose get_robustcov_results should pass any kwargs into the Covariance estimator - I could imagine a few scenarios for this

* Bandwidth parameters for long-run variance estimators (Bartlett/NW, DenHaan-Levin, Andrews PW HAC), which may take different configurations.
* Variables to use when computing clustered standard errors

Yes, that's the idea. These are the sandwiches that are currently available after OLS but not connected. 

HAC has 3 arguments

I also have wild bootstrap after HC0 or HC1 in planning, which seems to be currently the best small sample sandwich for inference. (I know how to do it, but I'm still vague on how to integrate it.)

 

I suppose changing MLE to QMLE standard errors in ML models could be similarly implemented.

Yes, this was a recent use case I had / will have with some upcoming PRs, and I thought the solutions should be roughly the same.

Once we have the pattern for the robust covariances, we can add it in many places.

One problem with integrating this into existing models that already have many results is that we should check which additional results are still valid under unspecified heteroscedasticity or correlation.

I got skeptical after seeing that fvalue is wrong with robust covariances, and after reading that after IV2SLS we cannot use some of the diagnostic tests (those need adjustments).

Once we have the cov_type flag, we can make the calculations conditional on it, e.g. for fvalue.

However, I have no overview of which results are still valid.
(I didn't dig through all of Stata's postestimation results yet.)


Do you have a good reference for this?

For example, I guess `compare_f` doesn't apply (it compares nested linear models by sum of squares).
Are statistics based on the loglikelihood, llf, like the likelihood ratio test (*), still valid, especially for QMLE?

(*) for example

Josef

 



josef...@gmail.com

unread,
Oct 29, 2013, 11:26:40 AM10/29/13
to pystatsmodels
Asking Stata: after HC0 with ivreg2, "likely invalid", but we can "force" it:

. lrtest A B, stats
LR test likely invalid for models with robust vce
r(498);

. lrtest A B, stats force

Likelihood-ratio test                                 LR chi2(1)  =      4.67
(Assumption: B nested in A)                           Prob > chi2 =    0.0307

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           B |    202           .   -766.3092      2     1536.618    1543.235
           A |    202           .   -763.9752      3      1533.95    1543.875
-----------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note
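The statistic in this output can be recomputed directly from the two reported log-likelihoods (a quick check using LR = 2*(ll_A - ll_B) with a chi2(1) reference; values copied from the Stata table above):

```python
from scipy import stats

ll_B, ll_A = -766.3092, -763.9752   # from the Stata table above
lr = 2 * (ll_A - ll_B)              # LR chi2(1) = 4.668, reported as 4.67
p = stats.chi2.sf(lr, df=1)         # Prob > chi2, about 0.0307
```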

Josef
 





Kevin Sheppard

unread,
Oct 30, 2013, 6:37:27 AM10/30/13
to pystat...@googlegroups.com
Likelihood ratio tests are generally only chi2(m) when the information matrix equality holds, in particular when A^-1*B = I. This always holds for MLE and pretty much never for QMLE, with the notable exception of LS estimators when the errors are homoskedastic. In general, LR tests have a mixture-of-chi2 distribution, where the weights are the first m eigenvalues of an appropriately rotated version of A^-1*B; again, when the model is MLE, these eigenvalues are all 1, and the mixture of m chi2(1)'s is just a chi2(m).
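The mixture result can be illustrated with a small simulation (the QMLE weights below are hypothetical; under the information matrix equality all eigenvalues are 1 and the chi2(m) reference is recovered):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 2
z2 = rng.standard_normal((100_000, m)) ** 2   # independent chi2(1) draws

w_mle = np.ones(m)                  # A^-1*B = I: all eigenvalues equal 1
w_qmle = np.array([1.4, 0.7])       # hypothetical misspecified case

lr_mle = z2 @ w_mle                 # exactly chi2(2)
lr_qmle = z2 @ w_qmle               # weighted chi2 mixture, not chi2(2)
# E[lr_mle] = 2 (the df), while E[lr_qmle] = 1.4 + 0.7 = 2.1, so the
# chi2(2) critical values are wrong for the QMLE case.
```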

As I mentioned in the other post, providing alternative frameworks which are valid under either MLE or QMLE, including LM tests, Wald tests (using an appropriate covariance), or the large-sample LR tests (which are first cousins to LM tests), is probably better.

Thomas Powers

unread,
Feb 12, 2014, 9:09:46 PM2/12/14
to pystat...@googlegroups.com
Hello folks,

I was wondering whether you eventually decided how to implement selection of the cov_type in linear regression models. It seems like one can use the get_robustcov_results() method to switch from one cov_type to another once the results object has been created, but it would be a lot simpler if one could simply set the cov_type before constructing the results object (as mentioned above).

Although there are good reasons, outlined above, for having the choice of cov_type be part of the fit() function, it seems like it might also work well as an argument to the OLS class __init__(), because the correct VCE to use depends completely on the parametric model for the error term.

Thanks!

josef...@gmail.com

unread,
Feb 12, 2014, 10:31:14 PM2/12/14
to pystatsmodels
On Wed, Feb 12, 2014 at 9:09 PM, Thomas Powers <tpow...@gmail.com> wrote:
> Hello folks,
>
> I was wondering whether you eventually decided how to implement selection of
> the cov_type in linear regression models. It seems like one can use the
> get_robustcov_result() function to switch from one cov_type to another once
> the result object has been created, but it would be a lot simpler if one
> could simply set the cov_type before constructing the result object (as
> mentioned above).
>
> Although there are good reasons for having the choice of cov_type be part of
> the fit() function outlined above, it seems like it might work well as an
> argument to the OLS() class init(), because the correct VCE to use
> completely depends on the parametric model for the error term.

I convinced myself that it should go into the fit method.
That's the logical choice, since that is where the Results instance is
created. And some current models like RLM, QuantReg and GEE already
have it in the fit method.

In terms of computation, a results method like
get_robustcov_results() would be best, because it requires minimal
recalculation: no new fit to change the cov_type. But it's
inconvenient when we only want one cov_type.
(Creating a new Results instance in get_robustcov_results() also
requires that we have all the information to create a new results
instance that replicates the current one except for cov_type, which
we are currently not set up for in some models.)

I haven't thought much about putting it into the Model.__init__
because it seemed too early to me. If we want to compare the standard
errors for non-robust and different sandwich covariances, then there
is no real reason to get a new model and fit it again. Also several
results instances can be consistent with the same underlying model,
without having to adjust cov_type specific attributes of the model.

However, one advantage of the Model.__init__ would be the automatic
data handling of extra arrays that are needed for some robust sandwich
covariances, especially panel and cluster robust standard errors.
(Pointed out to me when Kerby moved some things into GEE.__init__ for
this reason.)
The main missing piece right now is how to do missing-value handling
if an additional data array is given after Model.__init__. On the
other hand, we also need to make adjustments to the data handling,
for example in OLS, which doesn't take any additional data arrays.
Currently, get_robustcov_results() assumes that the additional arrays
already have the correct shape if rows with missing values were
removed.

Another option, which depends on how other models like panel data are
implemented, is to use the original data information from
Model.__init__, even if cov_type and its parameters are specified in
the fit method.
This is currently the case in GEE, and will be in Panel, where the
time and group/cluster/panel index is specified already in
Model.__init__ but there is still a choice of cov_type in the fit
method.
Plain OLS, however, won't require any indices.

So for now I think that the fit method is the best location, both from
the logical structure and in terms of how easy it is to implement.

>... because the correct VCE to use
> completely depends on the parametric model for the error term.

My main approach is pretty pragmatic. We once had the discussion
whether the Model definition includes y/endog or not, and whether it
should be in Model.__init__ or in Model.fit.
The way I see it right now: if you have a parametric form for the
error term (iid, autoregressive, or heteroscedastic), then we need
the appropriate model, i.e. OLS, WLS, GLSAR, GLSHet, or
statsmodels.tsa.
If we want to be robust to a misspecified covariance structure of the
error term, then this is part of inference, which belongs in the
Results, not in the Model that estimates the main parameters.


As an aside: one reason I haven't already pushed for more convenient
support of sandwiches in the linear models is that we want to have
the same structure across all models.

Josef

Thomas Powers

unread,
Feb 18, 2014, 12:18:36 PM2/18/14
to pystat...@googlegroups.com
Thanks Josef, sounds good!

Best,
Tom Powers

Skipper Seabold

unread,
Feb 18, 2014, 12:42:33 PM2/18/14
to pystat...@googlegroups.com
On Wed, Feb 12, 2014 at 10:31 PM, <josef...@gmail.com> wrote:
On Wed, Feb 12, 2014 at 9:09 PM, Thomas Powers <tpow...@gmail.com> wrote:
<snip> 
However, one advantage of the Model.__init__ would be the automatic
data handling of extra arrays that are needed for some robust sandwich
covariances, especially panel and cluster robust standard errors.
(Pointed out to me when Kerby moved some things into GEE.__init__ for
this reason.)

Can you explain a bit more? What extra arrays might be needed? Some weights? The clustering variable?

Right now we only handle missing data values during model instantiation, and I think this is the right way to do it. If there are missing values later, we raise. See below.

 
The main missing piece right now is how to do missing value handling,
if an additional data array is given later than the Model.__init__. On
the other hand, we also need to make adjustments to the data handling
for example in OLS that doesn't take any additional data arrays.
Currently, get_robustcov_result() assumes that we have the additional
arrays in the correct shape if there were rows with missing values
removed.


If it's really conceptually cleaner to pass a data-conformable array to results or fit, we could have a missing-value handler as a transform attached to model.data. Any new arrays that come in later could then go through (for a results class)

new_array = self.model.data.missing_transform(new_array)

If there's any missing data here outside of what was originally missing, then it raises. This follows our reasoning for moving order to model instantiation in ARMA: we moved order to __init__ because it can change the endog/exog dimensions. Here, we wouldn't want to allow extra arrays to mess with endog/exog (in fit) or require re-estimation (in results).

This blurs the line a bit between data and models (we'll need to carry an index or use pandas even more centrally).
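A minimal sketch of the transform idea, assuming the handler only needs to remember the row mask (the class name and details here are hypothetical, not statsmodels API):

```python
import numpy as np

class MissingTransform:
    # Hypothetical sketch of the proposed model.data.missing_transform:
    # remember which rows survived the original missing-value handling,
    # apply the same row mask to arrays supplied later, and raise if a
    # later array has missing values outside the originally dropped rows.
    def __init__(self, original):
        self.keep = ~np.isnan(np.asarray(original, dtype=float)).any(axis=1)

    def __call__(self, new_array):
        reduced = np.asarray(new_array, dtype=float)[self.keep]
        if np.isnan(reduced).any():
            raise ValueError("new array has missing values in retained rows")
        return reduced
```

A results method could then call the transform on, say, a groups array before using it.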

josef...@gmail.com

unread,
Feb 18, 2014, 1:22:27 PM2/18/14
to pystatsmodels
On Tue, Feb 18, 2014 at 12:42 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Wed, Feb 12, 2014 at 10:31 PM, <josef...@gmail.com> wrote:
>>
>> On Wed, Feb 12, 2014 at 9:09 PM, Thomas Powers <tpow...@gmail.com> wrote:
>
> <snip>
>>
>> However, one advantage of the Model.__init__ would be the automatic
>> data handling of extra arrays that are needed for some robust sandwich
>> covariances, especially panel and cluster robust standard errors.
>> (Pointed out to me when Kerby moved some things into GEE.__init__ for
>> this reason.)
>
>
> Can you explain a bit more? What extra arrays might be needed? Some weights?
> The clustering variable?

The clustering and the time variables.
For example, after pooled OLS we can cluster by individuals (within
correlation), by time (contemporaneous correlation), or by both.
Panel HAC assumes stacked time series for each individual, and we
need the "group" index.
Plain HAC assumes the data are sorted by time (with equally spaced
time intervals) and doesn't use any array information.

Kernel-based HAC on unequally spaced periods (spatial distance) is
currently not available.
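The group index is all the one-way cluster sandwich needs; a plain-numpy sketch of cluster-robust covariance after pooled OLS (the textbook formula without small-sample corrections, not the statsmodels implementation):

```python
import numpy as np

def cluster_cov(X, resid, groups):
    # Sum the scores x_i * u_i within each cluster before forming the
    # meat, which allows arbitrary correlation within clusters.
    xtx_inv = np.linalg.inv(X.T @ X)
    u = resid[:, None] * X
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        s = u[groups == g].sum(axis=0)
        meat += np.outer(s, s)
    return xtx_inv @ meat @ xtx_inv
```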

>
> Right now we only handle missing data value during model instantiation, and
> I think this is the right way to do it. If there's missing values later, we
> raise. See below.
>
>
>>
>> The main missing piece right now is how to do missing value handling,
>> if an additional data array is given later than the Model.__init__. On
>> the other hand, we also need to make adjustments to the data handling
>> for example in OLS that doesn't take any additional data arrays.
>> Currently, get_robustcov_result() assumes that we have the additional
>> arrays in the correct shape if there were rows with missing values
>> removed.
>
>
>
> If it's really conceptually cleaner to pass a data-conformable array to
> results or fit, we could have a missing-value handler as a transform that's
> attached to model.data. Any new arrays that come in later could then go (for
> a results class)
>
> new_array = self.model.data.missing_transform(new_array)
>
> If there's any missing data here outside of what was originally missing,
> then it raises. Following our reasoning for moving order to the model
> instantiation in ARMA, we moved order to __init__ because it can change the
> endog/exog dimensions. Here, we wouldn't want to allow extra arrays to mess
> with endog/exog (in fit) or require re-estimation (in results).

Yes, that's about what I was thinking. We need to store the valid
mask and use it as an index into any new arrays.
If the reduced arrays don't have the right shape, we raise an error.

>
> This blurs the line a bit between data and models (we'll need to carry an
> index or use pandas even more centrally).

For dropping/selecting rows, all we need is the boolean mask, not a
pandas index.

The conflict between statsmodels's and patsy's missing value handling
is mostly a separate issue.


Essentially, we split the model for inference from the model for
estimation. The only difference is that the former allows for
unspecified correlation and heteroscedasticity, or for
misspecification relative to the nonrobust/classical/naive inference.

Using bootstrap for inference would be similar to sandwiches as part
of the Results; which kind of bootstrap (pairs, wild, cluster, block)
would also depend on which misspecification we want to robustify
against in our inference. (The cluster bootstrap needs the `groups`
array.)


Josef

josef...@gmail.com

unread,
Feb 20, 2014, 11:05:56 PM2/20/14
to pystatsmodels
A first version:

Two keyword arguments in the fit method:

cov_type: the covariance type as a string
cov_kwds: a dictionary of required or optional keywords used in the
calculation of the sandwiches

Examples:

mod_olsg = OLS(g_inv, exogg)
res_hac4b = mod_olsg.fit(cov_type='HAC',
                         cov_kwds=dict(maxlags=4, use_correction=True))
print(res_hac4b.summary())

res_hc1b = mod_olsg.fit(cov_type='HC1')
print(res_hc1b.summary())

# force t-distribution
res_hc1c = mod_olsg.fit(cov_type='HC1', cov_kwds={'use_t': True})
print(res_hc1c.summary())

Another possibility is to separate use_t from the cov_kwds, like

res_hc1c = mod_olsg.fit(cov_type='HC1', use_t=True)

Then fit would use three new keywords instead of two.
In most of the other code I keep `use_t` separate from the cov_type,
which I like better.

2-group cluster (I didn't try this yet):

res_c = mod_olsg.fit(cov_type='cluster',
                     cov_kwds=dict(groups=(groups, time),
                                   use_correction=True),
                     use_t=True)

Josef