On Wed, Feb 12, 2014 at 9:09 PM, Thomas Powers <
tpow...@gmail.com> wrote:
> Hello folks,
>
> I was wondering whether you eventually decided how to implement selection of
> the cov_type in linear regression models. It seems like one can use the
> get_robustcov_result() function to switch from one cov_type to another once
> the result object has been created, but it would be a lot simpler if one
> could simply set the cov_type before constructing the result object (as
> mentioned above).
>
> Although there are good reasons for having the choice of cov_type be part of
> the fit() function outlined above, it seems like it might work well as an
> argument to the OLS() class init(), because the correct VCE to use
> completely depends on the parametric model for the error term.
I convinced myself that it should go into the fit method. That is the
logical place, since it is where the Results instance is created, and
some current models like RLM, QuantReg, and GEE already have it in
their fit method.
In terms of computation, a results method like
get_robustcov_result() would be best, because it requires minimal
recalculation: no new fit is needed to change the cov_type. But it's
inconvenient when we only want a single cov_type.
(Creating a new Results instance in get_robustcov_result() also
requires that we have all the information needed to create a new
results instance that replicates the current one except for the
cov_type, which some models are currently not set up for.)
I haven't thought much about putting it into Model.__init__
because that seemed too early to me. If we want to compare the
standard errors for the non-robust and various sandwich covariances,
then there is no real reason to create a new model and fit it again.
Also, several results instances can be consistent with the same
underlying model, without having to adjust cov_type-specific
attributes of the model.
However, one advantage of Model.__init__ would be the automatic
data handling of the extra arrays that are needed for some robust
sandwich covariances, especially panel- and cluster-robust standard
errors. (This was pointed out to me when Kerby moved some things into
GEE.__init__ for that reason.)
The main missing piece right now is how to do missing-value handling
if an additional data array is provided later than Model.__init__. On
the other hand, we would also need to adjust the data handling in
models like OLS that don't take any additional data arrays.
Currently, get_robustcov_result() assumes that the additional arrays
already have the correct shape if rows with missing values were
removed.
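A plain numpy sketch of the alignment problem, using made-up toy arrays: any row dropped from endog/exog because of missing values has to be dropped from the extra groups array as well, or the shapes no longer match:

```python
import numpy as np

# toy data: y, X, and a groups index for cluster-robust standard errors
y = np.array([1.0, np.nan, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5),
                     np.array([0.1, 0.2, np.nan, 0.4, 0.5])])
groups = np.array([0, 0, 1, 1, 2])

# rows with any missing value in y or X must also be dropped from
# groups, so the extra array stays aligned with the estimation sample
keep = ~(np.isnan(y) | np.isnan(X).any(axis=1))
y, X, groups = y[keep], X[keep], groups[keep]
```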
Another option, which depends on how other models like panel data are
implemented, is to use the original data information from
Model.__init__, even if the cov_type and its parameters are specified
in the fit method.
This is currently the case in GEE, and will be in Panel, where the
time and group/cluster/panel index are already specified in
Model.__init__, but there is still a choice of cov_type in the fit
method. Plain OLS, however, won't require any indices.
So for now I think the fit method is the best location, both in terms
of the logical structure and of how easy it is to implement.
>... because the correct VCE to use
> completely depends on the parametric model for the error term.
My main approach is pretty pragmatic. We once had the discussion of
whether the Model definition should include y/endog or not, and
whether it should go in Model.__init__ or in Model.fit.
The way I see it right now is this: if you have a parametric form for
the error term (iid, autoregressive, or heteroscedastic), then we
need the appropriate model: OLS, WLS, GLSAR, GLSHet, or one from
statsmodels.tsa.
If we want to be robust to a misspecified covariance structure of the
error term, then that is part of inference, which belongs in the
Results, not in the Model that estimates the main parameters.
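To make that split concrete, here is a hand-rolled numpy sketch (not statsmodels code): the sandwich only changes the covariance used for inference, while the point estimates from the estimation step stay the same:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 0.5, -0.5])
y = X @ beta + rng.standard_normal(n)

# estimation step: OLS point estimates, independent of cov_type
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b

# non-robust (iid) covariance: sigma^2 * (X'X)^{-1}
cov_iid = resid @ resid / (n - k) * XtX_inv
# HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = (X * resid[:, None] ** 2).T @ X
cov_hc0 = XtX_inv @ meat @ XtX_inv
```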
As an aside: one reason I haven't already pushed for more convenient
support of sandwiches in the linear models is that we want to have
the same structure across all models.
Josef