cov_type = 'HC0' lowers z-scores


Peter Quackenbush

Nov 5, 2016, 1:33:59 PM
to pystatsmodels
Hello,

I'm trying to get my head around some issues I've run across with cov_type = 'HC0'. 

I'm fitting a GLM with a Tweedie distribution. The dataset is insurance data, so the response reflects incurred losses from a set of auto insurance policies. The params from the fit are reasonable... things I've seen before that don't bother me. When I use the default cov_type ('nonrobust'), the z-scores appear reasonable as well.

I tried fitting my models with cov_type='HC0', and most of them show slightly different z-scores, but nothing really concerning.
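Roughly what I'm doing, for reference (the data names and var_power here are placeholders, not my actual setup):

import statsmodels.api as sm

# Tweedie with the default log link; var_power is just an illustrative value
family = sm.families.Tweedie(var_power=1.5)
model = sm.GLM(endog, exog, family=family)    # endog/exog stand in for my real data

res_nonrobust = model.fit()                   # default cov_type='nonrobust'
res_hc0 = model.fit(cov_type='HC0')           # sandwich covariance

print(res_nonrobust.params - res_hc0.params)  # identical params
print(res_nonrobust.bse / res_hc0.bse)        # only the standard errors / z-scores differ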

However, one particular model didn't work out this way. The params are unaffected, which I think is the expected behavior, but the z-scores are wildly different (much lower, resulting in much higher P>|z| values). This model covers a much more volatile line of business--"bodily injury" payments. These are medical payments, hospital treatments, rehab treatment, payments for pain and suffering, lawyer bills--the whole variety of things people win in auto accident lawsuits and settlements.

Bodily injury payments are generally much higher--and have much greater variance--than those in the other models, which generally deal with damage to property, other automobiles, and stolen/vandalized vehicles.

Off the top of my head, I don't see anything in my dataset that should obviously cause a lot of heteroskedasticity. Honestly, though, I've never used a software tool that offered an 'HC0' covariance type before.

So I'm asking a vague question here... but how would you approach this problem? Or is it a problem? 

josef...@gmail.com

Nov 5, 2016, 2:11:18 PM
to pystatsmodels
In general, there are a handful of reasons I can think of why HC0 could differ a lot from the "nonrobust" cov_params in GLM:

1) a bug: not ruled out, given that the combination of cov_type, weights, and Tweedie is relatively recent code. But since your other cases look good, it's not very likely.

2) dispersion: in models with a fixed scale, HC0 captures over- or under-dispersion. This can have a huge effect, for example in Poisson, but it shouldn't affect Tweedie if the scale is correctly estimated.

3) a minor issue, I think: the default IRLS fit uses the expected information matrix, while HC0 also uses the Hessian. There can be numerical differences even if the model is correctly specified.

4) small-sample problems in the HC calculation: in some cases the small-sample behavior of sandwich estimators is not very good.
   That should not be a problem with a well-posed (not near-singular) model and a large sample.

5) the main point of HC is heteroscedasticity, i.e. the scale/dispersion fluctuates systematically with the explanatory variables. This could be the case if some of your subgroups have much larger dispersion than others.
A quick, simple check:
run an OLS of the squared Pearson residuals on the exog of the model and check the size and significance of the parameter estimates. (Other residuals could be used similarly.)
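Something along these lines, assuming `model` is the GLM instance and `res` its fitted results:

import statsmodels.api as sm

# auxiliary regression of the squared Pearson residuals on the model's exog
aux = sm.OLS(res.resid_pearson**2, model.exog).fit()
print(aux.rsquared)     # how much of the squared-residual variation the exog explain
print(aux.summary())    # look for large, highly significant coefficients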

I would check 5) first because it's simple and a likely candidate.

Josef

josef...@gmail.com

Nov 5, 2016, 2:17:50 PM
to pystatsmodels
On Sat, Nov 5, 2016 at 2:11 PM, <josef...@gmail.com> wrote:

3) a minor issue, I think: the default IRLS fit uses the expected information matrix, while HC0 also uses the Hessian. There can be numerical differences even if the model is correctly specified.

To check Hessian versus expected information matrix:
AFAIR, these are the defaults of the two estimation methods: IRLS nonrobust uses the expected information matrix, while the scipy optimizers' nonrobust cov_params default to the inverse Hessian.
So comparing the results of the two optimization methods should produce the same params but slightly different cov_params if a non-canonical link is used.
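i.e. something like this (illustrative fit arguments, and it assumes the gradient-based path is available for the family you are using):

res_irls = model.fit()                                  # IRLS, expected information matrix
res_grad = model.fit(method='bfgs', max_start_irls=0)   # scipy optimizer, inverse Hessian

print(res_irls.params - res_grad.params)   # should be essentially zero
print(res_irls.bse / res_grad.bse)         # can differ slightly with a non-canonical link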

Peter Quackenbush

Nov 5, 2016, 8:45:41 PM
to pystatsmodels
So running sm.OLS(res.resid_params, model.exog).fit() doesn't really return anything *too* weird. 

There are a few factors with high params (the biggest was a little over 1). These factors tend to be very rare features. For example, a driver with a probationary license (called an SR-22) has a value around 1.2. When I removed that factor, the next highest was 0.3, but the z-scores were still poor.

I tried again to fit the model using method='bfgs', method='lbfgs', and method='newton'. Both 'bfgs' and 'lbfgs' failed with these error messages:

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\base\model.py:473: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
  'available', HessianInversionWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-70c747b28c1c> in <module>()
     77 del bx
     78 
---> 79 res = model.fit(rtol=1e-2, tol_criterion='params', maxiter=20, method='lbfgs') # HC0 causes really high P>|z|

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\genmod\generalized_linear_model.py in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, full_output, disp, max_start_irls, **kwargs)
   1000                                       cov_kwds=cov_kwds, use_t=use_t,
   1001                                       max_start_irls=max_start_irls,
-> 1002                                       **kwargs)
   1003 
   1004     def _fit_gradient(self, start_params=None, method="newton",

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\genmod\generalized_linear_model.py in _fit_gradient(self, start_params, method, maxiter, tol, full_output, disp, scale, cov_type, cov_kwds, use_t, max_start_irls, **kwargs)
   1026 
   1027         glm_results = GLMResults(self, rslt.params,
-> 1028                                  rslt.normalized_cov_params / scale,
   1029                                  scale,
   1030                                  cov_type=cov_type, cov_kwds=cov_kwds,

TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'

'newton' didn't error out, but the cov_params were completely nonsensical: lots of NaNs and extremely large/small values.

I've never been able to get the scipy optimizers to work on my datasets. Perhaps that's a bug? 

josef...@gmail.com

Nov 5, 2016, 9:43:19 PM
to pystatsmodels
On Sat, Nov 5, 2016 at 8:45 PM, Peter Quackenbush <pqu...@gmail.com> wrote:
So running sm.OLS(res.resid_params, model.exog).fit() doesn't really return anything *too* weird. 

resid_pearson**2 as endog?

There are a few factors with high params (the biggest was a little over 1). These factors tend to be very rare features. For example, a driver with a probationary license (called an SR-22) has a value around 1.2. When I removed that factor, the next highest was 0.3, but the z-scores were still poor.

How large is the R-squared of that OLS regression? Are some of the coefficients highly significant?

Somebody, I don't remember who, wrote that he doesn't like sandwich robust covariances being used to cover up misspecification, but that they are a useful tool for checking whether there are problems.
I think using robust covariances is appropriate, but it's better if there is a known reason for what might be causing the problems and the misspecification.

Given that I don't have access to the proprietary data, here are a few more things you could try:

To see whether it's specific to Tweedie: compare with GLM Poisson and Gamma, both with a log link.
BTW: what's the Tweedie power coefficient in this case?
For Poisson, the ratio of HC0 standard errors to nonrobust standard errors should be approximately the square root of the Pearson chi-square dispersion (i.e. the overdispersion).
Gamma has the dispersion built in and is just a different Tweedie power.
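A rough sketch of that comparison (the link spelling can differ across statsmodels versions, and endog/exog are your data as before):

import numpy as np
import statsmodels.api as sm

for family in (sm.families.Poisson(sm.families.links.log()),
               sm.families.Gamma(sm.families.links.log())):
    m = sm.GLM(endog, exog, family=family)
    r_nr = m.fit()
    r_hc = m.fit(cov_type='HC0')
    dispersion = r_nr.pearson_chi2 / r_nr.df_resid   # Pearson chi2 based dispersion
    print(family.__class__.__name__,
          np.sqrt(dispersion),                       # for Poisson these two numbers
          (r_hc.bse / r_nr.bse).mean())              #   should be roughly comparable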

Scaling: you could try scaling down the endog if the payments are very large, e.g. so that endog is roughly in the 1 to 100 range, say by expressing it in units of $100,000.

Problems with the Hessian: check the eigenvalues of the Hessian (or inverse Hessian). There could still be numerical problems with categorical explanatory variables that have only a few observations, which can make the Hessian inversion inaccurate.
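For example (this looks at the covariance of the estimates, i.e. the scaled inverse of the information matrix, rather than at the Hessian directly, plus the conditioning of the design matrix):

import numpy as np

eigvals = np.linalg.eigvalsh(np.asarray(res.cov_params()))
print(eigvals.min(), eigvals.max(), eigvals.max() / eigvals.min())  # huge spread -> near singular
print(np.linalg.cond(model.exog))                                   # condition number of exog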

Too much heterogeneity in the sample:
drop observations that might be dominated by a characteristic that affects only a small subsample, like the SR-22;
drop the 10% most expensive cases (largest endog) to see whether there are outlier effects;
try a subset that looks fairly homogeneous, e.g. cases where only one person was injured.
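Purely as an illustration of the kind of subsetting I mean (the DataFrame and column names here are made up; substitute your own variables and family):

import statsmodels.api as sm

# drop the rare SR-22 cases and the top 10% largest losses, then refit
mask = (df['sr22_flag'] == 0) & (df['incurred_loss'] < df['incurred_loss'].quantile(0.90))
sub_model = sm.GLM(df.loc[mask, 'incurred_loss'], exog[mask.values], family=family)
sub_res = sub_model.fit(cov_type='HC0')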


In the style of a sensitivity analysis:
the effective parameters (the effect in dollars) should be relatively stable across models if the exponential mean function is correct.
The dispersion/standard errors could fluctuate a lot if there is a large amount of heteroscedasticity, which would be captured by HC0.

Josef

I tried again to fit the model using method='bfgs', method='lbfgs', and method='newton'. Both 'bfgs' and 'lbfgs' failed with these error messages:

HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'

That looks like a bug when `normalized_cov_params` is not available. Maybe normalized_cov_params is None when the Hessian cannot be calculated or inverted. That shouldn't prevent the creation of the GLMResults instance, even though some results won't be available.


'newton' didn't error out, but the cov_params were completely nonsensical: lots of NaNs and extremely large/small values.

I've never been able to get the scipy optimizers to work on my datasets. Perhaps that's a bug?

Sorry, I forgot that this is Tweedie. AFAIR we haven't set up an objective function for the scipy optimizers for Tweedie yet, have we?