On Sat, Nov 5, 2016 at 1:17 PM, Peter Quackenbush <pqu...@gmail.com> wrote:

> Hello,
>
> I'm trying to get my head around some issues I've run across with cov_type='HC0'.
>
> I'm fitting a GLM with a Tweedie distribution. The dataset is insurance data, so the response reflects incurred losses from a set of auto insurance policies. The params from the fit are reasonable: things I've seen before that don't bother me. When I use the default cov_type ('nonrobust'), the z-scores appear reasonable as well.
>
> I tried fitting my models with cov_type='HC0', and most of the models show slightly different z-scores, but nothing really concerning.
>
> However, one particular model didn't work out this way. The params are unaffected, which I think is the expected behavior, but the z-scores are wildly different (much lower, resulting in much higher P>|z| values).
>
> This model covers a much more volatile line of business: "bodily injury" payments. These are medical payments, hospital treatments, rehab treatment, payments for pain and suffering, lawyer bills, the whole variety of things people win in auto accident lawsuits and settlements. Bodily injury payments are generally much higher, and have much greater variance, than the other models, which generally deal with damage to property, other automobiles, and stolen/vandalized vehicles.
>
> Off the top of my head, I don't see much in my dataset that should obviously cause lots of heteroskedasticity. Honestly, though, I've never used a software tool that had an 'HC0' covariance type available.
>
> So I'm asking a vague question here... but how would you approach this problem? Or is it a problem?

There could be three or four reasons in general in GLM that HC0 differs a lot from the "nonrobust" cov_params, that I can think of:

1) A bug: not ruled out, given that the combination of cov_type, weights, and Tweedie is relatively recent code. Given that your other cases look good, it's not very likely.

2) Dispersion: in models with fixed scale, HC0 captures over- or underdispersion. This can have a huge effect, for example in Poisson, but it shouldn't affect Tweedie if the scale is correctly estimated.

4) A minor issue, I think: default IRLS uses the expected information matrix, while HC0 also uses the Hessian. There can be numerical differences even if the model is correctly specified.
So running sm.OLS(res.resid_params, model.exog).fit() doesn't really return anything *too* weird.
There are a few factors with high params (the biggest was a little over 1). These factors tend to be very rare features. For example, a driver with a probationary license (called an SR-22) has a value around 1.2. When I removed that factor, the next highest was 0.3, but the z-scores were still poor.
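Rare indicator columns like that have very few observations behind their coefficient, and the robust (HC0) variance can punish them much harder than the model-based one. A quick way to spot them is the nonzero share per design-matrix column (the SR-22 flag is from the email above; the frequencies here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# toy design matrix: intercept, a common indicator, and a rare one
# (standing in for something like an SR-22 flag)
exog = np.column_stack([
    np.ones(1000),
    rng.integers(0, 2, 1000).astype(float),
    (rng.random(1000) < 0.005).astype(float),  # ~0.5% of rows
])
# share of nonzero entries per column; near-zero shares flag
# coefficients that are barely identified by the data
freq = (exog != 0).mean(axis=0)
print(freq)
```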
I tried again to fit the model using method='bfgs', method='lbfgs', and method='newton'. Both 'bfgs' and 'lbfgs' failed with these error messages:

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\base\model.py:473: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
  'available', HessianInversionWarning)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-70c747b28c1c> in <module>()
     77 del bx
     78
---> 79 res = model.fit(rtol=1e-2, tol_criterion='params', maxiter=20, method='lbfgs')  # HC0 causes really high P>|z|

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\genmod\generalized_linear_model.py in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, full_output, disp, max_start_irls, **kwargs)
   1000                 cov_kwds=cov_kwds, use_t=use_t,
   1001                 max_start_irls=max_start_irls,
-> 1002                 **kwargs)
   1003
   1004     def _fit_gradient(self, start_params=None, method="newton",

C:\Anaconda3\lib\site-packages\statsmodels-0.8.0-py3.5-win-amd64.egg\statsmodels\genmod\generalized_linear_model.py in _fit_gradient(self, start_params, method, maxiter, tol, full_output, disp, scale, cov_type, cov_kwds, use_t, max_start_irls, **kwargs)
   1026
   1027         glm_results = GLMResults(self, rslt.params,
-> 1028                                  rslt.normalized_cov_params / scale,
   1029                                  scale,
   1030                                  cov_type=cov_type, cov_kwds=cov_kwds,

TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'
'newton' didn't error out, but the cov_params were completely nonsensical: lots of NaNs and extremely large/small values.
I've never been able to get the scipy optimizers to work on my datasets. Perhaps that's a bug?