Why am I getting nan on the first guess of the deviance function?

Jordan Howell

Aug 22, 2019, 9:43:52 AM
to pystatsmodels
Hello,

I'm trying to run a GLM with a Tweedie family and keep getting the following error. I'm not sure what to do to clear this up.

Warning and error below:

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\families\family.py:1427: RuntimeWarning: invalid value encountered in sqrt
endog * mu ** (1-p) / (1 - p) + mu ** (2 - p) / (2 - p))

 ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-282-bb94ac0700a8> in <module>
----> 1 pricing_model_comp_1_results = pricing_model_comp_1.fit()

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in fit(self, start_params, maxiter, method, tol, scale, cov_type, cov_kwds, use_t, full_output, disp, max_start_irls, **kwargs)
   1010             return self._fit_irls(start_params=start_params, maxiter=maxiter,
   1011                                   tol=tol, scale=scale, cov_type=cov_type,
-> 1012                                   cov_kwds=cov_kwds, use_t=use_t, **kwargs)
   1013         else:
   1014             self._optim_hessian = kwargs.get('optim_hessian')

C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\genmod\generalized_linear_model.py in _fit_irls(self, start_params, maxiter, tol, scale, cov_type, cov_kwds, use_t, **kwargs)
   1107                                    self.freq_weights, self.scale)
   1108         if np.isnan(dev):
-> 1109             raise ValueError("The first guess on the deviance function "
   1110                              "returned a nan.  This could be a boundary "
   1111                              " problem and should be reported.")

ValueError: The first guess on the deviance function returned a nan.  This could be a boundary  problem and should be reported.


Code below:

# modeling
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

formula = 'target ~ eff_year + C(STATE) + rba_model + driver_age_model + marital_status_model_S + \
 marital_status_model_not_available + vehicle_age_model + length_ft_model + yrs_owned_model + \
 cm_ded_model + majorvio + minorvio + atfault + DTMND_VEH_TYPE_CD_AH + DTMND_VEH_TYPE_CD_AN + \
 DTMND_VEH_TYPE_CD_AU + DTMND_VEH_TYPE_CD_FW + DTMND_VEH_TYPE_CD_PC + DTMND_VEH_TYPE_CD_ST + \
 DTMND_VEH_TYPE_CD_SU + DTMND_VEH_TYPE_CD_TC + DTMND_VEH_TYPE_CD_TH + DTMND_VEH_TYPE_CD_UT'

# turn formula into a matrix of data for the model
y, x = patsy.dmatrices(formula, data, return_type='dataframe')

weight = data['cmeu']

model_1 = sm.GLM(y, x,
                 family=sm.families.Tweedie(link=sm.families.links.log, var_power=1.5),
                 weights=weight)

model_1_results = model_1.fit()





Peter Quackenbush

Aug 22, 2019, 11:01:41 AM
to pystat...@googlegroups.com
Instead of weights=weight use var_weight=weight

Can you verify that...

All values of weight are strictly > 0.

All values of y are >=0
--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/7e7c2603-389b-4b4b-a455-3f7700c5f840%40googlegroups.com.

Peter Quackenbush

Aug 22, 2019, 11:12:23 AM
to pystat...@googlegroups.com
Another couple checks... 

np.isnan(model_1.exog).sum() should be 0
np.isnan(model_1.endog).sum() should be 0 
(model_1.endog < 0).sum() should be 0 
(model_1.var_weights <= 0).sum() should be 0 
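
The checks listed above can be bundled into one helper that is run on the raw arrays before the model is built. This is only a sketch using NumPy; the function name count_glm_input_problems and the toy arrays are made up for illustration and are not part of statsmodels.

```python
import numpy as np


def count_glm_input_problems(endog, exog, var_weights):
    """Count values that a Tweedie GLM is likely to choke on.

    Hypothetical helper (not a statsmodels function). Every count
    should be 0 before calling fit().
    """
    endog = np.asarray(endog, dtype=float)
    exog = np.asarray(exog, dtype=float)
    var_weights = np.asarray(var_weights, dtype=float)
    return {
        "nan_exog": int(np.isnan(exog).sum()),
        "nan_endog": int(np.isnan(endog).sum()),
        "negative_endog": int((endog < 0).sum()),
        "nonpositive_weights": int((var_weights <= 0).sum()),
    }


# Toy data: one negative response and one zero weight.
toy_endog = np.array([0.0, 3.2, -1.5, 0.7])
toy_exog = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 0.1], [1.0, 0.9]])
toy_weights = np.array([1.0, 0.5, 2.0, 0.0])

problems = count_glm_input_problems(toy_endog, toy_exog, toy_weights)
for name, count in problems.items():
    print(f"{name}: {count}")
```

Any nonzero count points at the rows to inspect before fitting.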

Peter Quackenbush

Aug 22, 2019, 11:36:10 AM
to pystat...@googlegroups.com
(I meant var_weights=weight, with an "s".)

Note that GLM allows either var_weights or freq_weights. Smarter people can explain the difference better than I can, but var_weights are analogous to the weights in WLS. If you believe that variance scales linearly with exposure (time) in a Tweedie distribution, that's the way to go.

If your weights represent repeated observations, then you want freq_weights.





Jordan Howell

Aug 22, 2019, 12:04:23 PM
to pystat...@googlegroups.com
Yep. I had eight rows where y < 0. It runs now! Thank you.
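
For anyone hitting the same error, a rough sketch of why a negative y produces a nan. It uses the textbook Tweedie unit deviance for 1 < p < 2, written out by hand rather than copied from the statsmodels source, though its last two terms match the line quoted in the warning. The first term raises y to the power 2 - p, which is 0.5 here, and NumPy returns nan for a fractional power of a negative number. One nan is enough to make the summed deviance nan.

```python
import numpy as np


def tweedie_unit_deviance(y, mu, p=1.5):
    """Textbook Tweedie unit deviance for 1 < p < 2 (hand-written sketch)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Silence the "invalid value" RuntimeWarning so the nan can be inspected.
    with np.errstate(invalid="ignore"):
        dev = (y ** (2 - p) / ((1 - p) * (2 - p))
               - y * mu ** (1 - p) / (1 - p)
               + mu ** (2 - p) / (2 - p))
    return 2 * dev


y_demo = np.array([0.0, 2.0, -1.0])    # last value is negative
mu_demo = np.ones_like(y_demo)         # any positive starting mean

dev_demo = tweedie_unit_deviance(y_demo, mu_demo)
print(dev_demo)
print("total deviance is nan:", bool(np.isnan(dev_demo.sum())))
```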



--
Respectfully,

Jordan Howell
Principal Data Scientist
Candid Truth DSC
www.candidtruthdsc.com
253-266-8088