When fitting a GLM with Tweedie, I've found that np.sum(endog) != np.sum(fittedmu). I'm fairly confident it has something to do with the way that "exposure" is handled in Tweedie. As a refresher, endog is loss, which is the amount of money lost due to a claim. Exposure is the length of time (in years) a policy is in force.

Usually it's been off by only 5-10%, so I've ignored it (shame on me). I am now playing with a segment of the business where the distortion is pretty extreme, and I can't figure out exactly what's happening. Here's a gist of the behavior: you can see in the data that the sum of mu is 1.43 times the sum of endog. Additionally, I don't think the Hessian is positive-definite, which is also really bad. (See cell 9.)

As a "quick" fix, I realized that I can just make np.log(exposure) part of exog. Is there anything inherently wrong with this?
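For reference, here is roughly what I'm running (a minimal sketch; `data` is assumed to be a pandas DataFrame with the `loss` and `exposure` columns from my dataset, and the Tweedie family is used with its default log link):

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Intercept-only Tweedie fit, exposure passed via the exposure keyword
# (only valid with a log link).
mod = smf.glm("loss ~ 1", data=data,
              exposure=data['exposure'],
              family=sm.families.Tweedie(var_power=1.55))
res = mod.fit()

# I would expect these sums to match; on this segment the ratio is ~1.43.
print(res.fittedvalues.sum() / data['loss'].sum())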
If you care to see the data, I've saved it here:

I appreciate any input/thoughts on this. I'll try to get this running in R... I have a feeling that this behavior will be duplicated there. Thanks.
As I think about it, the assumption that a policy that is twice as long will have twice as many claims is problematic. Practically, when a claim occurs it's likely that a change to the policy occurs shortly thereafter. For example, if a vehicle is totaled in the first week of the policy, the policyholder is likely to swap out vehicles (which actually means a new record will be created). Alternately, a claim might occur and lead us to new information about the policy which causes underwriting to cancel it.
Maybe using log(exposure) in exog is the way to go... As you can see, the estimated coefficient for log(exposure) is only 0.31, which suggests that claims don't scale linearly with "exposure".
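In code, the comparison I mean is something like this (a sketch only, continuing from the imports above; the 0.31 is what I see on my data, not something implied by the snippet itself):

# Estimate the exposure elasticity freely instead of pinning it at 1
# via the exposure/offset mechanism.
res_free = smf.glm("loss ~ 1 + np.log(exposure)", data=data,
                   family=sm.families.Tweedie(var_power=1.55)).fit()

# With exposure handled as an offset this coefficient would be exactly 1;
# estimated freely it hovers around 0.31 for me.
print(res_free.params['np.log(exposure)'])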
So this dataset is really so-called "collision" coverage on older, cheaper cars (generally under $10k). There is no liability to 3rd parties in the data, only to the insured and his lien-holder (if applicable). Multiple claims on a record are definitely possible. Of the 970k records, 16k have 1 claim, 260 have 2 claims, and 7 have 3 claims. These claims can be anything from a fender bender in a parking lot, to a sideswipe, to a head-on collision at full speed.

The claim can result in a decision that the vehicle is repairable or totaled. Vehicles are declared a total loss (totaled) when the repair cost exceeds the vehicle's value. When a vehicle is totaled, the insurer essentially buys the vehicle from the insured and sells it for salvage value. (The endog is really paid loss + unpaid estimate - salvage - subrogation.)

Also, if the policy was sold to a 45-year-old married female who stated she lived alone and was the only driver, but her undeclared 16-year-old son was driving at the time of the accident, underwriting is likely to order the policy be cancelled. By law, we can't use this as an excuse to not pay the claim, but we can refuse to continue coverage. When a claim occurs, the chances that the policy is cancelled mid-term escalate dramatically. It's not a sure thing either way... it really depends on the claim's value.
When I add more to the exog (driver age, gender, vehicle age, risk scores/classes), the param for np.log(exposure) hovers around 0.31... far less than the 1 that I would've expected.
As always, Josef, you are right. Incorporating np.log(exposure) into exog changes some parameters in ways I'm not comfortable with.

Thanks for your suggestions. I can definitely try out most of them. I thought about building models only on renewal business, as that is viewed as a safer risk (people who switch insurers frequently are very active claim-filers). "Linking" new records to old might be a little too tall of an order, just because I don't think the keys are there. However, I could calculate exposure as (Scheduled Expiration Date - Effective Date).

The really bothersome part to me is that simply fitting a log-link Tweedie model with var_power 1.55 on this dataset, even if the exog is only the intercept, results in sum(fittedmu) being 1.43 times larger than sum(endog) and a non-positive-definite Hessian.
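This is roughly how I'm checking the Hessian (a sketch, reusing `res` from the intercept-only fit sketched above; it assumes GLM exposes a hessian(params) method on the version in use, and reads "positive definite" as referring to the observed information, i.e. minus the log-likelihood Hessian):

# At a proper maximum the log-likelihood Hessian should be negative
# definite, so all eigenvalues of -H should be strictly positive.
H = res.model.hessian(res.params)
print(np.linalg.eigvalsh(-H).min())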
It's still possible that there is a bug in the tweedie exposure handling. It's likely that exposure isn't included somewhere. Is `mu` supposed to be with or without exposure? (I will still need to go back to the code to remember the details.)
Did you try using `predict(exog, exposure=exposure)` directly?
Also check how Poisson and Gamma with log-link are doing on this dataset, to see if there is anything tweedie specific for the non-positive definite hessian.
And does tweedie work as expected with this dataset if you leave out exposure completely, i.e. pretend everyone has the same exposure? In the data, are observations per year or over the lifetime of a policy? If the latter, then exposures might differ too much to ignore.
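Roughly, the checks I have in mind (a sketch only; it reuses `data`, the imports, and the intercept-only fit `res` from the snippets earlier in the thread):

# 1. predict() with exposure passed explicitly; this should agree with res.mu.
pred = res.predict(exog=data, exposure=data['exposure'])
print(np.allclose(pred, res.mu))

# 2. The same fit with exposure left out entirely (everyone treated equally).
res_noexp = smf.glm("loss ~ 1", data=data,
                    family=sm.families.Tweedie(var_power=1.55)).fit()
print(res_noexp.fittedvalues.sum() / data['loss'].sum())

# 3. Poisson with log link on the same data, to see whether the
#    non-positive-definite Hessian is Tweedie-specific. (The analogous
#    Gamma/log-link check would need the zero-loss records excluded,
#    since Gamma requires strictly positive endog.)
res_pois = smf.glm("loss ~ 1", data=data,
                   exposure=data['exposure'],
                   family=sm.families.Poisson()).fit()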
I checked that I get the same behavior in R, insofar as sum(fittedmu) is 1.43 times bigger than sum(endog). I haven't checked the Hessian piece (never did that in R... not sure how; the documentation is a little confusing and R's packages can be quite fragmented).
This is with master... Yes, R gives the exact same prediction as statsmodels and shows sum(fittedmu) / sum(endog) == 1.43.
I haven't checked that R is calculating the Hessian the same way (not sure why it wouldn't)...
I was playing for a while with it, but don't find any explanation. I think bugs are ruled out in the exposure handling, especially since R has the same results; predict and everything looks fine. The main reason is that the parameter estimate (I only looked at the simple case with a constant only) doesn't satisfy the aggregation / zero-mean-residual property.

My next guess was numerical problems with underflow or something, because the exposures are so small. But nothing looks bad enough that this would be a likely source.
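In code, the aggregation / zero-mean-residual property mentioned above amounts to this check (a sketch, using the intercept-only log-link fit from the earlier snippets):

# For the log-link fit the response residuals do not average to zero;
# with the canonical link they would (see the demonstration below).
print(res.resid_response.mean())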
So if I understand this, currently in generalized_linear_model, score_factor and hessian_factor use only params to get the mu. Are you saying we should probably add offset and exposure to that calculation? Seems like an easy(-ish) PR.
Do you think that would fix the non-positive-definiteness that I'm seeing?
As far as I could see, `mu` correctly includes offset_exposure through the `predict` calculation, so I didn't see anything that needs changing.

The main issue is whether we need to fix our expectation or whether there is something to fix in the model. I'm currently tending towards the former.
If I interpret the SAS manual correctly (it only mentions the canonical parameter, but doesn't explicitly say what the canonical link for tweedie is), the canonical link is the power link with link_power = 1 - var_power. Now verified with the SPSS manual. With the canonical link, resid_response.mean() is zero, while it is not zero for the log link:
>>> mod3 = smf.glm("loss ~ 1 + np.log(data['exposure'])", data=data, family=sm.families.Tweedie(var_power=1.55, link_power=-0.55))
>>> res3 = mod3.fit()
>>> res3.resid_response.mean()
-1.9984608423258687e-12
>>> res3.mu.mean()
49.315994688906272
>>> res3.fittedvalues.mean()
49.31599468881241
Related: one older issue where I was a bit suspicious about the interpretation of exposure: https://github.com/statsmodels/statsmodels/issues/2175

Over the last year I started to think that this use case can also be handled through var_weights, which would most likely create another series of questions like "How can I use var_weights to include exposure?".

Josef
Thanks for the discussion, Josef. Very helpful. Intuitively, it makes sense that the policy length should scale the Poisson part of the distribution but not the gamma part, and handling that through var_weights might be the only option.

I did run across this post on a web forum for how to define tweedie:

> Glm <- glm(Y ~ X1 + X2 + X3 + X4,
>            family=tweedie(var.power=tweedie.p, link.power=0),
>            data=Collision, offset=log(Exposure),
>            weights=Exposure^(tweedie.p - 1))

I haven't done the math (and would likely mess it up in any event), but in R, when I use the same weights on my data, I get sum(endog) == sum(mu):

model <- glm(loss ~ 1,
             family = tweedie(var.power=1.55, link.power = 0),
             offset = log(exposure),
             data = data,
             weights = exposure ^ 0.55)
sum(model$fitted.values) / sum(data$loss)
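For reference, the statsmodels analogue of that R call would be something like this (a sketch; it assumes a statsmodels version where GLM accepts var_weights, and that R's prior weights map onto var_weights here):

# Same model in statsmodels: log link (Tweedie's default), exposure as an
# offset on the log scale, and exposure**(p - 1) as variance weights.
mod_w = smf.glm("loss ~ 1", data=data,
                offset=np.log(data['exposure']),
                var_weights=np.asarray(data['exposure']) ** 0.55,
                family=sm.families.Tweedie(var_power=1.55))
res_w = mod_w.fit()

# With these weights the ratio should come out at ~1.0 rather than 1.43.
print(res_w.fittedvalues.sum() / data['loss'].sum())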
I can't find the reasoning for the "exposure^(p - 1)" thing, but it's close enough that I don't think it's a coincidence. When I tried weights = exposure^0.5 or 0.6, it no longer matched. Maybe exposure^(p - 1) is backing out the non-Poisson piece from the variance weights?
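One possible derivation of the exposure^(p - 1) weight (a sketch; it takes the Tweedie scaling property and the GLM score equations as given, so treat it as a plausibility argument rather than a proof):

1. Scaling: if $Z \sim \mathrm{Tw}_p(m, \phi)$ then $cZ \sim \mathrm{Tw}_p(cm,\, c^{2-p}\phi)$. If the pure premium $Z_i = Y_i / E_i$ (loss per unit of exposure) is modeled as $\mathrm{Tw}_p(m_i,\, \phi / E_i)$, the usual "var_weights = exposure" setup, then the total loss $Y_i = E_i Z_i$ follows $\mathrm{Tw}_p(E_i m_i,\, \phi / E_i^{p-1})$. So modeling total loss with offset $\log E_i$ gets the mean right, and the prior weight $w_i = E_i^{p-1}$ reproduces the per-observation dispersion $\phi / w_i$, which is the weights = Exposure^(tweedie.p - 1) in that forum post (0.55 for p = 1.55).

2. Aggregation: with log link, offset $\log E_i$, and weights $w_i$, the intercept's score equation is $\sum_i w_i (y_i - \mu_i)\, \mu_i^{1-p} = 0$. For the intercept-only fit $\mu_i = E_i e^{\beta_0}$, so $\mu_i^{1-p} \propto E_i^{1-p}$, and choosing $w_i = E_i^{p-1}$ cancels the exposure factor, leaving $\sum_i (y_i - \mu_i) = 0$, i.e. sum(fittedmu) == sum(endog). With $w_i = E_i^{0.5}$ or $E_i^{0.6}$ the cancellation is only approximate, which would explain why those no longer matched exactly.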