When fitting a GLM with Tweedie, I've found that np.sum(endog) != np.sum(fittedmu). I'm fairly confident it has something to do with the way that "exposure" is handled in Tweedie. As a refresher, endog is loss, which is the amount of money lost due to a claim. Exposure is the length of time (in years) a policy is in force.

Usually it's been off by only 5-10%, so I've ignored it (shame on me). I am now playing with a segment of the business where the distortion is pretty extreme, and I can't figure out exactly what's happening. Here's a gist of the behavior: you can see in the data that the sum of mu is 1.43 times the sum of endog. Additionally, I don't think the Hessian is positive-definite, which is also really bad. (See cell 9.)

As a "quick" fix, I realized that I can just make np.log(exposure) part of exog. Is there anything inherently wrong with this?
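For reference, here is roughly what I'm running (a minimal sketch; `data` is assumed to be a pandas DataFrame with the `loss` and `exposure` columns from my dataset, and the Tweedie family is used with its default log link):

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Intercept-only Tweedie fit, exposure passed via the exposure keyword
# (only valid with a log link).
mod = smf.glm("loss ~ 1", data=data,
              exposure=data['exposure'],
              family=sm.families.Tweedie(var_power=1.55))
res = mod.fit()

# I would expect these sums to match; on this segment the ratio is ~1.43.
print(res.fittedvalues.sum() / data['loss'].sum())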
If you care to see the data, I've saved it here:

I appreciate any input/thoughts on this. I'll try to get this running in R... I have a feeling that this behavior will be duplicated there. Thanks.
As I think about it, the assumption that a policy that is twice as long will have twice as many claims is problematic. Practically, when a claim occurs it's likely that a change to the policy occurs shortly thereafter. For example, if a vehicle is totaled in the first week of the policy, the policyholder is likely to swap out vehicles (which actually means a new record will be created). Alternately, a claim might occur and lead us to new information about the policy which causes underwriting to cancel it.
Maybe using log(exposure) in exog is the way to go... As you can see, the estimated coefficient for log(exposure) is only 0.31, which suggests that claims don't scale linearly with "exposure".
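In code, the comparison I mean is something like this (a sketch only, continuing from the imports above; the 0.31 is what I see on my data, not something implied by the snippet itself):

# Estimate the exposure elasticity freely instead of pinning it at 1
# via the exposure/offset mechanism.
res_free = smf.glm("loss ~ 1 + np.log(exposure)", data=data,
                   family=sm.families.Tweedie(var_power=1.55)).fit()

# With exposure handled as an offset this coefficient would be exactly 1;
# estimated freely it hovers around 0.31 for me.
print(res_free.params['np.log(exposure)'])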
So this dataset is really so-called "collision" coverage on older, cheaper cars (generally under $10k). There is no liability to 3rd parties in the data, only to the insured and his lien-holder (if applicable). Multiple claims on a record are definitely possible. Of the 970k records, 16k have 1 claim, 260 have 2 claims, and 7 have 3 claims. These claims can be anything from a fender bender in a parking lot, to a sideswipe, to a head-on collision at full speed.

The claim can result in a decision that the vehicle is repairable or totaled. Vehicles are declared a total loss (totaled) when the repair cost exceeds the vehicle's value. When a vehicle is totaled, the insurer essentially buys the vehicle from the insured and sells it for salvage value. (The endog is really paid loss + unpaid estimate - salvage - subrogation.)

Also, if the policy was sold to a 45-year-old married female who stated she lived alone and was the only driver, but her undeclared 16-year-old son was driving at the time of the accident, underwriting is likely to order the policy be cancelled. By law, we can't use this as an excuse to not pay the claim, but we can refuse to continue coverage. When a claim occurs, the chances that the policy is cancelled mid-term escalate dramatically. It's not a sure thing either way... it really depends on the claim's value.
When I add more to the exog (driver age, gender, vehicle age, risk scores/classes), the param for np.log(exposure) hovers around 0.31... far less than the 1 that I would've expected.
As always, Josef, you are right. Incorporating np.log(exposure) into exog changes some parameters in ways I'm not comfortable with.

Thanks for your suggestions. I can definitely try out most of them. I thought about building models only on renewal business, as that is viewed as a safer risk (people who switch insurers frequently are very active claim-filers). "Linking" new records to old might be a little too tall of an order, just because I don't think the keys are there. However, I could calculate exposure as (Scheduled Expiration Date - Effective Date).

The really bothersome part to me is that simply fitting a log-link Tweedie model with var_power 1.55 on this dataset, even if the exog is only the intercept, results in sum(fittedmu) being 1.43 times larger than sum(endog) and a non-positive-definite Hessian.
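This is roughly how I'm checking the Hessian (a sketch, reusing `res` from the intercept-only fit sketched above; it assumes GLM exposes a hessian(params) method on the version in use, and reads "positive definite" as referring to the observed information, i.e. minus the log-likelihood Hessian):

# At a proper maximum the log-likelihood Hessian should be negative
# definite, so all eigenvalues of -H should be strictly positive.
H = res.model.hessian(res.params)
print(np.linalg.eigvalsh(-H).min())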
It's still possible that there is a bug in the tweedie exposure handling. It's likely that exposure isn't included somewhere. Is `mu` supposed to be with or without exposure? (I will still need to go back to the code to remember the details.)
Did you try using `predict(exog, exposure=exposure)` directly?
Also check how Poisson and Gamma with log-link are doing on this dataset, to see if there is anything tweedie specific for the non-positive definite hessian.
And does tweedie work as expected with this dataset if you leave out exposure completely, i.e. pretend everyone has the same exposure? In the data, are observations per year or over the lifetime of a policy? If the latter, then exposures might differ too much to ignore.
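Roughly, the checks I have in mind (a sketch only; it reuses `data`, the imports, and the intercept-only fit `res` from the snippets earlier in the thread):

# 1. predict() with exposure passed explicitly; this should agree with res.mu.
pred = res.predict(exog=data, exposure=data['exposure'])
print(np.allclose(pred, res.mu))

# 2. The same fit with exposure left out entirely (everyone treated equally).
res_noexp = smf.glm("loss ~ 1", data=data,
                    family=sm.families.Tweedie(var_power=1.55)).fit()
print(res_noexp.fittedvalues.sum() / data['loss'].sum())

# 3. Poisson with log link on the same data, to see whether the
#    non-positive-definite Hessian is Tweedie-specific. (The analogous
#    Gamma/log-link check would need the zero-loss records excluded,
#    since Gamma requires strictly positive endog.)
res_pois = smf.glm("loss ~ 1", data=data,
                   exposure=data['exposure'],
                   family=sm.families.Poisson()).fit()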
I checked that I get the same behavior in R, insofar as sum(fittedmu) is 1.43 times bigger than sum(endog). I haven't checked the Hessian piece (never did that in R... not sure how; the documentation is a little confusing and R's packages can be quite fragmented).
This is with master... Yes, R gives the exact same prediction as statsmodels and shows sum(fittedmu) / sum(endog) == 1.43.
I haven't checked that R is calculating the Hessian the same way (not sure why it wouldn't)...
I was playing for a while with it, but don't find any explanation. I think bugs are ruled out in the exposure handling, especially since R has the same results; predict and everything looks fine. The main reason is that the parameter estimate (I only looked at the simple case with a constant only) doesn't satisfy the aggregation / zero-mean-residual property.

My next guess was numerical problems with underflow or something, because the exposures are so small. But nothing looks bad enough that this would be a likely source.
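In code, the aggregation / zero-mean-residual property mentioned above amounts to this check (a sketch, using the intercept-only log-link fit from the earlier snippets):

# For the log-link fit the response residuals do not average to zero;
# with the canonical link they would (see the demonstration below).
print(res.resid_response.mean())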
So if I understand this, currently in generalized_linear_model, score_factor and hessian_factor use only params to get the mu. Are you saying we should probably add offset and exposure to that calculation? Seems like an easy(-ish) PR.
Do you think that would fix the non-positive-definiteness that I'm seeing?
As far as I could see, `mu` correctly includes offset_exposure through the `predict` calculation, so I didn't see anything that needs changing.

The main issue is whether we need to fix our expectation or whether there is something to fix in the model. I'm currently tending towards the former.
If I interpret the SAS manual correctly (it only mentions the canonical parameter, but doesn't explicitly say what the canonical link for tweedie is), the canonical link is the power link with link_power = 1 - var_power. Now verified with the SPSS manual. With the canonical link, resid_response.mean() is zero, while it is not zero for the log link:
>>> mod3 = smf.glm("loss ~ 1 + np.log(data['exposure'])", data=data, family=sm.families.Tweedie(var_power=1.55, link_power=-0.55))
>>> res3 = mod3.fit()
>>> res3.resid_response.mean()
-1.9984608423258687e-12
>>> res3.mu.mean()
49.315994688906272
>>> res3.fittedvalues.mean()
49.31599468881241
Related: one older issue where I was a bit suspicious about the interpretation of exposure: https://github.com/statsmodels/statsmodels/issues/2175

Over the last year I started to think that this use case can also be handled through var_weights, which would most likely create another series of questions like "How can I use var_weights to include exposure?".

Josef
Thanks for the discussion, Josef. Very helpful. Intuitively, it makes sense that the policy length should scale the Poisson part of the distribution but not the gamma part, and handling that through var_weights might be the only option.

I did run across this post on a web forum for how to define tweedie:

> Glm <- glm(Y ~ X1 + X2 + X3 + X4,
>            family=tweedie(var.power=tweedie.p, link.power=0),
>            data=Collision, offset=log(Exposure),
>            weights=Exposure^(tweedie.p - 1))

I haven't done the math (and would likely mess it up in any event), but in R, when I use the same weights on my data, I get sum(endog) == sum(mu):

model <- glm(loss ~ 1,
             family = tweedie(var.power=1.55, link.power = 0),
             offset = log(exposure),
             data = data,
             weights = exposure ^ 0.55)
sum(model$fitted.values) / sum(data$loss)
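For reference, the statsmodels analogue of that R call would be something like this (a sketch; it assumes a statsmodels version where GLM accepts var_weights, and that R's prior weights map onto var_weights here):

# Same model in statsmodels: log link (Tweedie's default), exposure as an
# offset on the log scale, and exposure**(p - 1) as variance weights.
mod_w = smf.glm("loss ~ 1", data=data,
                offset=np.log(data['exposure']),
                var_weights=np.asarray(data['exposure']) ** 0.55,
                family=sm.families.Tweedie(var_power=1.55))
res_w = mod_w.fit()

# With these weights the ratio should come out at ~1.0 rather than 1.43.
print(res_w.fittedvalues.sum() / data['loss'].sum())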
I can't find the reasoning for the "exposure^(p - 1)" thing, but it's close enough that I don't think it's a coincidence. When I tried weights = exposure^0.5 or 0.6, it no longer matched. Maybe exposure^(p - 1) is backing out the non-Poisson piece from the variance weights?
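One possible derivation of the exposure^(p - 1) weight (a sketch; it takes the Tweedie scaling property and the GLM score equations as given, so treat it as a plausibility argument rather than a proof):

1. Scaling: if $Z \sim \mathrm{Tw}_p(m, \phi)$ then $cZ \sim \mathrm{Tw}_p(cm,\, c^{2-p}\phi)$. If the pure premium $Z_i = Y_i / E_i$ (loss per unit of exposure) is modeled as $\mathrm{Tw}_p(m_i,\, \phi / E_i)$, the usual "var_weights = exposure" setup, then the total loss $Y_i = E_i Z_i$ follows $\mathrm{Tw}_p(E_i m_i,\, \phi / E_i^{p-1})$. So modeling total loss with offset $\log E_i$ gets the mean right, and the prior weight $w_i = E_i^{p-1}$ reproduces the per-observation dispersion $\phi / w_i$, which is the weights = Exposure^(tweedie.p - 1) in that forum post (0.55 for p = 1.55).

2. Aggregation: with log link, offset $\log E_i$, and weights $w_i$, the intercept's score equation is $\sum_i w_i (y_i - \mu_i)\, \mu_i^{1-p} = 0$. For the intercept-only fit $\mu_i = E_i e^{\beta_0}$, so $\mu_i^{1-p} \propto E_i^{1-p}$, and choosing $w_i = E_i^{p-1}$ cancels the exposure factor, leaving $\sum_i (y_i - \mu_i) = 0$, i.e. sum(fittedmu) == sum(endog). With $w_i = E_i^{0.5}$ or $E_i^{0.6}$ the cancellation is only approximate, which would explain why those no longer matched exactly.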