gamma glm with canonical link in stan


Marco Inacio

Feb 6, 2014, 10:05:40 PM
to stan-...@googlegroups.com
Hi everybody. I'm intending to do a gamma regression using Stan, and according to what I read, the gamma distribution's canonical link is:

$\beta X = -1/\mu$  (where $\mu$ is the mean)

And since $\mu = \alpha/\beta$ and $\alpha, \beta > 0$,

then $\beta X < 0$, which seems like a problematic condition.

Did I get it right? If so, does Stan work fine with such a limitation on the possible values of $\beta$, or do I need to configure it somehow, or would you just recommend using another link function like $\beta X = \log(\mu)$, which forces a positive $\mu$?

(I hope Google Groups accepts images in emails.)

Bob Carpenter

Feb 7, 2014, 10:12:31 AM
to stan-...@googlegroups.com
I have no idea how people typically do this. There's no
good way to constrain a vector beta so that for arbitrary
vector x we have x' * beta > 0 (or x' * beta < 0).

You could always exponentiate, but that's a different link
function. That does seem to be what R does judging from the
doc for its glm() function:

# A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)
clotting <- data.frame(
    u = c(5,10,15,20,30,40,60,80,100),
    lot1 = c(118,58,42,35,27,25,21,19,18),
    lot2 = c(69,35,26,21,18,16,13,12,12))
summary(glm(lot1 ~ log(u), data=clotting, family=Gamma))
summary(glm(lot2 ~ log(u), data=clotting, family=Gamma))
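For reference, a gamma regression with a log link is also easy to write directly in Stan. A minimal sketch (the names N, x, y, beta0, beta1, and shape are just placeholders; the gamma is parameterized by shape and rate, with rate = shape / mu so that the mean is mu):

data {
  int<lower=1> N;
  vector[N] x;
  vector<lower=0>[N] y;
}
parameters {
  real beta0;
  real beta1;
  real<lower=0> shape;
}
model {
  vector[N] mu;
  mu <- exp(beta0 + beta1 * x);            // log link keeps mu positive for any beta
  for (n in 1:N)
    y[n] ~ gamma(shape, shape / mu[n]);    // rate = shape / mu[n], so E[y[n]] = mu[n]
}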

- Bob


Marco Inacio

Feb 7, 2014, 11:43:07 AM
to stan-...@googlegroups.com


On 14-02-07 01:12 PM, Bob Carpenter wrote:
I have no idea how people typically do this.  There's no
good way to constrain a vector beta so that for arbitrary
vector x we have x' * beta > 0 (or x' * beta < 0).

I'm happy to see I'm not the only one finding this strange.



You could always exponentiate, but that's a different link
function.  That does seem to be what R does judging from the
doc for its glm() function:

# A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)
clotting <- data.frame(
    u = c(5,10,15,20,30,40,60,80,100),
    lot1 = c(118,58,42,35,27,25,21,19,18),
    lot2 = c(69,35,26,21,18,16,13,12,12))
summary(glm(lot1 ~ log(u), data=clotting, family=Gamma))
summary(glm(lot2 ~ log(u), data=clotting, family=Gamma))

- Bob

That's not exactly another link function (that would be glm(lot1 ~ u, data=clotting, family=Gamma(link="log"))). Doing this, I think we would still have problems for big values of $u$ if $\beta_1 > 0$ and $\beta_0 < 0$, because then $\beta_0 + \beta_1 u > 0$ and hence $\mu < 0$, but this will probably only happen in places where we are extrapolating $u$ anyway.

I guess what I might need to do is to discover [how to discover] the "correct" distribution to use in my data, but I think this falls outside the scope of the mailing list.

Marcus Brubaker

Feb 7, 2014, 12:09:40 PM
to stan-...@googlegroups.com
On Fri, Feb 7, 2014 at 10:12 AM, Bob Carpenter <ca...@alias-i.com> wrote:
I have no idea how people typically do this.  There's no
good way to constrain a vector beta so that for arbitrary
vector x we have x' * beta > 0 (or x' * beta < 0).

Actually this isn't too hard to do in Stan if one of those vectors is data.  Let's say x is a fixed N-dimensional data vector.  Then use something like Gram-Schmidt to find a set of N-1 unit vectors v_i which are orthogonal to x (i.e., such that dot_product(x, v_i) = 0).  Then \beta = \alpha x + \sum_i \gamma_i v_i, where the \gamma_i are unconstrained and \alpha is constrained to be negative.  It's then simple to show that dot_product(x, \beta) will always be negative.
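A sketch of how that construction might look in Stan, assuming the unit vectors v_i are computed outside Stan and passed in as the columns of an N x (N-1) matrix V (in R, for example, V could be taken as the last N-1 columns of qr.Q(qr(cbind(x)), complete = TRUE)):

data {
  int<lower=2> N;
  vector[N] x;          // fixed data vector
  matrix[N, N - 1] V;   // columns are unit vectors orthogonal to x
}
parameters {
  real<upper=0> alpha;  // component along x, constrained to be negative
  vector[N - 1] gamma;  // unconstrained components along the v_i
}
transformed parameters {
  vector[N] beta;
  beta <- alpha * x + V * gamma;
  // dot_product(x, beta) = alpha * dot_product(x, x) < 0 by construction
}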

Cheers,
Marcus

 

Marco Inacio

Feb 8, 2014, 10:52:36 AM
to stan-...@googlegroups.com



On 14-02-07 03:09 PM, Marcus Brubaker wrote:
On Fri, Feb 7, 2014 at 10:12 AM, Bob Carpenter <ca...@alias-i.com> wrote:
I have no idea how people typically do this.  There's no
good way to constrain a vector beta so that for arbitrary
vector $x$ we have $x' * beta > 0$ (or $x' * beta < 0$).

Actually this isn't too hard to do in Stan if one of those vectors is data.  Let's say $x$ is a fixed $N$-dimensional data vector.  Then use something like Gram-Schmidt to find a set of $N-1$ unit vectors $v_i$ which are orthogonal to $x$ (i.e., such that dot_product(x, v_i) = 0).  Then $\beta = \alpha x + \sum_i \gamma_i v_i$, where the $\gamma_i$ are unconstrained and $\alpha$ is constrained to be negative.  It's then simple to show that dot_product(x, $\beta$) will always be negative.


Thanks. I wonder if there's a way to extend this to a matrix of data $x$: we could do it for each column of the matrix, but it seems a little too strong to force every $\beta * x_i$ ($x_i$ is column $i$ of matrix $x$) to be negative instead of just their sum. But no need to worry about this now; I'd prefer using the log link anyway.


Could you please help with this other problem? I want to know whether I need to add a Jacobian adjustment in this case, because this model is taking more than 10 hours for just 100 iterations, with a very low n_eff.

If that's not the cause of the problem, then any hints on what might help?

parameters
{
  vector [ncol] beta;
  vector [ncol] gamma;
  real sigma;
}
transformed parameters
{
  vector [nZeros] eta0;
  vector [nZeros] zeta0;
  vector [nOnes] eta1;
  vector [nOnes] zeta1;
  eta0 <- bbgg0 * beta;
  zeta0 <- bbgg0 * gamma;
  eta1 <- bbgg1 * beta;
  zeta1 <- bbgg1 * gamma;
}
model {
  vector [nOnes] summands;
  vector [nOnes] alpha;

  0 ~ bernoulli_logit(zeta0);

  1 ~ bernoulli_logit(eta1);
  for(i in 1:nOnes) {
    //kernel of the gamma log density, written in a computationally efficient way
    //exp(eta1[i]) is the mean, sigma is the standard deviation
    alpha[i] <- exp(2*(eta1[i] - log(sigma)));
    summands[i] <- alpha[i]*(eta1[i] + log(yy1[i]) - 2*log(sigma) - exp(log(yy1[i]) - eta1[i])) - lgamma(alpha[i]);
  }

  increment_log_prob(sum(summands));

  beta ~ normal(0, 100);
  gamma ~ normal(0, 100);
  increment_log_prob(-log(sigma));
}






Though I don't think it's relevant to explain the model, here it is:

It's the same problem I had one year ago with zero-inflated and hurdle regressions (https://groups.google.com/d/topic/stan-users/l-s_7evFRbE/discussion)

Sorry to bring this up again, but almost a year has passed since that post and I still couldn't find a way to sample this posterior, not only in Stan, but also using adaptive MCMC methods like Haario's, DRAM, and Vihola's, programmed in C++.

I want to model a regression where the response variable has the hurdle distribution:

$\begin{array}{rlll} f(y; \theta, w) & = & (1 - w) & \textrm{, if } y = 0 \\ f(y; \theta, w) & = & w * g(y; \theta) / g(0; \theta) & \textrm{, if } y \neq 0 \end{array}$

Basically, it models whether the response is zero or not with a Bernoulli with probability $w$ (of being non-zero), and the non-zero values with a (zero-truncated) density $g(.)$, where $\theta$ is the vector of parameters of $g(.)$. In the model above, $g(.)$ is a gamma density.

Bob Carpenter

Feb 8, 2014, 12:04:12 PM
to stan-...@googlegroups.com


On 2/8/14, 4:52 PM, Marco Inacio wrote:
>
>
>
> On 14-02-07 03:09 PM, Marcus Brubaker wrote:
>> On Fri, Feb 7, 2014 at 10:12 AM, Bob Carpenter <ca...@alias-i.com <mailto:ca...@alias-i.com>> wrote:
>>
>> I have no idea how people typically do this. There's no
>> good way to constrain a vector beta so that for arbitrary
>> vector $x$ we have $x' * beta > 0$ (or $x' * beta < 0$).
>>
>>
>> Actually this isn't too hard to do in Stan if one of those vectors is data. Let's say $x$ is a fixed $N$-dimensional
>> data vector. Then use something like Gram-Schmidt to find a set of $N-1$ unit vectors $v_i$ which are orthogonal to
>> $x$ (i.e., such that dot_product(x, v_i) = 0). Then $\beta = \alpha x + \sum_i \gamma_i v_i$, where the $\gamma_i$ are
>> unconstrained and $\alpha$ is constrained to be negative. It's then simple to show that dot_product(x, $\beta$) will always be
>> negative.
>
>
> Thanks. I wonder if there's a way to extend this to a matrix of data $x$: we could do it for each column of the matrix,
> but it seems a little too strong to force every $\beta * x_i$ ($x_i$ is column $i$ of matrix $x$) to be negative instead of
> just their sum. But no need to worry about this now; I'd prefer using the log link anyway.
>
>
> Could you please help with this other problem? I want to know whether I need to add a Jacobian adjustment in this case, because this
> model is taking more than 10 hours for just 100 iterations, with a very low n_eff.

I don't understand the model, but here are some suggestions and
questions.

Given that log(sigma) shows up, you almost certainly want to be
defining sigma with a lower bound,

real<lower=0> sigma;

The constraints declared on variables need to match their support
in the model.

Add a Jacobian for what? You only need a Jacobian adjustment
for a change of variables.

I take it in increment_log_prob(sum(summands)) that
each summand is supposed to be a gamma distribution?

lgamma() is very expensive to evaluate, so if the loop is large,
this is going to be slow.

You want to define a local variable log_sigma =log(sigma)
rather than continually recomputing it.

Unless you want to inspect all those transformed parameters
in the output, you can just make them local variables in the
model.

exp(log(a) - b) is more efficiently coded as a * exp(-b)
as long as exp(-b) won't overflow or underflow.

exp(2 * (eta1[i] - log(sigma)))

= square(exp(eta1[i] - log(sigma)))

= square(exp(eta1[i]) / sigma)

= square(exp(eta1[i])) / square(sigma)

and now you can just save square(sigma) and reuse it
rather than recomputing it.  Same for exp(eta1[i]),
which gets reused in the summands.

You define eta0 and zeta1 but don't use them. This
is expensive in Stan because it causes unused derivative
evaluation and chain-rule propagations.

It won't make much difference to speed in this model, but you
don't need to build up summands or alpha as vectors.
You can make them locals in the loop and just increment_log_prob()
for each of what is now summands[i] --- increment_log_prob()
automatically vectorizes its sums internally. Just define the
terms you reuse (log(sigma) and square(sigma)) as locals outside
the loop.
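Putting those suggestions together, a sketch of what the model block might look like (same variable names and parameterization as the model posted above, with sigma declared as real<lower=0>; priors left as they were):

model {
  real log_sigma;
  real sq_sigma;
  log_sigma <- log(sigma);
  sq_sigma <- square(sigma);

  0 ~ bernoulli_logit(zeta0);
  1 ~ bernoulli_logit(eta1);

  for (i in 1:nOnes) {
    real e_eta;
    real alpha_i;
    e_eta <- exp(eta1[i]);
    alpha_i <- square(e_eta) / sq_sigma;
    // gamma log-density kernel with mean exp(eta1[i]) and sd sigma
    increment_log_prob(alpha_i * (eta1[i] + log(yy1[i]) - 2 * log_sigma
                                  - yy1[i] / e_eta)
                       - lgamma(alpha_i));
  }

  beta ~ normal(0, 100);
  gamma ~ normal(0, 100);
  increment_log_prob(-log(sigma));  // kept from the original model (see the question below)
}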

Why is there a final increment_log_prob(-log(sigma)); ?

Part of the computational problem may be overly vague priors:

beta ~ normal(0, 100);
gamma ~ normal(0, 100);

If you don't expect beta or gamma to be in the 100s, don't use
priors this wide.

In general, though, I'd suggest you get the model working before trying
to optimize too heavily. Then inspect what's going on with the samples
and see if the parameters are identified or if they need tighter priors.

- Bob

Marco Inacio

Feb 8, 2014, 2:50:53 PM
to stan-...@googlegroups.com

>
> Add a Jacobian for what? You only need a Jacobian adjustment
> for a change of variables.
Reading the doc, I understood that the log Jacobian is only needed when a
transformed variable is on the left-hand side of a sampling statement, like in:

log(y) ~ normal(mu,sigma);

What if it were:

increment_log_prob(normal_log(log(y), mu, sigma));

Or even:

increment_log_prob( *** the normal kernel written out here by hand in terms of log(y) *** );

Do these also need a log-Jacobian increment?

>
> I take it in increment_log_prob(sum(summands)) that
> each summand is supposed to be a gamma distribution?
yes.



>
> lgamma() is very expensive to evaluate, so if the loop is large,
> this is going to be slow.
it's a small loop, only 865 elements.

>
> You want to define a local variable log_sigma =log(sigma)
> rather than continually recomputing it.
>
> Unless you want to inspect all those transformed parameters
> in the output, you can just make them local variables in the
> model.
>
> exp(log(a) - b) is more efficiently coded as a * exp(-b)
> as long as exp(-b) won't overflow or underflow.
>
> exp(2*(eta1[i] - log(sigma)))
>
> = square(exp(eta1[i] - log(sigma)))
>
> = square(-sigma * exp(eta1[i])
>
> = square(-sigma) * square(exp(eta1[i]])
>
> and now you can just save square(-sigma) and reuse it
> rather than recomputing it. Same for exp(eta1[i])
> which gets reused in the summands.
>
> You define eta0 and zeta1 but don't use them. This
> is expensive in Stan because it causes unused derivative
> evaluation and chain-rule propagations.
>
Although I do recognize this will speed things up, I don't think it's
the source of the problem. I already tried this mixture model using other
distributions like Poisson and negative binomial (I tried to model the number
of claims instead of paid losses, using the same predictors as for the gamma
regression) and the same problem occurs.



> It won't make much difference to speed in this model, but you
> don't need to build up summands or alpha as vectors.
> You can make them locals in the loop and just increment_log_prob()
> for each of what is now summands[i] --- increment_log_prob()
> automatically vectorizes its sums internally. Just define the
> terms you reuse (log(sigma) and square(-sigma)) as locals outside
> the loop.
This is news to me; this wasn't true for lp__ <- lp__ + (.), right?


> Why is there a final increment_log_prob(-log(sigma)); ?
It was a temporary prior for the sigma parameter; I need to rethink it.


>
> Part of the computational problem may be overly vague priors:
>
> beta ~ normal(0, 100);
> gamma ~ normal(0, 100);
>
> If you don't expect beta or gamma to be in the 100s, don't use
> priors this wide.
The problem is that I don't have any prior knowledge about the parameters, so
a small sd might give too little weight to the data. Also, setting:

beta ~ normal(0, 1);
gamma ~ normal(0, 1);
sigma ~ normal(0, 1);

Doesn't seem to help performance.

Bob Carpenter

Feb 8, 2014, 3:42:24 PM
to stan-...@googlegroups.com


On 2/8/14, 8:50 PM, Marco Inacio wrote:
>
>>
>> Add a Jacobian for what? You only need a Jacobian adjustment
>> for a change of variables.
> Reading the doc, I understood that the log Jacobian is only needed when a transformed variable is on the left-hand side of a
> sampling statement, like in:
>
> log(y) ~ normal(mu,sigma);
>
> What if it were:
>
> increment_log_prob(normal_log(log(y), mu, sigma));
>
> Or even:
>
> increment_log_prob( *** the normal kernel written out here by hand in terms of log(y) *** );
>
> Do these also need a log-Jacobian increment?

Yes. I'll clarify that in the next version of the manual.
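For the log(y) example above, the adjustment is the log absolute derivative of the transform, -log(y); a sketch of the adjusted statements (assuming y > 0):

// sampling-statement form
log(y) ~ normal(mu, sigma);
increment_log_prob(-log(y));  // log Jacobian: log |d log(y) / dy| = -log(y)

// explicit log-density form needs the same adjustment
increment_log_prob(normal_log(log(y), mu, sigma));
increment_log_prob(-log(y));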

>> I take it in increment_log_prob(sum(summands)) that
>> each summand is supposed to be a gamma distribution?
> yes.
>
>
>
>>
>> lgamma() is very expensive to evaluate, so if the loop is large,
>> this is going to be slow.
> it's a small loop, only 865 elements.

If the data's that small, then it is curious that the model's
taking so long. I take it you've already checked that you're
using full optimization.


>> You want to define a local variable log_sigma =log(sigma)
>> rather than continually recomputing it.
>>
>> Unless you want to inspect all those transformed parameters
>> in the output, you can just make them local variables in the
>> model.
>>
>> exp(log(a) - b) is more efficiently coded as a * exp(-b)
>> as long as exp(-b) won't overflow or underflow.
>>
>> exp(2 * (eta1[i] - log(sigma)))
>>
>> = square(exp(eta1[i] - log(sigma)))
>>
>> = square(exp(eta1[i]) / sigma)
>>
>> = square(exp(eta1[i])) / square(sigma)
>>
>> and now you can just save square(sigma) and reuse it
>> rather than recomputing it.  Same for exp(eta1[i]),
>> which gets reused in the summands.
>>
>> You define eta0 and zeta1 but don't use them. This
>> is expensive in Stan because it causes unused derivative
>> evaluation and chain-rule propagations.
>>
> Although I do recognize this will speed things up, I don't think it's the source of the problem. I already tried this
> mixture model using other distributions like Poisson and negative binomial (I tried to model the number of claims instead of paid
> losses, using the same predictors as for the gamma regression) and the same problem occurs.

Right --- this will just make things go faster, not change
the sampling.

>
>
>> It won't make much difference to speed in this model, but you
>> don't need to build up summands or alpha as vectors.
>> You can make them locals in the loop and just increment_log_prob()
>> for each of what is now summands[i] --- increment_log_prob()
>> automatically vectorizes its sums internally. Just define the
>> terms you reuse (log(sigma) and square(sigma)) as locals outside
>> the loop.

> This is news to me; this wasn't true for lp__ <- lp__ + (.), right?

Correct. That was one of the advantages of switching to increment_log_prob().

>
>> Why is there a final increment_log_prob(-log(sigma)); ?
> It was a temporary prior for the sigma parameter; I need to rethink it.

I doubt this is the problem either unless sigma is getting very close
to 0. You do want to check that in your output, because very small scales
can cause problems for sampling. You can always bound it away from 0
using a lower-bound other than 0.
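For example, a sketch (the 0.01 is just an illustrative bound):

real<lower=0.01> sigma;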

>> Part of the computational problem may be overly vague priors:
>>
>> beta ~ normal(0, 100);
>> gamma ~ normal(0, 100);
>>
>> If you don't expect beta or gamma to be in the 100s, don't use
>> priors this wide.
> The problem is that I don't have any prior knowledge about the parameters, so a small sd might give too little weight to the data.
> Also, setting:
>
> beta ~ normal(0, 1);
> gamma ~ normal(0, 1);
> sigma ~ normal(0, 1);
>
> Doesn't seem to help performance.

Measured in n_eff/second? That might also be too small, which can
be a problem. If you think the parameters might be 100 or 200, then normal(0,100)
is appropriate.

- Bob

Marco Inacio

Feb 8, 2014, 6:26:50 PM
to stan-...@googlegroups.com

If the data's that small, then it is curious that the model's
taking so long.  I take it you've already checked that you're
using full optimization.

The data is not that small, it's about 30,000 rows; what happens is that only 865 rows have a non-zero response variable, and only these 865 are in the loop.

The zeros are captured by the Bernoulli probability $1-w$.
The non-zeros are captured by the Bernoulli probability $w$, multiplied by the gamma density.

In simplified form, this mixture model could be described as:

parameters {
  vector[ncol] beta_vector;   // coefficients for the gamma regression
  vector[ncol] celta_vector;  // coefficients for the Bernoulli regression
  ...
}
model {
  for (i in 1:NROW) {
    if (response_var[i] == 0) {
      0 ~ bernoulli_logit(covariates_matrix[i] * celta_vector);
    }
    else {
      1 ~ bernoulli_logit(covariates_matrix[i] * celta_vector);
      // instead of gamma we could have other things like normal, poisson, negbin, binomial, etc.
      response_var[i] ~ gamma_in_mean_sd_parametrization(covariates_matrix[i] * beta_vector, sigma);
    }
  }
}

$\begin{array}{rlll} f(y; \theta, w) & = & (1 - w) & \textrm{, if } y = 0 \\ f(y; \theta, w) & = & w * g(y; \theta) / g(0; \theta) & \textrm{, if } y \neq 0 \end{array}$


I now really think this problem is related to the geometry of the posterior, but how can I mitigate it?





I doubt this is the problem either unless sigma is getting very close
to 0.  You do want to check that in your output, because very small scales
can cause problems for sampling.  You can always bound it away from 0
using a lower-bound other than 0.
Trying real<lower=2> sigma; doesn't seem to increase speed.





Measured in n_eff/second?  That might also be too small, which can
be a problem.  If you think the parameters might be 100 or 200, then normal(0,100)
is appropriate.
It took about 10 hours to get 100 samples, with n_eff = 3 for some variables. Sadly, to test the full impact of these changes I'll have to wait those 10 hours again; judging by eye, the speed didn't increase. =[