On 5/27/13 7:08 PM, Luc Coffeng wrote:
> Dear Group-members, Stan developers,
>
> I am trying to fit a latent K-distribution model to data. In short, a K-distribution is the distribution of the product
> of two random, gamma-distributed variables. In my case, the latter two random variables have expectation 1.0 (i.e. each
> distribution has parameters alpha == beta), and for one of the random variables I also know the variance (alpha == beta
> == 3.5, so variance is 1/3.5). My objective is to estimate the unknown variance of the other (unobserved) variable,
> given (simulated) data.
>
> To make things a little more complicated, the K-distributed variable of interest is unobserved (latent), and I only know
> whether it is above or below some threshold value. This means that I have to define the likelihood of the data in terms
> of the CDF for the K-distribution. After some hours of skull-cracking thinking, I (think I) managed to come up with a
> formulation based on the CDF of the inverse gamma distribution. This approach is based on the nice feature of the gamma
> distribution that when c*X ~ gamma(alpha,beta), X ~ gamma(c*alpha,beta).
The equations are easy enough to set up. The density of the product is
    K(y | a1, b1, a2, b2) = INT gamma(u | a1, b1) * gamma(y/u | a2, b2) * (1/u) du,
where the 1/u is just the change-of-variables factor for the scaling --- there's
no extra Jacobian adjustment you're missing on the Stan side. But you need to
evaluate the integral, and I'm terrible at this. And then you need to evaluate
the integral for the CDF as well.
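For what it's worth, one route to the CDF is to condition on the first factor
rather than integrating the density: if Y = U * V with U ~ gamma(a1,b1) and
V ~ gamma(a2,b2), then
    Pr[Y <= t] = INT gamma(u | a1, b1) * gamma_cdf(t/u | a2, b2) du
(the notation is mine, just to sketch the idea; gamma_cdf is the ordinary gamma
CDF). It's still a one-dimensional numerical integral, but it gives the CDF
directly rather than the density.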
> The model compiles successfully. However, when I run the model, it goes haywire. Below I posted the model syntax for
> rstan (v1.3.0) which I used in R 3.0.0 on my MacBook Pro. Mind that because for each gamma-distribution alpha==beta,
> only one parameter (shape1 or shape2) is specified for each gamma distribution. Further down, I posted the error message
> :D. I suspect that I may be missing something Jacobian-related in the log-likelihood with regards to the first random
> gamma-distributed variable for which I know the variance (and which is used to scale the second gamma-distributed
> variable with unknown variance), but I'm not sure what it is. My brain is stuck there, and I'm not even sure whether
> this is the reason why rstan goes haywire (forgetting stuff in the log-likelihood function should just lead to wrong
> parameter estimates, not a model crashing).
Crashing is a bug and we're looking into fixing that
for the next release --- we need to initialize local
variables.
> thres <- 2;
> sim_data <- rgamma(N,1/var1,1/var1) * rgamma(N,1/var2,1/var2) > thres;
>
> DATA <- list(
>   N       = N,
>   outcome = as.numeric(sim_data),
>   shape1  = 1/var1,
>   thres   = thres
> );
>
> # Model specification in the Stan language
> CODE <- '
> data {
>   int<lower=1> N;                   // Number of observations;
>   int<lower=0,upper=1> outcome[N];  // Outcome (0 = no, 1 = yes);
>   real<lower=0> shape1;             // Shape parameter of the first gamma distribution (with mean one);
>   real<lower=0> thres;              // Disease threshold;
> }
> parameters {
>   real<lower=0> inv.sqrt.shape2;    // Standard deviation of the second gamma distribution (with mean one);
> }
> transformed parameters {
>   real<lower=0> shape2;             // Shape of the second gamma distribution (with mean one);
>   shape2 <- pow(inv.sqrt.shape2,-2);
As Stan's written now, it's more efficient to write
    shape2 <- 1 / (inv.sqrt.shape2 * inv.sqrt.shape2);
We'll eventually rewrite pow() so that it recognizes the -2 exponent as a
special case.
> }
> model {
>   real rscale[N];
>
>   for (i in 1:N) {
>     rscale[i] ~ gamma(shape2,shape2);
>     lp__ <- lp__ + log( (1-outcome[i]) * inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1)
>                         + outcome[i] * (1-inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1)) );
>   }
> }
The real conceptual problem for Stan here is that ~ is not an assignment
operator, so you can't take an undefined local variable rscale[i] and use ~ on
it. The way Stan works,
    y ~ foo(theta);
translates to the same thing as
    lp__ <- lp__ + foo_log(y, theta);
So you need to define rscale as a parameter.
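Applied to the statement in your model, writing
    rscale[i] ~ gamma(shape2, shape2);
is just shorthand for
    lp__ <- lp__ + gamma_log(rscale[i], shape2, shape2);
so it only reads the (never assigned) value of rscale[i] --- it doesn't
generate one.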
Defining rscale as a parameter is not so easy here because of the constraint:
Stan only allows homogeneous constraints on parameters. So what you need to do
is break the latent rscale values into an array bounded above by the cutoff and
an array bounded below by it, according to the outcome data. Say their counts
are N1 and N2; then you'd have:
    real<lower=0,upper=1/thres> rscale_upper[N1];
    real<lower=1/thres> rscale_lower[N2];
Then just sample in two separate loops, without the (1 - outcome[i]) and
outcome[i] indicators. The counts N1 and N2, and the split of the observations
into the rscale_upper and rscale_lower groups, can all be worked out inside
Stan (in the transformed data block).
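As a rough, untested sketch of how those pieces fit together (the data block,
the inv.sqrt.shape2 parameter, and the shape2 transform stay as in your
program; which outcome group goes with which bound depends on how the threshold
enters your CDF, so check that mapping):

    transformed data {
      int N1;                          // observations with outcome == 0
      int N2;                          // observations with outcome == 1
      N1 <- 0;
      N2 <- 0;
      for (i in 1:N) {
        if (outcome[i] == 0)
          N1 <- N1 + 1;
        else
          N2 <- N2 + 1;
      }
    }
    parameters {
      // ... inv.sqrt.shape2 as before ...
      real<lower=0,upper=1/thres> rscale_upper[N1];   // latent values bounded above by 1/thres
      real<lower=1/thres> rscale_lower[N2];           // latent values bounded below by 1/thres
    }
    model {
      for (i in 1:N1)
        rscale_upper[i] ~ gamma(shape2,shape2);       // no (1 - outcome[i]) indicator
      for (i in 1:N2)
        rscale_lower[i] ~ gamma(shape2,shape2);       // no outcome[i] indicator
    }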
Coding this the way you did in Stan is very inefficient, because both CDF calls
get evaluated for every observation even though one of them is multiplied by
zero. Instead of:
>     lp__ <- lp__ + log( (1-outcome[i]) * inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1)
>                         + outcome[i] * (1-inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1)) );
you would have wanted (if you didn't have to break rscale in two
because of the constraints):
    if (outcome[i])
      lp__ <- lp__ + log(1 - inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1));
    else
      lp__ <- lp__ + log(inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1));
For numerical stability, you also want log1m(z) instead of log(1 - z)
everywhere.
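With that, the branch above becomes:
    if (outcome[i])
      lp__ <- lp__ + log1m(inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1));
    else
      lp__ <- lp__ + log(inv_gamma_cdf(1/thres,rscale[i]*shape1,shape1));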
- Bob