Hi All,
I am in the middle of coding a Bayesian mixture model in STAN and get stuck during the process. I would very much like to seek for your help.
This is link to the paper (model was discussed in section 2 Methods): http://bioinformatics.oxfordjournals.org/content/30/15/2098.long
In summary, the paper proposed a folded normal-gamma mixture model for the z-scores of SNPs:
f(z_i) = pi_0 f_0(z_i) + pi_1 f_1(z_i),
where
1) f_0(z) = 2 phi(z) I_{z >=0}, where phi(z) is the normal density with mean 0 and standard deviation sigma_0, and I_{z>=0} is an indicator function that takes 1 when z >=0 and 0 otherwise. Additionally, if |z| < 0.68, that SNP is considered null.
2) f_1(z) ~ gamma( a(x), beta ); covariates x are used to modulate the shape parameter a(x) = exp( x’ * alpha ) with vector alpha following a normal distribution and the rate parameter beta is assumed to have a gamma distribution.
3) pi_0 + pi_1 = 1 and covariates x again is used to modulate the non-null portion via a logistic regression
pi_1(x) = Pr (delta_i = 1 | x ) = inv_logit( exp(x’ * gamma))
where deltas were generated with a Bernoulli full conditional distribution, for k in {0, 1}
Weakly informative priors to unknown parameters (alpha, beta, gamma, sigma_0^2) were:
alpha ~ Normal( 0, Sigma_alpha )
beta ~ Gamma( a_0, b_0)
gamma ~ Norma( 0, Sigma_gamma )
sigma_0^2 ~ InverseGamma( a_0, b_0 )
where the diagonal of both Sigma’s has variance 10000 and a_0 = b_0 = .001
The stan code I came up is shown below, as well as attached in the post:
functions {
real cal_theta(real xgamma, real a, real z, real mu, real beta, int ind, real sigmasq){
/*
Function cal_theta to calculate the parameter theta which will be
used to draw the latent indicator delta (0 null; 1 nonnull) following
Bernoulli distribution
*/
real log_P_delta[2];
real P_delta[2];
log_P_delta[1] <- xgamma;
log_P_delta[2] <- 0;
log_P_delta[1] <- log_P_delta[1] + ( (a-1)*log(fabs(z) - mu) -
beta*(fabs(z) - mu) ) + a*log(beta) - log(exp(lgamma(a)));
log_P_delta[2] <- log_P_delta[2] +
log(2) - 0.5*log(2*pi()*sigmasq) - (1/(2*sigmasq))*z^2;
P_delta[1] <- exp(log_P_delta[1]/log_sum_exp(log_P_delta));
P_delta[2] <- exp(log_P_delta[2]/log_sum_exp(log_P_delta));
if (ind == 1){
P_delta[1] <- 1/(1 + exp(log_P_delta[2] - log_P_delta[1]));
}
P_delta[2] <- 1/(1 + exp(log_P_delta[1] - log_P_delta[2]));
if (ind == 0){// nulls
P_delta[1] <- 0;
P_delta[2] <- 1;
}
return P_delta[2];
}
}
data {
int N; // number of SNPs
int K; // number of the covariances
matrix[N, K+1] X; // matrix of covariances, first column all ones (intercept)
real Z[N]; // z-scores (observations)
// location parameter for truncation
real<lower=0> mu; // contant cutoff = .68
}
transformed data {
// covariance matrix for alpha & beta
matrix[K+1, K+1] Sigma;
Sigma <- diag_matrix(rep_vector(10000, K+1));
}
parameters {
vector[K+1] alpha; // Normal: shape parameters of gamma in the mixture
vector[K+1] gamma; // Normal: fitting through logistic regression for pi
real beta; // gamma: rate parameters of gamma in the mixture
real<lower=0> sigmasq;// inverse gamma: squared scale parameter for the normal in the mixture
real<lower=0, upper=.5> pi1;
}
transformed parameters {
vector[N] a; // shape parameter
vector[N] xgamma; // dependence of pi1 on x
real sigma; // scale parameter
a <- exp(X*alpha);
xgamma <- X*gamma;
sigma <- sqrt(sigmasq);
}
model {
real ps[2]; // temp for log component densities
real theta[N]; // probability when delta is 1
int delta[N]; // latent indicators
alpha ~ multi_normal(rep_vector(0, K+1), Sigma);
gamma ~ multi_normal(rep_vector(0, K+1), Sigma);
beta ~ gamma(.001, .001);
sigmasq ~ inv_gamma(.001, .001);
// calculate theta, for drawing the latent indicators
for (n in 1:N){
if (fabs(Z[n]) <= mu) {
theta[n] <- cal_theta(xgamma[n], a[n], Z[n], mu, beta, 0, sigmasq);
} else {
theta[n] <- cal_theta(xgamma[n], a[n], Z[n], mu, beta, 1, sigmasq);
}
}
delta ~ bernoulli(theta);
// two group mixture model
for (n in 1:N){
if (fabs(Z[n]) <= mu )
// f0 folded normal
ps[1] <- log(2) + log1m(pi1) + normal_log(fabs(Z[n]), 0, sigma);
else
// f1 gamma
ps[2] <- delta[n]*(log(pi1) + gamma_log(fabs(Z[n] - mu), a[n], beta));
increment_log_prob(log_sum_exp(ps));
}
}
At first I just called “stan” and got several error messages suggesting that I should specify the initial values. Now I supplied a initial function “initf1” and call run.R (both the data and run script are attached)
load("simData.RData")
data <- list(N = length(Z),
K = ncol(X)-1,
X = X,
Z = Z,
mu = .68)
K <- ncol(X)-1; # number of covariates
initf1 <- function() {
list(alpha = rep(0, K+1),
gamma = c(logit(.05), rep(0, K)),
beta = 0.1,
sigmasq = 1,
pi1 = .05
)
}
# Fit the model with Stan
fit <- stan(#file = "/space/synwes/1/sxu/R/packages/pathLoo/example/gamma.stan",
file = "~/R/bayesFactor_pathway/selectPath/gamma.stan",
data = data, chains = 1, iter = 2000, warmup = 200, thin = 4,
init = initf1,
seed = 123)
Now in addition to the warning message before the run:
DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If so, you need to call increment_log_prob() with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
delta ~ bernoulli(...)
I’m also getting an error message aborting the run
SAMPLING FOR MODEL 'gamma' NOW (CHAIN 1).
[1] "Error : Rejecting initial value:"
[2] " Log probability evaluates to log(0), i.e. negative infinity."
[3] " Stan can't start sampling from this initial value."
error occurred during calling the sampler; sampling not done
I obtained these initial values from the software generously shared by the authors of the paper.
In addition, the simulation data X (10000 by 4) and Z (10000) are generated with parameters
alpha <- c(1.3, .03, .3, .4)
beta <- 2
gamma <- c(-2.94, 1, .5, -.5)
sigmaSq <- 1.2
Please let me know if there’s any additional information I could provide for troubleshooting. As a newbie (I started this last week), I didn't know how to debug the program, e.g., when it complaints about the log(0), which log is it calculating? How to locate it? Also, I wasn't quite sure about my way coding latent indicators in stan. I'd really appreciate any input.
Thank you very much in advance for taking a look at this long question! Any insight would be welcome!
Thanks!
Flora
--
You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/iAunwddMneA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.