How to reduce number of divergent transitions? Re-parameterizing a truncated distribution.


Qian Zhang

unread,
Aug 17, 2016, 6:43:24 PM
to stan-...@googlegroups.com
Thanks for considering my questions. 

I am trying to sample from the attached model: Post_to_Stan_Model.pdf, re-typed here:

L-long vector of pi_i's ~ U(0,1), i in {1, ..., L}
J-long vector of C_jj's ~ U(0,1), j in {1, ..., J}
L x J matrix of Alpha[i,j]'s ~ N(mu = pi_i, sigma = sqrt(C_jj * pi_i * (1 - pi_i))), truncated with atoms at 0 and 1

Data: L x J matrix of X[i,j]'s ~ U(0,1), setting the likelihood to 1.
 
I have set the likelihood to 1 because I'm trying to use Stan to sample from the prior. This way I can use R to generate Monte Carlo draws from the prior by drawing down the hierarchical model graph, and compare those Monte Carlo draws to Stan's draws, to check that I am getting Stan to draw from the intended target distribution. I'll add the likelihood back in later.
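For reference, here is a minimal Python sketch of that Monte Carlo check (my own illustration, not the R code attached here), reading "truncated with atoms at 0 and 1" as clipping normal draws to [0, 1] so the out-of-range mass piles up at the endpoints:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_prior(L, J, n_draws):
    """Monte Carlo draws down the hierarchical graph:
    pi_i ~ U(0,1), C_jj ~ U(0,1), then Alpha[i,j] is a normal
    clipped to [0,1], so out-of-range mass piles up at the atoms."""
    pis = rng.uniform(size=(n_draws, L))
    Cs = rng.uniform(size=(n_draws, J))
    mu = pis[:, :, None]                                # (n_draws, L, 1)
    sigma = np.sqrt(Cs[:, None, :] * mu * (1 - mu))     # (n_draws, L, J)
    alpha = np.clip(rng.normal(mu, sigma), 0.0, 1.0)
    return pis, Cs, alpha

pis, Cs, alpha = draw_prior(L=3, J=2, n_draws=50_000)
print(Cs.mean(axis=0))   # each C_jj mean lands near 0.5
print(alpha.mean())      # the prior is symmetric around 0.5, so this is too
```

This gives the "right answers" (e.g. E(C_jj) = 0.5) that the Stan draws can be compared against.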

The problem is that I'm getting divergent transitions (Fit 12 below); Fit 12 had 1611 of them.

From reading Chapter 5 of the Handbook of MCMC, parts of this thesis (http://www.cs.toronto.edu/~radford/ftp/kiam-thesis.pdf), and section 60.4 of the Stan manual (stan-reference-2.11.0.pdf), I understand that divergent transitions are problematic because they indicate the leapfrog integrator is not approximating the Hamiltonian dynamics well: the Hamiltonian H should be conserved over time, but the leapfrog steps let H change excessively (i.e. "diverge"), which may affect whether the Metropolis proposal built from the leapfrog trajectory yields a Markov chain with the desired invariant distribution. In Fit 12, the posterior means for C_11 and C_22 are 0.51 with a standard error of the mean < 0.005, when both should have posterior mean 0.5 according to the prior. I suspect the divergent transitions are causing the posterior means to be off.

I tried increasing adapt_delta, decreasing stepsize, and increasing max_treedepth, which unexpectedly increased the number of divergent transitions relative to Fit 12 (Fit 13 below): Fit 13 had 1738 divergent transitions. A side question: how can I get Stan's warnings (e.g. about divergent transitions) to print when I call an R script from the command line? The number of divergent transitions only shows up in an interactive R session.

I then tried reparameterizing, but ran into a syntax error (Fit 8 below). 

How should I change Fit 8's try6.stan file to avoid the syntax error, so that I can re-parameterize the model and reduce the number of divergent transitions? Otherwise, could you suggest any ways besides the ones I've tried to reduce the number of divergent transitions?

Fit 12


R command for Fit 12:


# Args to call_rstan.R: datafile, L, STAN_modelfile, index of fit.
# Optionally: adapt_delta, stepsize, max_treedepth.
Rscript call_rstan.R sim_data.RData 3 try8.stan 12


Pairs plot is Pairs_fit_12.png and shows many red divergent transitions.


The script call_rstan.R calls stan with this code:     


fit <- stan(file=STAN_modelfile,
            data=list(J=J, L=L, X=X, N=N),  # named list, as rstan requires
            iter=11000, chains=4, seed=1)


where J = 2, L = 3, X and N are both L x J matrices, coming from sim_data.RData.


Fit 13: The change from Fit 12 is that I increased adapt_delta to 0.99, reduced stepsize to 0.01, and increased max_treedepth to 15, as suggested in a Stan users list thread.


R command for Fit 13:


# Args to call_rstan.R: datafile, L, STAN_modelfile, index of fit.
# Optionally: adapt_delta, stepsize, max_treedepth.
Rscript call_rstan.R sim_data.RData 3 try8.stan 13 0.99 0.01 15


The script call_rstan.R calls stan with this code:     


fit <- stan(file=STAN_modelfile,
            data=list(J=J, L=L, X=X, N=N),  # named list, as rstan requires
            iter=11000, chains=4, seed=1,
            control=list(adapt_delta=adapt_delta,
                         stepsize=stepsize,
                         max_treedepth=max_treedepth))


Pairs plot is Pairs_fit_13.png and shows many red divergent transitions.


Fit 8: Here I try to re-parameterize Alpha[i,j], because both pairs plots (for Fits 12 and 13) show cigar shapes (i.e. contours with narrow width) for the Alpha[i,j] parameters. I tried following the funnel example in section 22.2 of the Stan manual (stan-reference-2.11.0.pdf) to get Stan to sample Alpha_raw[i,j] ~ N(0,1), but ran into a syntax error.


R command for Fit 8:


# Args to call_rstan.R: datafile, L, STAN_modelfile, index of fit.
# Optionally: adapt_delta, stepsize, max_treedepth.
Rscript call_rstan.R sim_data.RData 3 try6.stan 8 0.99 0.01 15


Syntax error:


 35:       for (i in 1:L) {
 36:           for (j in 1:J) {
 37:               Alpha[i,j] = pi_vec[i] + sqrt(C_vec[j]*(pi_vec[i])*(1 - pi_vec[i])) * Alpha_raw[i,j] T[0,1];
                                                                                                        ^
 38:           }

PARSER EXPECTED: ";"

Error in stanc(file = file, model_code = model_code, model_name = model_name,  :
  failed to parse Stan model 'try6' due to the above error.
Calls: stan -> stan_model -> stanc
Execution halted


If you'd rather see try6.stan in this textbox:

data {
   int<lower=0> J;
   int<lower=0> L;

   // Declare arrays with integer entries.
   int X[L,J];
   int N[L,J];
}

parameters {
   // Add parameters:
   // - pi_vec = [pi_1, ..., pi_L]
   // - C_vec = [C_11, ..., C_JJ]
   vector<lower=0,upper=1>[L] pi_vec;
   vector<lower=0,upper=1>[J] C_vec;

   // Re-parameterize
   // a la funnel example 22.2 of stan-reference-2.11.0.pdf
   matrix<lower=0,upper=1>[L,J] Alpha_raw;
}

transformed parameters {
   // The parameters we are actually interested in,
   // a la funnel example 22.2 of stan-reference-2.11.0.pdf
   matrix<lower=0,upper=1>[L,J] Alpha;

   for (i in 1:L) {
       for (j in 1:J) {
           Alpha[i,j] = pi_vec[i] + sqrt(C_vec[j]*(pi_vec[i])*(1 - pi_vec[i])) * Alpha_raw[i,j] T[0,1];
       }
   }
}

model {
    for (i in 1:L) {
        pi_vec[i] ~ uniform(0,1);
    }

    for (j in 1:J) {
        C_vec[j] ~ uniform(0,1);
    }

    for (i in 1:L) {
        for (j in 1:J) {
            // implies Alpha[i,j] ~ normal(pi_vec[i], sqrt(C_vec[j]*(pi_vec[i])*(1 - pi_vec[i]))) T[0,1];
            Alpha_raw[i,j] ~ normal(0,1);

            // Alpha[i,j] ~ normal(pi_vec[i], sqrt(C_vec[j]*(pi_vec[i])*(1 - pi_vec[i]))) T[0,1];

            // For the (Like = 1) test, have X[i,j] ~ U(0,1),
            // i.e. set the likelihood's density to 1 so that posterior density = prior density.
            X[i,j] ~ uniform(0,1);
        }
    }
}


call_rstan.R
Pairs_fit_12.png
Pairs_fit_13.png
Post_to_Stan_Model.pdf
sim_data.RData
try6.stan
try8.stan

Bob Carpenter

unread,
Aug 17, 2016, 7:07:19 PM
to stan-...@googlegroups.com
I'm very confused by your intentions here. Why would you use a normal
truncated at (0, 1) rather than, say, a beta distribution? When
you say "atom", you just mean constants?

I also don't understand your graphical notation. Can you write
out the joint density p(param1, ..., paramM, data1, ..., dataN)?
Usually that's of the form of a prior p(param1, ..., paramM)
times a likelihood p(data1, ..., dataN | param1, ..., paramM).
As soon as we understand that, we can help, but it's hard to
infer from a pile of models with questions scattered throughout.

If you only want to sample from the prior, you just need to
code p(param1, ..., paramM) rather than trying to hack around
a likelihood.

As to the Stan syntax error, you can only truncate sampling notation,
not arbitrary statements.

I have no idea about the R questions, but you may want to post those
in a separate message to get the R devs' attention.

- Bob




Qian Zhang

unread,
Aug 17, 2016, 8:12:52 PM
to Stan users mailing list
I am trying to implement somebody else's model and they want a truncated normal. 

With the likelihood P(data1, ..., dataN | param1, ..., paramM) set to 1,

P(param1, ..., paramM, data1, ..., dataN)
    = P(data1, ..., dataN | param1, ..., paramM) * P(param1, ..., paramM)
    = P(param1, ..., paramM)
    = \prod_{i=1}^{L} \prod_{j=1}^{J} P(Alpha[i,j] | C_jj, pi_i),

where (Alpha[i,j] | C_jj, pi_i) is distributed like a normal random variable with mean pi_i and variance C_jj * pi_i * (1 - pi_i) before truncation, except that all the mass below 0 is moved to an atom at 0 and all the mass above 1 is moved to an atom at 1.

More detail is given on page 2 of the attached pdf under "Factoring the Posterior." 

I just found your thread on reparameterizing when you have a parameter of interest that is truncated. I'll read that while I wait on your response.
Post_to_Stan_Model_v2.pdf

Bob Carpenter

unread,
Aug 17, 2016, 8:26:13 PM
to stan-...@googlegroups.com
As much as I'd like to help, I can't follow any of that
because the notation is completely unfamiliar and it's a
massive definition.

I think the assumptions about dtnorm in there are wrong (or
the package implementing dtnorm is broken), because
truncation does change means and variances. It also sounds
like you're defining censoring, not truncation, which results
in a mixed density/mass function, but as I said, this is very
hard for me to follow.

If you could ask a simpler question, or actually define
a density in standard statistical notation, we might be
able to help.

Or maybe someone else might be able to help.

Our general advice is to start with a small model that works
and gradually add features to it. Starting with these massive
definitions is a recipe for failure, even for us.

- Bob

Qian Zhang

unread,
Aug 17, 2016, 10:34:08 PM
to stan-...@googlegroups.com
What msm means by the truncated normal density is given in blue on page 4 of the attached pdf. It is also the same as the pdf given here: https://en.wikipedia.org/wiki/Truncated_normal_distribution.

Basically, if X ~ Truncated Normal(mu, sigma^2, a, b), then the pdf of X is f_X(x) = f_N(mu,sigma^2)(x) / [F_N(mu,sigma^2)(b) - F_N(mu,sigma^2)(a)], where f_N(mu,sigma^2) and F_N(mu,sigma^2) are the pdf and cdf of a normal random variable with mean mu and variance sigma^2. 

The parameters (Alpha[i,j] | C_jj, pi_i) are distributed truncated normal with mu = pi_i, sigma = sqrt(C_jj * pi_i * (1 - pi_i)), lower bound a = 0, and upper bound b = 1.
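As a sanity check on that density formula (my own sketch, not part of this thread), scipy's truncnorm implements exactly this ratio, parameterized by the standardized bounds (a - mu)/sigma and (b - mu)/sigma:

```python
from scipy.stats import norm, truncnorm

# Hypothetical values: mu = pi_i = 0.5, sigma chosen for illustration
mu, sigma, a, b = 0.5, 0.3, 0.0, 1.0
x = 0.7

# The pdf written out by hand: f_N(x) / (F_N(b) - F_N(a))
by_hand = norm.pdf(x, mu, sigma) / (norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))

# scipy parameterizes the same density by standardized bounds
by_scipy = truncnorm.pdf(x, (a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)

print(by_hand, by_scipy)   # the two values agree
```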

Basically, I want to know how to perform Fit 12 of my original post (i.e. sample from my posterior, which is a product of these (Alpha[i,j] | Cjj, pi_i) densities) without getting divergent transitions.

Divergent transitions concern me because I know they're an indication that the leapfrog algorithm isn't approximating Hamiltonian dynamics "well" in some sense, and when I print the object coming out of stan() in Fit 12, the posterior means for C11 and C22 appear a bit off:

             mean se_mean   sd   2.5%    25%    50%    75% 97.5% n_eff Rhat
pi_vec[1]    0.50    0.00 0.28   0.03   0.26   0.50   0.74  0.97  7842    1
pi_vec[2]    0.50    0.00 0.29   0.03   0.25   0.50   0.74  0.97  7796    1
pi_vec[3]    0.50    0.00 0.28   0.03   0.26   0.50   0.74  0.97  8435    1
C_vec[1]     0.51    0.00 0.28   0.03   0.26   0.51   0.75  0.97  7722    1
C_vec[2]     0.51    0.00 0.28   0.04   0.26   0.51   0.74  0.97  7504    1
Alpha[1,1]   0.50    0.00 0.29   0.02   0.25   0.50   0.75  0.97  8189    1
Alpha[1,2]   0.50    0.00 0.29   0.02   0.24   0.50   0.75  0.97  8677    1
Alpha[2,1]   0.50    0.00 0.29   0.03   0.25   0.50   0.76  0.97  8343    1
Alpha[2,2]   0.49    0.00 0.29   0.02   0.24   0.49   0.75  0.97  8329    1
Alpha[3,1]   0.50    0.00 0.29   0.02   0.25   0.50   0.76  0.97  9386    1
Alpha[3,2]   0.50    0.00 0.29   0.03   0.25   0.50   0.75  0.97  8917    1
lp__       -13.48    0.04 2.71 -19.54 -15.09 -13.16 -11.55 -9.03  4513    1


The posterior here equals the prior, and under the prior C_vec[1] (i.e. C_11) ~ U(0,1) and C_vec[2] (i.e. C_22) ~ U(0,1), so E(C_vec[1]) = E(C_vec[2]) = 0.5, but Stan returns posterior means of 0.51 for both. Maybe that's close enough to 0.5, but the interval 0.51 -/+ 2*se_mean is narrower than 0.51 -/+ 2*(0.005) = (0.50, 0.52), so it does not quite cover 0.5.
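A back-of-the-envelope version of that check (my own, using the n_eff reported in the table above): for a U(0,1) marginal the Monte Carlo standard error should be about sd/sqrt(n_eff) = (1/sqrt(12))/sqrt(n_eff), which puts the reported 0.51 roughly three standard errors from 0.5:

```python
import numpy as np

# Under the prior, C_jj ~ U(0,1): mean 1/2, sd 1/sqrt(12)
sd = 1 / np.sqrt(12)

# n_eff reported for C_vec[1] in the summary table
n_eff = 7722
se_mean = sd / np.sqrt(n_eff)

print(se_mean)                    # about 0.0033
print((0.51 - 0.50) / se_mean)    # about 3 standard errors from 0.5
```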

Also, I don't care about samples from the prior per se. My only interest in the prior is to check that I am getting Stan to sample correctly: when I sample from the prior I have a "right answer" (I know E(C_jj) = 0.5) to compare a posterior mean against, whereas once the likelihood is no longer set to 1 the posterior differs from the prior and I have no right answer to compare to (which is why I'm doing MCMC in the first place).

Based on your thread (https://groups.google.com/forum/#!searchin/stan-users/reparameterizing$20truncated$20distributions|sort:relevance/stan-users/LsO6DAk0w58/IaN-itZhZK0J) I have an idea for how to do this Matt Trick for my truncated normal on the Alpha[i,j]'s, but I need to specify different bounds on different elements of the matrix Alpha. I will think about this tomorrow.
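For what it's worth, one way to handle per-element bounds (a sketch under my own assumptions, not code from that thread) is the inverse-CDF transform: sample u in (0,1) per element and map it through the normal CDF restricted to [0, 1], which accommodates a different mu and sigma for every Alpha[i,j]:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def truncnorm_from_uniform(u, mu, sigma, lo=0.0, hi=1.0):
    """Map u ~ U(0,1) elementwise to a truncated N(mu, sigma^2) on
    [lo, hi] via the inverse-CDF transform; mu and sigma may differ
    for every element of the matrix."""
    p_lo = norm.cdf(lo, mu, sigma)
    p_hi = norm.cdf(hi, mu, sigma)
    return norm.ppf(p_lo + u * (p_hi - p_lo), mu, sigma)

# Hypothetical L = 3, J = 2 setup: pi_i along rows, C_jj along columns
pi_vec = np.array([0.2, 0.5, 0.8])
C_vec = np.array([0.3, 0.7])
mu = pi_vec[:, None] * np.ones_like(C_vec)                           # (3, 2)
sigma = np.sqrt(C_vec[None, :] * (pi_vec * (1 - pi_vec))[:, None])   # (3, 2)

u = rng.uniform(size=(3, 2))
alpha = truncnorm_from_uniform(u, mu, sigma)
print(alpha.shape)   # every entry strictly inside (0, 1)
```

In Stan, I believe the analogous move is to declare Alpha_raw with `<lower=0,upper=1>` and build Alpha in transformed parameters using Phi() and inv_Phi(), but the exact code should be verified against the manual.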
Post_to_Stan_Model_v3.pdf

Bob Carpenter

unread,
Aug 18, 2016, 6:37:12 AM
to stan-...@googlegroups.com



> On Aug 18, 2016, at 4:34 AM, Qian Zhang <qian.s...@gmail.com> wrote:
>
> What msm means by the truncated normal density is given in blue on page 4 of the attached pdf. It is also the same as the pdf given here: https://en.wikipedia.org/wiki/Truncated_normal_distribution.

That's the same distribution as you'd get from Stan. It
does *not* put all the mass for observations outside of
the bounds at the bounds.

> The parameters (Alpha[i,j] | Cjj, pi_i) are distributed truncated normal(mu = pi_i, sigma = sqrt(Cjj * pi_i * (1 - pi_i)), lower bound a = 0, upper bound b = 1).

This doesn't quite parse, but you probably just mean that Alpha[i, j]
has the truncated normal distribution given C[j, j], pi[i],
a, and b? That can be coded directly in Stan using truncation, but
if the Alpha[i, j] are near the 0, 1 boundaries, it's going to
be very unstable numerically.

> Basically, I want to know how to perform Fit 12 of my original post (i.e. sample from my posterior, which is a product of these (Alpha[i,j] | Cjj, pi_i) densities) without getting divergent transitions.

I didn't understand the model, so it's hard to suggest
fixes. Assuming it's coded correctly, that usually requires
a reparameterization if simply reducing the step size doesn't
help. Some of these models have nasty curvatures that Euclidean
HMC has a hard time with.

To test if it's coded correctly, you want to simulate data
from the model and see if you can recover the parameters.

- Bob
