Transdimensional MCMC for model comparison in Stan


Fil Krynicki

Feb 18, 2014, 2:39:26 PM
to stan-...@googlegroups.com
Hi again,

I should preface this by saying most of what I know of Bayesian modelling is coming from the Kruschke book (http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/), and I'm hopeful the terminology is consistent here.

In chapter 10 of the book, Kruschke describes doing model comparison with Transdimensional MCMC. You have two models with different parameterizations, and want to compare them. The gist of it is you have a model of the following form (translated by me from JAGS into Stan form):

data {
    int N;
    real y[N];
    real x[N];
}
parameters {
    int<lower=1, upper=2> modelIndex;
    vector[2] model_probs;
    ... m1_params ...
    ... m2_params ...
}
model {
    modelIndex ~ categorical(model_probs);
    if (modelIndex == 1) {
        y ~ some_distribution(m1_params);
    } else {
        y ~ some_other_distribution(m2_params);
    }
}

Afterwards, you compare the distribution over modelIndex to quantify your belief in one model versus the other.

Now the obvious problem is that Stan doesn't allow integer parameters. As a result, to do this kind of comparison, I'm looking at translating my models to JAGS and having to tread into the land of R (I'm a PyStan user primarily). Before I take on this painful process, I was wondering - is there any way around this problem to successfully do transdimensional MCMC in Stan? How is model comparison normally done in Stan? 

Fil Krynicki

Feb 18, 2014, 2:41:56 PM
to stan-...@googlegroups.com
I should note that I have seen the post here: https://groups.google.com/forum/#!searchin/stan-users/transdimensional/stan-users/3i6v35TfRYk/OBzzdcuBHQ0J, the solution to which involves a marginalization that is model-specific. I'm not able to make the leap from that to more general solutions (if there is one).

Andrew Gelman

Feb 18, 2014, 3:00:00 PM
to stan-...@googlegroups.com
Hi, I generally think this is a bad idea.  See sections 7.4-7.5 of BDA3 for further discussion.
A

--
You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Bob Carpenter

Feb 18, 2014, 3:03:58 PM
to stan-...@googlegroups.com
We don’t do a lot of model comparison in Stan.

I think “transdimensional MCMC” must be Kruschke’s rather non-standard term
for what’s essentially a mixture model where you’re going to look at the mixture
weighting. What I’m guessing Andrew would tell you to do is just make predictions
with the mixture model.

You can code it in Stan for arbitrary models as follows:

In your model, define two local variables lp1 and lp2 for the log probabilities
of each model. Then increment those variables with the log probs of the models.
Then use a mixture model (there’s a chapter in the Stan manual).

In your simple case:

> model {
> modelIndex ~ categorical(model_probs);
>
> if (modelIndex == 1) {
> y ~ some_distribution(m1_params)
> } else {
> y ~ some_other_distribution(m2_params)
> }
> }

it’d look like this (in more traditional stats/CS notation):

parameters {
    real<lower=0, upper=1> lambda;
}
model {
    real lp1;
    real lp2;
    lp1 <- foo_log(y, theta1);
    lp2 <- bar_log(y, theta2);
    increment_log_prob(log_sum_exp(log(lambda) + lp1,
                                   log1m(lambda) + lp2));
}

The key is that you’re setting up a mixture model with mixing ratio lambda:

p(y | theta1, theta2, lambda) = lambda * foo(y | theta1) + (1 - lambda) * bar(y | theta2).

The rest is just algebra, keeping in mind that we’re working on the log scale and

log_sum_exp(a,b) = log(exp(a) + exp(b))

and

log1m(c) = log(1 - c)

Those forms are much more arithmetically stable.
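As a quick illustration of why those forms matter (plain Python rather than Stan, since these are generic identities; the values are made up):

```python
import math

def log_sum_exp(a, b):
    # Stable log(exp(a) + exp(b)): pull out the max so neither
    # exponential can overflow or underflow to 0.
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log1m(c):
    # Stable log(1 - c), using log1p to avoid cancellation for tiny c.
    return math.log1p(-c)

# With log probabilities around -1000, the naive
# math.log(math.exp(a) + math.exp(b)) underflows to log(0) = -inf,
# while the stable form is fine:
a, b = -1000.0, -1001.0
print(log_sum_exp(a, b))  # about -999.6867
```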

This should be much more efficient than Gibbs sampling lambda, because of the
Rao-Blackwell theorem (which, to roughly summarize, says you’re better off using expectations
than sampling).
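A toy numerical sketch of that point (plain Python with made-up mixture values, not anything from the Stan program): estimating the mean of a two-component mixture by sampling the discrete index carries Monte Carlo variance, while taking the expectation over the index analytically has none.

```python
import random
import statistics

random.seed(0)
lam = 0.3        # mixing weight for component 1 (toy value)
mu = (0.0, 5.0)  # component means (toy values)

def sampled_estimate(n):
    # "Gibbs-style": draw the discrete component index each time,
    # then average the selected component means.
    return statistics.fmean(mu[0] if random.random() < lam else mu[1]
                            for _ in range(n))

# Rao-Blackwellized version: marginalize the index analytically.
rb_estimate = lam * mu[0] + (1 - lam) * mu[1]  # exact, zero Monte Carlo variance

# The sampled estimator scatters around the exact answer across repeated runs:
trials = [sampled_estimate(100) for _ in range(200)]
print(statistics.pvariance(trials), rb_estimate)
```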

If your models are more complicated, you need to set lp1 and lp2 to
their respective log probs. It's probably easiest to initialize lp1 and lp2 to 0
and then increment them for each sampling statement.

If you have more than 2 models, you have to declare lambda as a simplex[K],
declare lp as a vector[K], increment lp as usual for each model,
then use the vector form of log_sum_exp, which will just be:

increment_log_prob(log_sum_exp(log(lambda) + lp));

because log() is vectorized and + is just vector arithmetic. In fact, it might be
easier to follow if you do it this way even for two variables:

parameters {
    simplex[2] lambda;
}
model {
    vector[2] lp;
    lp[1] <- foo_log(y, theta1);
    lp[2] <- bar_log(y, theta2);
    increment_log_prob(log_sum_exp(log(lambda) + lp));
}

I should put this in the manual!
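For readers following along in PyStan, the K-component computation can be mimicked in plain Python to check the algebra (the weights and log densities here are hypothetical):

```python
import math

def log_mix(lam, lp):
    # log( sum_k lam[k] * exp(lp[k]) ), computed on the log scale.
    # This is log_sum_exp(log(lambda) + lp) in the Stan notation above.
    terms = [math.log(w) + l for w, l in zip(lam, lp)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

lam = [0.2, 0.3, 0.5]       # simplex of mixing weights (made up)
lp = [-10.0, -12.0, -11.0]  # per-component log likelihoods (made up)

# At these small magnitudes the direct (less stable) computation agrees:
direct = math.log(sum(w * math.exp(l) for w, l in zip(lam, lp)))
print(log_mix(lam, lp), direct)
```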

- Bob







Andrew Gelman

Feb 18, 2014, 3:38:34 PM
to stan-...@googlegroups.com
No, the issue is that Andrew doesn’t like this kind of mixture model. He prefers continuous model expansion rather than discrete model averaging.

David J. Harris

Feb 18, 2014, 6:51:39 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Bob: My understanding is that "transdimensional MCMC" is actually a standard term with a fairly active research community, and that the associated mixture models can be intractable without specialized methods like "reversible jump" MCMC.  It seems to be pretty useful for clustering-type problems, e.g. detecting change-points when the number of different states is unknown.

In case you're interested, there's a nice-looking (but somewhat old) review of the field here: http://www.isds.duke.edu/~scs/Courses/Stat376/Papers/TransdimMCMC/SissonJASA2005.pdf  I haven't thought about these methods in about 5 years, but my recollection is that reversible jump MCMC isn't particularly hard to implement (although it requires a lot of tuning and very long runs to make it actually work). 

Dave

Michael Betancourt

Feb 18, 2014, 7:12:22 PM
to stan-...@googlegroups.com
Long story short, explicit transdimensional algorithms won't be in Stan
for a long time.  The problem with trying to build up an explicit mixture
model is in building the mixture probabilities correctly.

Bob Carpenter

Feb 19, 2014, 10:16:19 AM
to stan-...@googlegroups.com
Thanks for the clarification. I’m familiar with reversible jump MCMC, but didn’t realize
they called it “trans-dimensional”. But as Michael said, we have no plans to implement
it in Stan.

It’s not needed in this particular case, which is just a traditional mixture model.

- Bob

Fil Krynicki

Feb 19, 2014, 10:22:55 AM
to stan-...@googlegroups.com
Thanks for all the information. I have a lot to think about and build on.

Filip Krynicki

Mar 5, 2014, 11:32:31 AM
to stan-...@googlegroups.com
Is it normal in a mixture model of the type Bob described for one of the models not to converge? One of my models converges to roughly the same values as it does on its own, and lambda rapidly and heavily favors that model.

In the Kruschke book, they talk about avoiding this for transdimensional MCMC by running long chains or by manually initializing the parameters at their fits from the standalone models. Does this apply in the mixture-model context in Stan as well?



Bob Carpenter

Mar 5, 2014, 12:08:35 PM
to stan-...@googlegroups.com
Reasonable initializations always help. And you should always run
models until everything converges, or fix the models so that they do.

Initializing a mixture with individual fits is not usually a reasonable
estimate of the mixture itself. For instance, if you mix two normals,
they'll both be initialized to the same fit, which is not where they'll go
in the mixture itself.

What can happen in mixtures is that one component will dominate and
the other will be harder to estimate. This should be less problematic
in Stan, where the discrete responsibility parameter is marginalized out,
than in a Gibbs sampler that samples it directly.

Or to take the normal again, it's hard to fit a normal mixture to data that's
really unimodal normal. The components aren't well distinguished from one
another and the degenerate solution of assigning all weight to one component should
win out. Priors on the mixing ratio can help prevent this problem.

What is the mixture ratio you're fitting?

- Bob