Meaning of lp__ and model complexity


Bob Carpenter

Mar 10, 2013, 5:20:20 PM
to stan-...@googlegroups.com
Forwarding from Rudolf Cardinal (with permission):


I too am trying to grasp this and am very probably being dense, but...

/Bayes factor. /For a model comparison between models M1 and M2 with data D, the posterior odds are, as I understand it,
P(M1 | D) / P(M2 | D) = P(D | M1)P(M1) / P(D | M2)P(M2), which for equal priors, P(M1) = P(M2), reduces to the Bayes
factor: the ratio of model evidences, P(D | M1) / P(D | M2).

/Model evidence. /Now, as I understand it, for a given model M that has parameters T, the evidence is P(D | M) =
integral.dT.P(D | T,M).P(T | M), i.e. integrating out specific values of the parameters T (and the process of taking P(D
| M) also penalizes complex models appropriately; Daw 2011, MacKay 2003, chapter 28).

/Approximating model evidence by sampling. /I gather that an approximation to this integral is the mean of a large
number K of samples of P(D | T,M).P(T | M), as long as those K samples are sampled according to the probability
distribution of the parameters T; i.e. P(D | M) is approximately (1 / K) * sum(for j=1:K)[ P(D | Tj,M).P(Tj | M) ].

So if Stan's lp__ variable is "the joint log probability, log p(params,data)..." for each sample, and if that sampling
(by Stan's MCMC method) is according to the distribution of the parameters, then is not the mean value of lp__ from a
whole set of samples (as e.g. summarized by RStan) a good approximation to the model evidence P(D | M)?

Empirically, it appears not: I used the example from Kruschke (2011, "Doing Bayesian Data Analysis", fig. 5.3: coin flips with 7
heads 7 tails and priors of either beta(1,1) [uniform] or beta(100,100)) and it certainly gave different answers;
R/RStan code at the bottom; the commented "print" statements were to view the lp__ update as it progresses.

Kruschke (2011, p. 137) uses a mean-of-samples approach to estimate P(D | M), written there as P(D), and gives the
practical reason for not doing it with a Metropolis MCMC algorithm as being that the number of samples needed for this
method to converge is huge. However, changing from 2,000 to 100,000 iterations with Stan (same code as below) makes no
difference.
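(As a cross-check on those target values: for a Beta(a, b) prior and an ordered Bernoulli sequence with h heads and t tails, the evidence has the closed form P(D | M) = B(a + h, b + t) / B(a, b), so it can be computed exactly. A minimal Python sketch, independent of Stan:)

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence(a, b, heads, tails):
    """log P(D | M) for a Beta(a, b) prior and an ordered Bernoulli
    sequence with the given head/tail counts: B(a+h, b+t) / B(a, b)."""
    return log_beta_fn(a + heads, b + tails) - log_beta_fn(a, b)

ev1 = exp(log_evidence(1, 1, 7, 7))      # Beta(1,1) prior    -> ~1.94e-5
ev2 = exp(log_evidence(100, 100, 7, 7))  # Beta(100,100) prior -> ~5.90e-5
print(ev1, ev2, ev2 / ev1)
```

This reproduces the 1.94e-5 and 5.9e-5 figures cited from Kruschke below, and their ratio (about 3.04) is the Bayes factor favoring the Beta(100,100) model for this data.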

So I presume the "problem" is that lp__ is not accumulating log( p(data | parameters) ) + log( p(parameters) ) -- but it
certainly seems to start with something related to log( p(parameters) ) and adds log( p(data | parameters) ) every time
you "fit" a data point -- but what is it doing that's different? Is the clue in the words "up to a proportion", and I'm
failing to grasp that?

I fear I am contributing merely to the examples of common misunderstandings but would be very interested to know
if/where this goes wrong!

Many thanks and all the best,
Rudolf.


explore_meaning_of_lp <- function() {
  # Requires the rstan package: library(rstan)
  coindata <- list(
    y = c(rep(0, 7), rep(1, 7)),
    N = 14
  )
  model1 <- '
    data {
      int<lower=1> N;
      int<lower=0,upper=1> y[N];
    }
    parameters {
      real<lower=0,upper=1> theta;
    }
    model {
      real previous_lp;
      real lp_diff;
      //print("on this iteration, theta = ", theta);
      //print("Start: lp__=", lp__); // zero when parameters unconstrained; non-zero for truncated real
      theta ~ beta(1,1); // uniform prior
      //print("parameter prior applied: lp__=", lp__, ", or p=", exp(lp__));
      for (t in 1:N) {
        previous_lp <- lp__;
        y[t] ~ bernoulli(theta); // fit the data
        lp_diff <- lp__ - previous_lp;
        //print("single point fitting (", y[t], ") added LP of ", lp_diff, ", equivalent to p=", exp(lp_diff));
      }
      //print("y fitted: lp__=", lp__, ", or p=", exp(lp__));
    }
  '
  # Model 2 is the same as model 1, except for the prior.
  model2 <- '
    data {
      int<lower=1> N;
      int<lower=0,upper=1> y[N];
    }
    parameters {
      real<lower=0,upper=1> theta;
    }
    model {
      real previous_lp;
      real lp_diff;
      //print("on this iteration, theta = ", theta);
      //print("Start: lp__=", lp__); // zero when parameters unconstrained; non-zero for truncated real
      theta ~ beta(100,100);
      //print("parameter prior applied: lp__=", lp__, ", or p=", exp(lp__));
      for (t in 1:N) {
        previous_lp <- lp__;
        y[t] ~ bernoulli(theta); // fit the data
        lp_diff <- lp__ - previous_lp;
        //print("single point fitting (", y[t], ") added LP of ", lp_diff, ", equivalent to p=", exp(lp_diff));
      }
      //print("y fitted: lp__=", lp__, ", or p=", exp(lp__));
    }
  '
  fit1 <- stan(model_code = model1, data = coindata, iter = 2000, seed = 1234, chains = 4)
  # theta mean = 0.50098
  # lp__ mean = -11.59037 (equivalent to p = 9.25e-6)
  # HOWEVER, P(D | M) should be 1.94e-5 (Kruschke 2011, p. 89)
  # 100,000 iterations: lp__ mean = -11.60721
  fit2 <- stan(model_code = model2, data = coindata, iter = 2000, seed = 1234, chains = 4)
  # theta mean = 0.49974
  # lp__ mean = -148.85339 (equivalent to p = 2.25e-65)
  # HOWEVER, P(D | M) should be 5.9e-5 (Kruschke 2011, p. 89)
  # 100,000 iterations: lp__ mean = -148.83828
  return(list(fit1, fit2))
}





Bob Carpenter

Mar 10, 2013, 6:11:45 PM
to stan-...@googlegroups.com
(Does somebody want to take a whack at writing up all of
this stuff (DIC, Bayes factors, etc. etc.) for the manual?)

I'm not that familiar with Bayes factors, but the issue
you may be having is that our lp__ is not normalized -- it
drops all the constant terms. For instance, the statement

model {
...
theta ~ beta(100,100);
...
}

drops the Gamma terms in the density for Beta because they're
constant.

More inline.

> Forwarding from Rudolf Cardinal (with permission):
>
>
> I too am trying to grasp this and very probably being dense, but...
>
> /Bayes factor. /For a model comparison between models M1 and M2 with data D, the posterior odds are, as I understand it,
> P(M1 | D) / P(M2 | D) = P(D | M1)P(M1) / P(D | M2)P(M2), which for equal priors, P(M1) = P(M2), reduces to the Bayes
> factor: the ratio of model evidences, P(D | M1) / P(D | M2).

> /Model evidence. /Now, as I understand it, for a given model M that has parameters T, the evidence is P(D | M) =
> integral.dT.P(D | T,M).P(T | M), i.e. integrating out specific values of the parameters T (and the process of taking P(D
> | M) also penalizes complex models appropriately; Daw 2011, MacKay 2003, chapter 28).

But this is w.r.t. the prior, p(T|M), not the posterior,
p(T|M,D). It's the latter that's sampled by Stan.

> /Approximating model evidence by sampling. /I gather that an approximation to this integral is the mean of a large
> number K of samples of P(D | T,M).P(T | M), as long as those K samples are sampled according to the probability
> distribution of the parameters T; i.e. P(D | M) is approximately (1 / K) * sum(for j=1:K)[ P(D | Tj,M).P(Tj | M) ].

Right. But here p(T|M) is the prior density, not
the posterior.

Also, you don't want the p(Tj|M) term for the j-th sample.
That's built into the sampling. So if you have samples
T(1)...T(N) drawn according to P(T|M), then you have

P(D|M) approx 1/N SUM_{n=1}^N P(D|T(n),M)

But as I've said, our model doesn't compute P(D|T(n),M),
it computes an unnormalized p(D,T(n)|M).

> So if Stan's lp__ variable is "the joint log probability, log p(params,data)..." for each sample, and if that sampling
> (by Stan's MCMC method) is according to the distribution of the parameters, then is not the mean value of lp__ from a
> whole set of samples (as e.g. summarized by RStan) a good approximation to the model evidence P(D | M)?

It's only the joint up to a proportion (dropping constant terms).
These will vary across different model structures.
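(To make the dropped constants concrete, here is a sketch in Python, not anything Stan itself computes: a `theta ~ beta(a,b)` statement contributes only the kernel (a-1)*log(theta) + (b-1)*log(1-theta) to lp__; the normalizing term -log B(a,b) is dropped because it doesn't depend on theta.)

```python
from math import lgamma

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b) -- the term a `~ beta(a,b)`
    statement drops, since it is constant in theta."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Constant dropped by each of the two coin-flip models in this thread:
dropped_m1 = -log_beta_fn(1, 1)      # Beta(1,1) prior: exactly 0
dropped_m2 = -log_beta_fn(100, 100)  # Beta(100,100) prior: ~ +139.67
print(dropped_m1, dropped_m2)
```

Since the dropped constants differ by about 139.67 nats between the two models, their lp__ values are not on a comparable scale, quite apart from the prior-vs-posterior issue above.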

> Empirically, it appears not: I used the example from Kruschke (2011, "Doing Bayesian Data Analysis", fig. 5.3: coin flips with 7
> heads 7 tails and priors of either beta(1,1) [uniform] or beta(100,100)) and it certainly gave different answers;
> R/RStan code at the bottom; the commented "print" statements were to view the lp__ update as it progresses.
>
> Kruschke (2011, p. 137) uses a mean-of-samples approach to estimate P(D | M), written there as P(D), and gives the
> practical reason for not doing it with a Metropolis MCMC algorithm as being that the number of samples needed for this
> method to converge is huge. However, changing from 2,000 to 100,000 iterations with Stan (same code as below) makes no
> difference.
>
> So I presume the "problem" is that lp__ is not accumulating log( p(data | parameters) ) + log( p(parameters) ) -- but it
> certainly seems to start with something related to log( p(parameters) ) and adds log( p(data | parameters) ) every time
> you "fit" a data point -- but what is it doing that's different? Is the clue in the words "up to a proportion", and I'm
> failing to grasp that?

I think so. Hopefully someone will write up the manual
section on all of this!

- Bob

Ben Goodrich

Mar 10, 2013, 7:26:14 PM
to stan-...@googlegroups.com
On Sunday, March 10, 2013 6:11:45 PM UTC-4, Bob Carpenter wrote:
I'm not that familiar with Bayes factors, but the issue
you may be having is that our lp__ is not normalized -- it
drops all the constant terms.   For instance, the statement

   model {
       ...
       theta ~ beta(100,100);
       ...
   }

drops the Gamma terms in the density for Beta because they're
constant.

Maybe Stan should calculate lp__ once at the end with doubles and propto as False? Then the difference between that and the last draw gives the constants collectively, which could then retroactively be added to all realizations of lp__ if normalized calculations are needed.

Ben

Bob Carpenter

Mar 10, 2013, 7:49:02 PM
to stan-...@googlegroups.com
Some more notes on lp__ in Stan and the computation of Bayes factors.

1. lp__ is the log probability of the unconstrained parameters,
so it includes the log Jacobian determinant of any variable transforms.

2. The probability functions when evaluated in the generated
quantities block will NOT drop the proportional calculations.

So conceivably, if you had binomial data and wanted to compute
the evidence (i.e., data probability relative to the prior)
for some data in a binomial model with prior Beta(100,100) on
the chance of success parameter theta, then you could do it like this:

data {
  int<lower=0> N;    // # of items
  int<lower=0> x[N]; // # of trials for item n
  int<lower=0> y[N]; // # of successes for item n
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(100,100);
}
generated quantities {
  real log_prob;
  log_prob <- 0;
  for (n in 1:N)
    log_prob <- log_prob + binomial_log(y[n], x[n], theta);
}

Stan will output values for log_prob. These are essentially
log likelihoods

log p(y|x,theta)

for the binomial model with theta sampled from
the prior, theta ~ beta(100,100).

The quantity we want is

p(y|M) = INT p(y|theta) p(theta|M) d.theta

which is approximated by

1/N SUM_{n in 1:N} p(y|theta(n))

if theta(n) are samples drawn from p(theta|M).

We can't quite just take an average of the log probs,
because log isn't a linear function, so that

mean log x <= log mean x

We can't simply convert to the linear scale because
the calculation will underflow for even modest sized
data sets.
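(A quick Python illustration of the underflow: a double can represent log probabilities far below the point where exponentiating them, at around -745, flattens to exactly zero.)

```python
import math

# The lp__ mean of -148.85 from the thread's second model is still
# safely exponentiable, but a modestly larger data set is not.
print(math.exp(-148.85))  # tiny, but still representable
print(math.exp(-800))     # underflows to exactly 0.0
```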

But we can use the handy LOG_SUM_EXP function to robustly
compute the log evidence,

log p(y|M)

approx.

log (1/N SUM_{n in 1:N} p(y|theta(n)))

= -log(N) + log SUM_{n in 1:N} p(y|theta(n))

= -log(N) + log sum_{n in 1:N} exp log p(y|theta(n))

= -log(N) + LOG_SUM_EXP_{n in 1:N} log p(y|theta(n))
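(A runnable sketch of exactly this estimator for the thread's coin example, in Python rather than Stan; drawing theta from the Beta(1,1) prior, which is just uniform:)

```python
import random
from math import log, log1p, exp

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

random.seed(1234)
K = 100_000
heads, tails = 7, 7

# log p(y | theta_k) for K draws of theta from the Beta(1,1) (uniform) prior
log_liks = []
for _ in range(K):
    theta = random.random()
    log_liks.append(heads * log(theta) + tails * log1p(-theta))

log_evidence = -log(K) + log_sum_exp(log_liks)
print(exp(log_evidence))  # close to the exact evidence B(8,8), about 1.94e-5
```

With a uniform prior and only 14 observations the prior draws cover the posterior well, so this converges quickly; as noted below, that stops being true once the posterior concentrates relative to the prior.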

But...

This isn't going to work in practice when the posterior
is concentrated relative to the prior. That's because we're
likely to miss the high probability mass area of the posterior
by randomly sampling from a more diffuse prior. And the problem
gets much worse with both dimensionality of the parameter
vector (because of how distance works) and the amount of data (because
of how posteriors concentrate relative to priors).

There's a discussion of this issue in section 4.2 of Kass and Raftery,
the observation of which they attribute to McCulloch and Rossi (1991):

http://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf

Anyway, we care more about the posterior and in particular,
posterior predictive inferences, than we care about comparing
evidence (in the technical sense) for models. So computing Bayes
factors isn't on our to-do list (it's rare I can say that about
a textbook topic in Bayesian stats).

- Bob

Bob Carpenter

Mar 10, 2013, 7:54:51 PM
to stan-...@googlegroups.com


On 3/10/13 7:26 PM, Ben Goodrich wrote:
> On Sunday, March 10, 2013 6:11:45 PM UTC-4, Bob Carpenter wrote:
>
> I'm not that familiar with Bayes factors, but the issue
> you may be having is that our lp__ is not normalized -- it
> drops all the constant terms. For instance, the statement
>
> model {
> ...
> theta ~ beta(100,100);
> ...
> }
>
> drops the Gamma terms in the density for Beta because they're
> constant.
>
>
> Maybe Stan should calculate lp__ once at the end with doubles and propto as False?

That's an interesting idea. We could make it
an option and spit it out as another "parameter".
It'll take a bit of work on the code generator side
and on the command side, but it's not that big a deal.

Do we really care about it, though?

I don't want to just add features because we can. The
combinatorics of their interactions will inevitably weigh down future
development unless they can be neatly factored.

> Then the difference between that and
> the last draw gives the constants collectively, which could then retroactively
> be added to all realizations of lp__ if
> normalized calculations are needed.

The regular versions of lp__ also include the Jacobian adjustments,
because they're on the unconstrained scale.

I think a final log prob value would be more useful for users
on the natural model scale. But then we couldn't just do the
clever diff.

I've also been thinking that, instead of a single
model block, we could have a model that separates the likelihood and
prior terms. Then we could do something more like DIC.
The issue there is that DIC also wants the value of the log
likelihood at the posterior mean, which isn't easy to get
within Stan. It could be done on the R side by calculating
the mean and then feeding that back into the model. It
could even be done by a user now that we can access the log
prob function directly.

- Bob

Ben Goodrich

Mar 10, 2013, 8:52:23 PM
to stan-...@googlegroups.com
On Sunday, March 10, 2013 7:54:51 PM UTC-4, Bob Carpenter wrote:
> Then the difference between that and
> the last draw gives the constants collectively, which could then retroactively
> be added to all realizations of lp__ if
> normalized calculations are needed.

The regular versions of lp__ also include the Jacobian adjustments,
because they're on the unconstrained scale.

I think a final log prob value would be more useful for users
on the natural model scale.  But then we couldn't just do the
clever diff.

Does the presence of the Jacobian adjustments mess up the model comparison calculations that people want to do? In other words, can't they just do (some) model comparisons on the unbounded parameter space if they add the constants back in?

As to whether people want to print an lp__ with or without constants, I don't think they should care. But it would be kind of useful to be able to compare with a log-likelihood that is calculated in R, which would include the constants.

Ben

Michael Betancourt

Mar 11, 2013, 3:38:43 AM
to stan-...@googlegroups.com
> (Does somebody want to take a whack at writing up all of
> this stuff (DIC, Bayes factors, etc. etc.) for the manual?)

I'm happy to write something up about Bayes factors/model evidence, but I don't feel comfortable quite yet with the various information criteria to talk about them with any generality (they're not hard to present as definitions, but the assumptions that go into them are often hard to understand).

Most importantly, NEVER try to calculate the model evidence using posterior draws. You're essentially calculating the Monte Carlo estimate of the function f(alpha) = 1, which has infinite Monte Carlo variance -- it's the one estimate you can't get with MCMC alone. Stable, accurate estimates of the evidence require more sophisticated techniques like nested sampling which could be coded in Stan only after adding some new features (non-holonomic, multivariate constraints -- this is where bouncing is a requirement).

-Mike

Bob Carpenter

Mar 11, 2013, 4:07:11 AM
to stan-...@googlegroups.com


On 3/10/13 8:52 PM, Ben Goodrich wrote:
> On Sunday, March 10, 2013 7:54:51 PM UTC-4, Bob Carpenter wrote:
>
> > Then the difference between that and
> > the last draw gives the constants collectively, which could then retroactively
> > be added to all realizations of lp__ if
> > normalized calculations are needed.
>
> The regular versions of lp__ also include the Jacobian adjustments,
> because they're on the unconstrained scale.
>
> I think a final log prob value would be more useful for users
> on the natural model scale. But then we couldn't just do the
> clever diff.
>
>
> Does the presence of the Jacobian adjustments mess up the model comparison calculations that people want to do?

I have no idea. Andrew's been spending a lot of time
thinking about these things, so I can ask him during
the meeting.

> In other
> words, can't they just do (some) model comparisons on the unbounded parameter space if they add the constants back in?

I don't know where the parameterization is going to matter.
I'd think it'd all just cancel out if the measures made sense.

> As to whether people want to print an lp__ with or without constants, I don't think they should care. But it would be
> kind of useful to be able to compare with a log-likelihood that is calculated in R, which would include the constants.

In general, a Stan model only defines the log posterior up to an additive constant. You'd have to define
the log likelihood explicitly to be assured of getting it. Even if the specification were of the log joint,
you'd need to separate the calculations of the prior and the likelihood.

The comparison calcs under DIC additionally require
the model log likelihood at the posterior mean. And for WAIC,
we need the log likelihoods of individual data points.

So I think the jury's still out on how best to do all of this.

- Bob

Andrew Gelman

Mar 11, 2013, 5:23:52 PM
to stan-...@googlegroups.com
Hi, just a few things.
1. I think the marginal probability of the data is overrated. And it's basically meaningless when people use flat priors (all those N(0,100000) things that people keep sending us).
2. Mike's right that you can't compute p(y) from the simulations. What you can do (sometimes) is compute p(y|M1)/p(y|M2) based on simulations from model 1 and simulations from model 2 (that's called bridge sampling in statistics and something else in physics) and to compute relative values of p(y|M), where "M" is some continuously-parameterized family of models (that's called path sampling in statistics and I-don't-know-what in physics, and has connections to simulated tempering).
3. BIC is not p(y). BIC is bullshit (as Ben might say, and I agree with his phrasing on this one).
4. Right now I prefer WAIC to DIC. There's also cross-validation. All these require additional info to subdivide the data into pieces (or, in the case of DIC, to subdivide data from parameters and to feed in a point estimate).
A
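(For the simple coin model, the path-sampling identity log p(y) = INT_0^1 E_t[log p(y|theta)] dt can be verified directly, because the "power posterior" p_t(theta) proportional to p(theta) p(y|theta)^t is itself a Beta distribution, so the inner expectation is available in closed form. A toy Python sketch; in real problems each inner expectation would be estimated by MCMC at that t, and the digamma function here is approximated numerically:)

```python
from math import lgamma, exp

def digamma(x, h=1e-5):
    """Crude numerical digamma: central-difference derivative of lgamma
    (plenty accurate for this illustration)."""
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def expected_log_lik(t, heads=7, tails=7):
    """E[log p(y|theta)] under the power posterior at temperature t, which
    for a Beta(1,1) prior is Beta(1 + heads*t, 1 + tails*t)."""
    a, b = 1 + heads * t, 1 + tails * t
    return (heads * (digamma(a) - digamma(a + b))
            + tails * (digamma(b) - digamma(a + b)))

# Path sampling / thermodynamic integration over t in [0, 1],
# via the trapezoidal rule on a fine grid.
K = 2000
vals = [expected_log_lik(k / K) for k in range(K + 1)]
log_evidence = (sum(vals) - 0.5 * (vals[0] + vals[-1])) / K
print(exp(log_evidence))  # recovers the exact evidence B(8,8), about 1.94e-5
```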
