Some more notes on lp__ in Stan and the computation of Bayes factors.
1. lp__ is the log density evaluated on the unconstrained scale,
so it includes the log Jacobian determinant of any variable transforms.
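To make that concrete (this is just the change-of-variables arithmetic for a (0,1)-bounded parameter, sketched in Python, not Stan's actual internals): Stan samples u = logit(theta) on the unconstrained scale, and the adjustment added to lp__ is log |d theta / d u| = log(theta) + log(1 - theta). A quick finite-difference check:

```python
import math

u = 0.7  # arbitrary value on the unconstrained scale
theta = 1.0 / (1.0 + math.exp(-u))  # inverse logit back to (0,1)

# the log Jacobian adjustment for the logit transform
log_jacobian = math.log(theta) + math.log(1.0 - theta)

# finite-difference approximation of d theta / d u
eps = 1e-6
theta_eps = 1.0 / (1.0 + math.exp(-(u + eps)))
fd = (theta_eps - theta) / eps

print(log_jacobian, math.log(fd))  # approximately equal
```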
2. Probability functions evaluated directly, as in the generated
quantities block, do NOT drop constant (proportional) terms.
So conceivably, if you had binomial data and wanted to compute
the evidence (i.e., the probability of the data averaged over
the prior) in a binomial model with prior Beta(100,100) on
the chance-of-success parameter theta, then you could do it like this:
data {
  int<lower=0> N;            // number of items
  array[N] int<lower=0> x;   // number of trials for item n
  array[N] int<lower=0> y;   // number of successes for item n
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(100, 100);    // no likelihood: theta is drawn from the prior
}
generated quantities {
  real log_prob = 0;
  for (n in 1:N)
    log_prob += binomial_lpmf(y[n] | x[n], theta);
}
Stan will output values for log_prob. These are essentially
log likelihoods
log p(y|x,theta)
for the binomial model with theta sampled from
the prior, theta ~ beta(100,100).
The quantity we want is the evidence,
p(y|M) = INTEGRAL p(y|theta) p(theta|M) d theta,
which is approximated by the Monte Carlo estimate
1/S * SUM_{s in 1:S} p(y|theta(s))
if the theta(s) are S draws from p(theta|M)
(writing S for the number of draws to avoid a clash with the data size N).
We can't quite just take an average of the log probs,
because log isn't a linear function; by Jensen's inequality
(log is concave),
mean log x <= log mean x.
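A quick numeric check of that inequality, with illustrative values:

```python
import math

# Jensen's inequality for the concave log:
# mean(log(x)) <= log(mean(x)), strict unless all x are equal.
x = [1.0, 10.0, 100.0]
mean_log = sum(math.log(v) for v in x) / len(x)
log_mean = math.log(sum(x) / len(x))
print(mean_log, log_mean)  # mean_log < log_mean
```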
Nor can we simply convert to the linear scale, because
the calculation will underflow for even modest-sized
data sets.
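To see the underflow concretely: double precision bottoms out around exp(-745), and a few hundred observations easily push the log likelihood far below that. A minimal sketch:

```python
import math

# A log likelihood of -800 is nothing unusual for a few hundred
# observations, but it is exactly 0.0 on the linear scale in
# double precision, so averaging exp(log_prob) directly is hopeless.
log_prob = -800.0
print(math.exp(log_prob))  # 0.0: underflows below about exp(-745)
```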
But we can use the handy LOG_SUM_EXP function to robustly
compute the log evidence (again with S draws theta(s) from the prior):
log p(y|M)
approx.
log (1/S * SUM_{s in 1:S} p(y|theta(s)))
= -log(S) + log SUM_{s in 1:S} p(y|theta(s))
= -log(S) + log SUM_{s in 1:S} exp(log p(y|theta(s)))
= -log(S) + LOG_SUM_EXP_{s in 1:S} log p(y|theta(s))
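Here's a sketch of that post-processing in Python, assuming the log_prob draws have already been extracted into a list (the function names here are mine, not Stan's):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)):
    pull out the max so the exponentials don't underflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_evidence(log_prob_draws):
    """-log(S) + log-sum-exp of the per-draw log likelihoods."""
    return -math.log(len(log_prob_draws)) + log_sum_exp(log_prob_draws)

# Even with draws near -800, where exp() underflows to zero,
# the estimate stays finite:
print(log_evidence([-800.0, -801.0, -799.5]))
```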
But...
This isn't going to work in practice when the posterior
is concentrated relative to the prior. That's because we're
likely to miss the high probability mass area of the posterior
by randomly sampling from a more diffuse prior. And the problem
gets much worse with both the dimensionality of the parameter
vector (because of how distance behaves in high dimensions) and the
amount of data (because of how posteriors concentrate relative to priors).
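A toy simulation makes the failure concrete (a hypothetical conjugate normal model, not the binomial example above): almost all of the log-sum-exp total comes from a handful of lucky prior draws, measured here by the usual effective-sample-size formula for importance weights.

```python
import math
import random

random.seed(0)

# Hypothetical setup: diffuse prior theta ~ normal(0, 10), data
# y_1..y_n ~ normal(theta, 1) with n = 10,000 and true theta = 3.
# The posterior sd is about 1/sqrt(n) = 0.01, vastly narrower
# than the prior sd of 10.
n = 10_000
y = [random.gauss(3.0, 1.0) for _ in range(n)]
thetas = [random.gauss(0.0, 10.0) for _ in range(1_000)]  # prior draws

# log p(y | theta) up to a constant (constants cancel in the weights)
s1 = sum(v for v in y)
s2 = sum(v * v for v in y)
ll = [-0.5 * (s2 - 2.0 * t * s1 + n * t * t) for t in thetas]

# effective number of draws contributing to the log-sum-exp estimate:
# ESS = (sum w)^2 / sum(w^2) for normalized weights w
m = max(ll)
w = [math.exp(v - m) for v in ll]
ess = sum(w) ** 2 / sum(v * v for v in w)
print(ess)  # typically a handful of effective draws out of 1,000
```

With only a few draws effectively contributing, the variance of the evidence estimate is enormous, which is exactly the failure described above.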
There's a discussion of this issue in section 4.2 of Kass and Raftery
(1995), who attribute the observation to McCulloch and Rossi (1991):
http://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf
Anyway, we care more about the posterior and in particular,
posterior predictive inferences, than we care about comparing
evidence (in the technical sense) for models. So computing Bayes
factors isn't on our to-do list (it's rare I can say that about
a textbook topic in Bayesian stats).
- Bob