batch of coins and Bayes factor: puzzling results

723 views

Emanuele Olivetti

Feb 14, 2011, 11:05:27 AM2/14/11
Hi everyone,

I am investigating hypothesis testing with the Bayes factor. I have
two competing models, M0 and M1. The Bayes factor B_10 =
p(observations|M1)/p(observations|M0) is the evidence in favor of M1
with respect to M0 [0]. I've implemented the models with PyMC but I am
getting unstable/puzzling results that's why I asking you for feedback
on the implementation as well as on the whole idea.

Here is a short abstract description of the problem: there is a coin
factory which is suspected to produce biased/loaded coins. For this
specific problem assume that the possible bias is expected to occur
towards one specific side of the coin only, say tail. The two
competing hypothesis are: H0 = the factory produces fair coins, H1 =
the factory produces coins that turn tail too frequently so they are
not fair.

In order to test the hypotheses we get N coins from the factory, say
N=10, and we toss each coin n times, say n=50. The observations
{h_1,...,h_N} are then the N numbers of heads obtained by tossing each
of the N coins m times. Let p_i be the (true) probability to get head
from the i-th coin. The two competing models/hypotheses I want to test
are basically Beta-Binomial with different priors:

M0/H0 : the number of heads h_i of each coin (i=1...N) is
binomially(n,p_i) distributed. All {p_i}_i are drawn from the same
Beta distribution. The mean 'mu' of the Beta is exactly 0.5 (ie. it is
expected that p_i=0.5, ie. the number of heads is the same as that of
tails, on average) and the variance sigma^2 is unknown. As far as I
read from Gelman 2006 [0], a uniform prior on sigma (not sigma2)
should be a good choice for a non-informative prior on variance.

M1/H1 : the number of heads h_i of each coin (i=1...N) is
binomially(n,p_i) distributed. All {p_i}_i are drawn from the same
Beta distribution. The mean 'mu' of the Beta lies in [0,0.5) (ie. it
is expected that p_i<0.5, ie. the number of tails is greater than that
of heads) and the variance sigma^2 is unknown. Non-informative priors
for mu and sigma are again uniform distributions.

Attached you can find an implementation of the two models and an
example with the computation of the Bayes factor. In that example the
coins turn head quite visibly less than half of the times so the Bayes
factor B_10 should be pretty big. Unexpectedly it is low and unstable,
like B_10<3 (and sometimes below 1) despite the generous amount of
sampling iterations.

Best,

Emanuele

[0]: Bayes Factors, Kass and Raftery, JASA, 1995.
[1]: Prior distributions for variance parameters in hierarchical
models, Gelman, Bayesian Analysis, 2006.

batch_of_coins.py

Chris Fonnesbeck

Feb 16, 2011, 12:15:08 PM2/16/11
to PyMC
Hi Emanuele,

This is a nice illustrative problem, but I don't think you are
calculating your Bayes factor properly. The BF is the posterior odds
of two models divided by the prior odds, based on this relationship,
which is given in Kass and Raftery:

posterior odds = BF x prior odds

In your calculations, I see neither prior nor posterior probabilities
of each model, which are required to calculate the BF. The posterior
mean of the likelihood values does not give you enough information to
calculate the Bayes factor. The posterior probabilities of the models
can be calculated in a few different ways, while the prior
probabilities of the models are specified by you.

My recent blog post on this might be helpful:

http://stronginference.squarespace.com/weblog/2010/12/16/estimating-bayes-factors-using-pymc.html

cf

Thomas Wiecki

Feb 16, 2011, 2:23:16 PM2/16/11
Hi Chris,

thanks, I haven't seen that blog post. One question about that. How
would you have calculated the BF if you had created two separate
models instead of using the Categorial?

Just FYI, there is a nice short-cut to computing the BF for hypothesis
testing of a prior belief against the posterior belief (i.e. how much
did our certainty in a particular hypothesis change by seeing the
data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
mu==5 the computation of the corresponding BF becomes: BF =
posterior(mu=5) / prior(mu=5). I have some code if you are interested.

-Thomas

> --
> You received this message because you are subscribed to the Google Groups "PyMC" group.
> To post to this group, send email to py...@googlegroups.com.
> To unsubscribe from this group, send email to pymc+uns...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/pymc?hl=en.
>
>

Chris Fonnesbeck

Feb 16, 2011, 3:35:03 PM2/16/11
to PyMC
On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>
wrote:
> thanks, I haven't seen that blog post. One question about that. How
> would you have calculated the BF if you had created two separate
> models instead of using the Categorial?
>

It gets tricky. You have to have some way of calculating the marginal
posterior probability of each model. One other way is to do reversible
jump MCMC (Green 1995), but this is not easy to implement in a general
way. Also, I suppose you could generate DIC model weights for certain
models. I think this approach is the easiest, as long as there aren't
many models.

> Just FYI, there is a nice short-cut to computing the BF for hypothesis
> testing of a prior belief against the posterior belief (i.e. how much
> did our certainty in a particular hypothesis change by seeing the
> data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
> a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
> mu==5 the computation of the corresponding BF becomes: BF =
> posterior(mu=5) / prior(mu=5). I have some code if you are interested.

cf

John McFarland

Feb 16, 2011, 4:40:00 PM2/16/11
to PyMC
On Feb 16, 2:35 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:
> On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>
> wrote:
>
> > thanks, I haven't seen that blog post. One question about that. How
> > would you have calculated the BF if you had created two separate
> > models instead of using the Categorial?
>
> It gets tricky. You have to have some way of calculating the marginal
> posterior probability of each model. One other way is to do reversible
> jump MCMC (Green 1995), but this is not easy to implement in a general
> way. Also, I suppose you could generate DIC model weights for certain
> models. I think this approach is the easiest, as long as there aren't
> many models.

Would you not be able to set up a deterministic node or trace to track
the likelihood values under each model: P(D|M_i,theta_i)? Then the
MCMC would collect samples of the likelihood and the integration is
performed by taking an expectation.

John

Thomas Wiecki

Feb 16, 2011, 5:42:16 PM2/16/11
On Wed, Feb 16, 2011 at 3:35 PM, Chris Fonnesbeck <fonne...@gmail.com> wrote:
> On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>
> wrote:
>> thanks, I haven't seen that blog post. One question about that. How
>> would you have calculated the BF if you had created two separate
>> models instead of using the Categorial?
>>
>
> It gets tricky. You have to have some way of calculating the marginal
> posterior probability of each model. One other way is to do reversible
> jump MCMC (Green 1995), but this is not easy to implement in a general
> way. Also, I suppose you could generate DIC model weights for certain
> models. I think this approach is the easiest, as long as there aren't
> many models.

Why would calculating DIC be a problem when there are many models?
Also, do you think that DIC (when creating separate models) and BF
(when creating one model) have a direct relationship? Ideally they
should I believe.

>
>> Just FYI, there is a nice short-cut to computing the BF for hypothesis
>> testing of a prior belief against the posterior belief (i.e. how much
>> did our certainty in a particular hypothesis change by seeing the
>> data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
>> a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
>> mu==5 the computation of the corresponding BF becomes: BF =
>> posterior(mu=5) / prior(mu=5). I have some code if you are interested.
>

I haven't tested this in a while and cleaned it up a little, hope it
works. But it should be fairly self-explanatory:

def interpolate_trace(x, trace, range=(-1,1), bins=100):
"""Create a histogram over a trace and interpolate to get a
smoothed distribution."""
import scipy.interpolate

x_histo = np.linspace(range[0], range[1], bins)
histo = np.histogram(trace, bins=bins, range=range, normed=True)[0]
interp = scipy.interpolate.InterpolatedUnivariateSpline(x_histo, histo)(x)

return interp

def savage_dickey(pos, post_trace, range=(-1,1), bins=100,
prior_trace=None, prior_y=None):
"""Calculate Savage-Dickey density ratio test, see Wagenmakers et
al. 2010 at http://dx.doi.org/10.1016/j.cogpsych.2009.12.001

Arguments:
**********
pos<float>: position at which to calculate the savage dickey ratio
at (i.e. the specific hypothesis you want to test)
post_trace<numpy.array>: trace of the posterior distribution

Keyword arguments:
******************
prior_trace<numpy.array>: trace of the prior distribution
prior_y<numpy.array>: prior density at each point (must match
range and bins)
range<(int,int)>: Range over which to interpolate and plot
bins<int>: Over how many bins to compute the histogram over

IMPORTANT: Supply either prior_trace or prior_y.
"""
x = np.linspace(range[0], range[1], bins)

if prior_trace is not None:
# Prior is provided as a trace -> histogram + interpolate
prior_pos = interpolate_trace(pos, prior_trace, range=range, bins=bins)

elif prior_y is not None:
# Prior is provided as a density for each point -> interpolate
to retrieve positional density
import scipy.interpolate
prior_pos = scipy.interpolate.InterpolatedUnivariateSpline(x,
prior_y)(pos)
else:
assert ValueError, "Supply either prior_trace or prior_y
keyword arguments"

# Histogram and interpolate posterior trace at SD position
posterior_pos = interpolate_trace(pos, post_trace, range=range, bins=bins)

# Calculate Savage-Dickey density ratio at pos
sav_dick = posterior_pos / prior_pos

return sav_dick

John McFarland

Feb 16, 2011, 6:00:00 PM2/16/11
to PyMC
On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>
wrote:
> Hi Chris,
>
> thanks, I haven't seen that blog post. One question about that. How
> would you have calculated the BF if you had created two separate
> models instead of using the Categorial?
>

Chris,

Sort of an extension of Thomas's question, but would your categorical
variable approach work at all if you were comparing models that did
not share parameters? It seems to me that it would not, since your
likelihood function only depends on the parameters of the model
currently indicated by true_model.

I threw together some code to test my previous suggestion of tracking
each model's likelihood with a deterministic node. It appears to work
in pymc, but even with 50,000 samples or so, the Bayes factor is not
very repeatable. Probably this approach is ill-conditioned due to
taking the mean of likelihood values, which tend to be very small.
Maybe there is a way of working directly with the log-likelihood
values?

John

John McFarland

Feb 16, 2011, 6:42:44 PM2/16/11
to PyMC
On Feb 16, 5:00 pm, John McFarland <mcfar...@gmail.com> wrote:
> I threw together some code to test my previous suggestion of tracking
> each model's likelihood with a deterministic node.  It appears to work
> in pymc, but even with 50,000 samples or so, the Bayes factor is not
> very repeatable.  Probably this approach is ill-conditioned due to
> taking the mean of likelihood values, which tend to be very small.
> Maybe there is a way of working directly with the log-likelihood
> values?
>
> John

*Correction*: Actually, the Bayes factor does converge just fine with
this approach (I had a bug where I was computing it based on the
priors for theta instead of posteriors...) However, it is
significantly different than what is obtained using Chris's original
code in which the models share a common variable. Chris reported a
Bayes factor of about 14 in favor of the geometric model. By
comparing the likelihoods directly, with separate parameters for each
model, I obtain a Bayes factor of about 5 in favor of the geometric
model.

One thing I notice is that if each model is given its own parameter,
their posterior distributions are quite different (with the mean for
the geometric model showing much more posterior variance). To me,
this suggests that using the same parameter for both models is not
appropriate...

Code is below:

from pymc import Uniform, Lambda, geometric_like, poisson_like
from numpy import exp, log, array, ones
import pymc

# Data
Y = array([0,1,2,3,8])

# Poisson mean
mu_p = Uniform('mu_p', lower=0, upper=1000, value=4)

# Geometric mean
mu_g = Uniform('mu_g', lower=0, upper=1000, value=4)

# Geometric probability
p = Lambda('p', lambda mu_g=mu_g: 1/(1+mu_g))

Y_p = pymc.Poisson('Y_p', mu_p, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

@pymc.deterministic
def GL(p=p):
return exp(geometric_like(Y+1, p))

@pymc.deterministic
def PL(mu=mu_p):
return exp(poisson_like(Y, mu))

S = pymc.MCMC()
S.sample(10000,500,2)

print "BF:", S.GL.stats()['mean'] / S.PL.stats()['mean']

Chris Fonnesbeck

Feb 16, 2011, 9:17:39 PM2/16/11
to PyMC
Interesting approach, trying to calculate BF without model
probabilities, but I'm not sure you are getting Bayes factors there.
In fact, I cannot replicate my result using a shared mean -- I get a
value of about 6.9, which is way too low. I'll keep poking around.

Chris Fonnesbeck

Feb 16, 2011, 9:28:05 PM2/16/11
to PyMC
One issue is that you are supposed to use the harmonic mean (see Kass
and Raftery section 4.3). If you do this:

print "BF:", scipy.stats.hmean(S.trace('GL')[:]) /
stats.hmean(S.trace('PL')[:])

you get a BF of around 11 for the common mean.

cf

Chris Fonnesbeck

Feb 16, 2011, 9:37:24 PM2/16/11
to PyMC
On Feb 16, 8:28 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:
> One issue is that you are supposed to use the harmonic mean (see Kass
> and Raftery section 4.3). If you do this:
>
> print "BF:", scipy.stats.hmean(S.trace('GL')[:]) /
> stats.hmean(S.trace('PL')[:])
>
> you get a BF of around 11 for the common mean.

Second issue: you need PL and GL to include the prior for mu, since
they are included in the integral for the marginal likelihood. If you
do this, you get a BF of around 13, just like my original code. Here's
the whole model:

mu = Uniform('mu', lower=0, upper=1000, value=4)

# Geometric probability
p = Lambda('p', lambda mu=mu: 1/(1+mu))

Y_p = pymc.Poisson('Y_p', mu, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

@pymc.deterministic
def GL(p=p):
return exp(geometric_like(Y+1, p) + uniform_like(mu, 0, 1000))

@pymc.deterministic
def PL(mu=mu):
return exp(poisson_like(Y, mu) + uniform_like(mu, 0, 1000))

S = pymc.MCMC()
S.sample(10000,5000,1)

print "BF:", stats.hmean(S.trace('GL')[:]) / stats.hmean(S.trace('PL')
[:])

John McFarland

Feb 16, 2011, 11:03:44 PM2/16/11
to PyMC
Chris, I don't understand what you mean "without model
probabilities." The Bayes factor is just a ratio of marginal
likelihoods, right? (http://en.wikipedia.org/wiki/Bayes_factor)

John McFarland

Feb 16, 2011, 11:09:33 PM2/16/11
to PyMC
I'm not familiar with that reference, but I am simply trying to
compute the Bayes factor based on its mathematical definition. The
marginal likelihood is the integration of [P(D|theta_i, M_i) *
P(theta_i|M_i)] with respect to theta_i, which is by definition the
expectation of P(D|theta_i, M_i) with respect to the distribution
P(theta_i|M_i). I am just taking P(theta_i|M_i) to be the posterior
distribution for theta_i under M_i.

John McFarland

Feb 16, 2011, 11:17:30 PM2/16/11
to PyMC
On Feb 16, 8:37 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:
> On Feb 16, 8:28 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:
>
> > One issue is that you are supposed to use the harmonic mean (see Kass
> > and Raftery section 4.3). If you do this:
>
> > print "BF:", scipy.stats.hmean(S.trace('GL')[:]) /
> > stats.hmean(S.trace('PL')[:])
>
> > you get a BF of around 11 for the common mean.
>
> Second issue: you need PL and GL to include the prior for mu, since
> they are included in the integral for the marginal likelihood. If you
> do this, you get a BF of around 13, just like my original code. Here's
> the whole model:
>

Chris, I don't agree with this. See my previous post, but
mathematically the integral of [H(x)p(x)] with respect to the
parameter x is the expectation of H(x). In this context, this means
the marginal likelihood, L(Y|M_i), is the expectation of the
conditional likelihood over the distribution p(theta_i|M_i). This
means the marginal likelihood is the expectation of the conditional
likelihood, L(Y|theta_i,M_i). The code you have posted is instead
computing the expectation (in a "harmonic" sense, which I'm still
unclear on) of [L(Y|theta_i,M_i)*Prior(theta_i)], which is
mathematically Int[L(Y|
theta_i,M_i)*Prior(theta_i)*Posterior(theta_i)]d_theta_i, since your
Monte Carlo samples are drawn from the posterior for theta_i.

Chris Fonnesbeck

Feb 16, 2011, 11:52:06 PM2/16/11
to PyMC

On Feb 16, 10:03 pm, John McFarland <mcfar...@gmail.com> wrote:
> Chris, I don't understand what you mean "without model
> probabilities."  The Bayes factor is just a ratio of marginal
> likelihoods, right? (http://en.wikipedia.org/wiki/Bayes_factor)

Right, I was talking about calculating them from the ratio of the
posterior odds divided by the prior odds, which involves model
probabilities.

Chris Fonnesbeck

Feb 17, 2011, 12:00:12 AM2/17/11
to PyMC
On Feb 16, 10:17 pm, John McFarland <mcfar...@gmail.com> wrote:
> Chris, I don't agree with this.  See my previous post, but
> mathematically the integral of [H(x)p(x)] with respect to the
> parameter x is the expectation of H(x).

Yes, you're right. Well, I'm stumped then. Calculating the integral
directly should result in the same value as dividing the posterior
odds by the prior odds, but they do not appear to.

John McFarland

Feb 17, 2011, 12:06:05 AM2/17/11
to PyMC
Which results are you comparing against mine? The code you originally
posted doesn't give the same Bayes factor because by using a common
mu, I believe you aren't actually calculating the correct P(theta_i|
M_i) for each model. This is what I meant when I pointed out that if
you look at the posterior for mu_p and mu_g, separately under each of
the Poisson and geometric models, those posterior distributions are
different from each other, and so also different from the mixed (?)
posterior distribution for mu you formulated in your original code, in
which you only have one mu.

Chris Fonnesbeck

Feb 17, 2011, 12:08:54 AM2/17/11
to PyMC
> Which results are you comparing against mine?  The code you originally
> posted doesn't give the same Bayes factor because by using a common
> mu, I believe you aren't actually calculating the correct P(theta_i|
> M_i) for each model.  This is what I meant when I pointed out that if
> you look at the posterior for mu_p and mu_g, separately under each of
> the Poisson and geometric models, those posterior distributions are
> different from each other, and so also different from the mixed (?)
> posterior distribution for mu you formulated in your original code, in
> which you only have one mu.

I'm comparing my (that is, Link and Barker's) model with a common mu
to your model modified to have a common mu. I can get it close, using
the harmonic mean, but not up to 13.

John McFarland

Feb 17, 2011, 9:08:34 AM2/17/11
to PyMC
recreate your categorical modelling (in order to define Bayes factor
in terms of model probabilities), but I used separate mu parameters.
My approach was to create separate likelihood functions for Y|mu_m,
Y_mu_g, and Y|M, because we want the posterior distribution for the
parameters to be computed independently of which model is selected by
true_model. I guessed that the way to have the Y_M likelihood not try
and update mu_m and mu_g was to use ".value" to grab their current
values instead of giving them as arguments of the Y_M stochastic
(correct me if I'm off base here).

Appears to work, giving a Bayes factor of about 5, which matches my
marginal likelihood code...

# Data
Y = array([0,1,2,3,8])

# Prior model probabilities
pi = (0.1, 0.9)

# Poisson mean
mu_p = Uniform('mu_p', lower=0, upper=1000, value=4)

# Geometric mean
mu_g = Uniform('mu_g', lower=0, upper=1000, value=4)

# Geometric probability
p = Lambda('p', lambda mu=mu_g: 1/(1+mu))

Y_p = pymc.Poisson('Y_p', mu_p, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

# Index to true model
true_model = Categorical("true_model", p=pi, value=0)

@observed
def Y_M(value=Y, M=true_model):
"""Either Poisson or geometric, depending on M"""
return geometric_like(value+1, p.value)*(M==0) or
poisson_like(value, mu_p.value)

S = pymc.MCMC()
S.sample(20000,500,2)

post_geometric = 1.0 - S.true_model.stats()['mean']

print "BF:", post_geometric/(1-post_geometric) / (pi[0] / pi[1])

Emanuele Olivetti

Feb 17, 2011, 9:39:21 AM2/17/11

Hi Chris,

First of all thanks a lot for spending time on my problem. I read
you blog post with much interest and I think I mostly understand
the way you explain for computing the Bayes factor.

My approach, correct or not, was to apply the definition of
Bayes factor as ratio of the data likelihoods under different models
and to estimate p(D|model_i) from traces of the MCMC sampling
of parameters ('p's, as I called them in my coins problem) either
with mean or even better with harmonic mean. My idea was
analogous to that of John posted in this same thread. Please correct
me if my implementation does not reflect this idea. I am still
learning! :-) . In case my approach is not wrong, have you any
clues for the instability of B_10? My latests guess is that I just
need to sample much more.

I have a question about your blog post: why the two models should
share the mu node? I see that Thomas and John asked related
questions but can't see clearly the motivation of this exact one.
Do I have to expect different results when using two different 'mu's?

Best,

Emanuele

Emanuele Olivetti

Feb 17, 2011, 10:11:16 AM2/17/11

This is a very interesting thread, thanks a lot to all of you for
explanations and examples.

I attempted a simple Monte Carlo to understand what Bayes factor
to expect. I get a result between 13 and 14 (using two different 'mu's,
but even using one the result does not change).

The very few lines of code are attached. It requires ~10 sec. to
run the 1e6 iterations on my slow laptop.

Best,

Emanuele

bayesfactor_eo.py

John McFarland

Feb 17, 2011, 10:40:22 AM2/17/11
to PyMC
>  bayesfactor_eo.py

Emanuele,

Your code computes the Bayes factor under the prior distribution for
the model parameters, which is probably not what you want.

John

John McFarland

Feb 17, 2011, 10:52:41 AM2/17/11
to PyMC
On Feb 17, 8:39 am, Emanuele Olivetti <emanu...@relativita.com> wrote:
> Hi Chris,
>
> First of all thanks a lot for spending time on my problem. I read
> you blog post with much interest and I think I mostly understand
> the way you explain for computing the Bayes factor.
>
> My approach, correct or not, was to apply the definition of
> Bayes factor as ratio of the data likelihoods under different models
> and to estimate p(D|model_i) from traces of the MCMC sampling
> of parameters ('p's, as I called them in my coins problem) either
> with mean or even better with harmonic mean. My idea was
> analogous to that of John posted in this same thread. Please correct
> me if my implementation does not reflect this idea. I am still
> learning! :-) . In case my approach is not wrong, have you any
> clues for the instability of B_10? My latests guess is that I just
> need to sample much more.

Emanuele: Chris may have some additional insight here, but from
looking quickly at your code, I *think* one problem is that you need
to account for the likelihood of the p_i values under each Beta
distribution model. Some quick trace plots seem to indicate that the
posterior distributions for p_i are basically the same under each
model. The difference in support under M0 and M1 should come from the
fact that the p_i's you are seeing are more likely under M1 than M0.
Specifically, I think you need to add this likelihood calculation in

>
> I have a question about your blog post: why the two models should
> share the mu node? I see that Thomas and John asked related
> questions but can't see clearly the motivation of this exact one.
> Do I have to expect different results when using two different 'mu's?
>
> Best,
>
> Emanuele

Chris may have a different opinion here, but I argue that the use of
the same mu node is simply incorrect. If you're unsure, I would
suggest considering more general cases: for example, what if you have
two models that can't be represented with the same parameters (for
example, a normal distribution and a triangular distribution)?

Chris Fonnesbeck

Feb 17, 2011, 12:01:42 PM2/17/11
to PyMC
On Feb 17, 9:52 am, John McFarland <mcfar...@gmail.com> wrote:
> Chris may have a different opinion here, but I argue that the use of
> the same mu node is simply incorrect.  If you're unsure, I would
> suggest considering more general cases: for example, what if you have
> two models that can't be represented with the same parameters (for
> example, a normal distribution and a triangular distribution)?

I do disagree with this. It may be a better model to have separate mu
values, but there is nothing inherently wrong with specifying them to
be the same. The unknown model that generated the data has an expected
value, and the MLE of that expected value will be the mean of the
data, at least for these sorts of simple models. Your comparison with
a triangular distribution is a red herring -- just because you can't
express its parameters as an expected value, does not mean that the
expected values under the normal and the triangular will not be the
same.

cf

Emanuele Olivetti

Feb 17, 2011, 12:03:44 PM2/17/11
John

Why not? According to the definition the Bayes factor is the ratio
of the densities p(data|model_1)/p(data|model_2) obtained by
integrating over the parameter space:
p(data|model_k) = \int p(data|\theta_k, model_k) \pi(\theta_k|model_k) d\theta_k
where \pi(\theta_k|model_k) is its prior density (see for example Kass and
Raftery JASA 1995).

The basic way to estimate p(data|model_k) is by simple Monte Carlo
estimation:
p(data|model_k) = 1/m \sum_{i=1}^m p(data|\theta^i)
where {\theta^i}_i is a sample from the prior distribution
(see for example Section 4.2 of the paper mentioned above).

My understanding is that my implementation follows directly
from the Bayes factor definition.

Best,

Emanuele

John McFarland

Feb 17, 2011, 12:15:28 PM2/17/11
to PyMC
Chris, the posterior distributions for mu_p and mu_g simply aren't the
same here (PyMC demonstrates this easily: plot the posteriors for mu_p
and mu_g in my example code). Estimation of mu is not independent of
the statistical model employed to describe the data.

Admittedly, my normal/triangular example is not a great one. Better
examples arise if we consider regression models. For example,
consider calculation of a Bayes factor for two models for E[Y|x] for a
covariate x: M0: alpha*x, and M1: beta*x**2. Each model has one
parameter, but the parameters simply aren't related to each other. As
the models for E[Y|x] become more complex, this becomes more and more
evident.

John McFarland

Feb 17, 2011, 12:23:28 PM2/17/11
to PyMC
On Feb 17, 11:03 am, Emanuele Olivetti <emanu...@relativita.com>
If I get a chance, I will try to track down this reference. However,
my understanding is that p(theta_i|model_i) is simply any distribution
for theta_i that you choose; ideally, it represents your present state
of knowledge, which in a Bayesian sense is the posterior distribution
for theta_i under model_i. If you average the likelihood with respect
to the prior, your theta_i values aren't informed by the data. This
would tell you which model better supports the data, under their prior
distributions for theta_i. However, since you would more likely apply
the models for prediction using posterior distributions for theta_i, I
would be more interested to know which model better supports the data
under the posterior distributions for theta_i.

John McFarland

Feb 17, 2011, 2:28:44 PM2/17/11
to PyMC
On Feb 17, 11:23 am, John McFarland <mcfar...@gmail.com> wrote:
>
> If I get a chance, I will try to track down this reference.  However,
> my understanding is that p(theta_i|model_i) is simply any distribution
> for theta_i that you choose; ideally, it represents your present state
> of knowledge, which in a Bayesian sense is the posterior distribution
> for theta_i under model_i.  If you average the likelihood with respect
> to the prior, your theta_i values aren't informed by the data.  This
> would tell you which model better supports the data, under their prior
> distributions for theta_i.  However, since you would more likely apply
> the models for prediction using posterior distributions for theta_i, I
> would be more interested to know which model better supports the data
> under the posterior distributions for theta_i.

Well, some quick literature review and I need to apologize Emanuele:
it looks like I was wrong and p(theta_i|model_i) is interpreted as a
prior distribution for theta_i. Sorry for all the confusion I'm
causing here. Conceptually it is a little difficult for me to accept
measuring the models based on prior distributions for theta, since I
have a preference for vague/noninformative/reference priors, which
appear to give rise to various problems when used in conjunction with
the calculation of Bayes factors.

See, for example, Wasserman (2000; discussion around Eq. (24)), where
he states that the use of noninformative prior distributions gives
rise to arbitrary constants in the Bayes factor definition. He gives
an approximation (Eq. 25) for the marginal likelihoods that does not
depend on the prior for theta.

Maybe we can at least agree that calculation of Bayes factors is
trickier than it first appears...

Wasserman, L., Bayesian Model Selection and Model Averaging," Journal
of Mathematical Psychology, Vol. 44, No. 1, 2000.

Kass, R. E. and Wasserman, L., A reference Bayesian test for nested
hypotheses and its relationship to the Schwarz criterion," Journal of
the American Statistical Assocaition, Vol. 90, 1995, pp. 928-934.

Emanuele Olivetti

Feb 18, 2011, 4:38:34 AM2/18/11
On 02/17/2011 08:28 PM, John McFarland wrote:
> Well, some quick literature review and I need to apologize Emanuele:
> it looks like I was wrong and p(theta_i|model_i) is interpreted as a
> prior distribution for theta_i. Sorry for all the confusion I'm
> causing here. Conceptually it is a little difficult for me to accept
> measuring the models based on prior distributions for theta, since I
> have a preference for vague/noninformative/reference priors, which
> appear to give rise to various problems when used in conjunction with
> the calculation of Bayes factors.
>
> See, for example, Wasserman (2000; discussion around Eq. (24)), where
> he states that the use of noninformative prior distributions gives
> rise to arbitrary constants in the Bayes factor definition. He gives
> an approximation (Eq. 25) for the marginal likelihoods that does not
> depend on the prior for theta.
>
> Maybe we can at least agree that calculation of Bayes factors is
> trickier than it first appears...
>
> Wasserman, L., Bayesian Model Selection and Model Averaging," Journal
> of Mathematical Psychology, Vol. 44, No. 1, 2000.
>
> Kass, R. E. and Wasserman, L., A reference Bayesian test for nested
> hypotheses and its relationship to the Schwarz criterion," Journal of
> the American Statistical Assocaition, Vol. 90, 1995, pp. 928-934.
>

John,

No need to apologize! The discussion is extremely interesting.

Indeed the choice of noninformative prior is tricky, in the sense
that it may not be clear what prior is really noninformative. In
my initial post I mentioned this paper by Gelman

Prior distributions for variance parameters in hierarchical
models, Gelman, Bayesian Analysis, 2006.
http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

which discusses the choice of the prior for the variance of the
population in hierarchical models and provides very interesting
recommendations for noninformative priors. For example the paper
shows that commonly used priors (gamma etc.) are not appropriate.
It is my understanding that the problem of the batch of coins falls
within the range of this paper. [The problem proposed by Chris in
his blog post skips this issue.]

Thanks for the two references by Wasserman and Kass, I did not
know about them. Since I am spending quite a bit of time on the
Bayes factor lately good references are always warmly welcome!

For whoever is interested in the topic, here you can freely download
the paper about the Bayes factor I keep mentioning:
Bayes Factors, Kass and Raftery, JASA, 1995
http://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf
which seems to be the most famous one on this topic, at least

Best,

Emanuele

Chris Fonnesbeck

Feb 18, 2011, 9:39:22 AM2/18/11
to PyMC
On Feb 17, 11:15 am, John McFarland <mcfar...@gmail.com> wrote:
> Chris, the posterior distributions for mu_p and mu_g simply aren't the
> same here (PyMC demonstrates this easily: plot the posteriors for mu_p
> and mu_g in my example code).  Estimation of mu is not independent of
> the statistical model employed to describe the data.

*independent*, but I am asserting that for any two models that purport
to describe a particular dataset in a reasonable way should have
expected values that are similar, be that mu for a normal or alpha x
beta for a gamma, or whatever. If not, then I would expect at least
one of those models to have important lack of fit.

Chris Fonnesbeck

Mar 3, 2011, 10:37:11 AM3/3/11
to PyMC
Everyone's probably moved past this now, but having read through the
literature a bit more, it seems that the reason you cannot simply use
the MCMC samples to get at the Bayes factor is that the marginal
likelihood is integrated with respect to the *prior* distribution of
the parameters, not the posterior (as noted by Emanuele in the post
above). Hence, calculating geometric_like and poisson_like at each
iteration are not going to yield samples from the marginal likelihood
as required. Han and Carlin (2001) give a pretty good treatment of the
various approaches to calculating marginal likelihoods and BF. If
anyone is looking for a practical example, the "pines" example in
WinBUGS implements the Carlin and Chib approach described in the
paper. I'll try and get a PyMC equivalent going, for comparison.

Emanuele Olivetti

Mar 3, 2011, 4:45:49 PM3/3/11

Hi Chris,

Thanks for the information and the reference. I did not move on from
this problem :-) but found a simple yet efficient solution to computing
the Bayes factor of the problem of the batch of coins via simple Monte Carlo
integration. I can post more details on that if someone is still interested.

I am still very interested in the many ways to estimate the marginal
likelihoods and the Bayes factor and will look for the "pines" example
you mention. Of course I'd be very happy to see the PyMC equivalent
because I know almost nothing about WinBUGS.

Best,

Emanuele

Thomas Wiecki

Mar 3, 2011, 5:28:03 PM3/3/11
> Hi Chris,
>
> Thanks for the information and the reference. I did not move on from
> this problem :-) but found a simple yet efficient solution to computing
> the Bayes factor of the problem of the batch of coins via simple Monte Carlo
> integration. I can post more details on that if someone is still interested.

Definitely still of interest.

> I am still very interested in the many ways to estimate the marginal
> likelihoods and the Bayes factor and will look for the "pines" example
> you mention. Of course I'd be very happy to see the PyMC equivalent
> because I know almost nothing about WinBUGS.
>
> Best,
>
> Emanuele
>

> --
> You received this message because you are subscribed to the Google Groups
> "PyMC" group.
> To post to this group, send email to py...@googlegroups.com.
> To unsubscribe from this group, send email to
> For more options, visit this group at
>
>

John McFarland

Mar 4, 2011, 10:04:21 AM3/4/11
to PyMC
Thanks for the info, the example and code would be interesting. Might
also be interesting to see how the approximation given in Wasserman
(2000) compares, since it (a) doesn't depend on the prior and (b)
doesn't require integration (just finding the maximum likelihood
estimate), so it's very simple to compute.

Emanuele Olivetti

Mar 5, 2011, 5:06:13 PM3/5/11
On 03/03/2011 11:28 PM, Thomas Wiecki wrote:
>> Hi Chris,
>>
>> Thanks for the information and the reference. I did not move on from
>> this problem :-) but found a simple yet efficient solution to computing
>> the Bayes factor of the problem of the batch of coins via simple Monte Carlo
>> integration. I can post more details on that if someone is still interested.
> Definitely still of interest.
>

So here it is. I am spending time on the problem of the batch of coins
because it is a useful abstraction of a more concrete problem I am
working on. Here you can find the code that implements the computation
of the Bayes factor via simple Monte Carlo integration:
https://github.com/emanuele/Bayes-factor-multi-subject/blob/master/bayes_factor.py

There are a few little changes in the naming of the variables with
respect to the initial problem of the batch of coins I posted. So here
is a brief explanation in order to read the code:
N = number of coins
m = number of tosses of each coin
errors = it is what I called 'heads' in the example I posted, i.e. the number of
heads after m tosses of each coin. It is a 1D array of N integers.

Simple MC integration becomes viable because the true probability of
getting head of each coin, i.e. 'p', can be integrated out if the model is:
p~Beta(a,b)
In this case the pmf of heads becomes "Beta-binomial":
http://en.wikipedia.org/wiki/Beta-binomial_distribution
(Btw I saw that PyMC has an undocumented betabin_like ;-) ).

Since the coins are assumed to be independent of each other then
the marginal likelihood becomes the product of Beta-binomials, one
for each coin.

As mentioned in my previous posts I assume these noninformative
priors for the mean (mu) and std (sigma) of the population
of coins under the two hypothesis:
H_1: mu~Uniform(0,0.5) , sigma~Uniform(0, sqrt(mu*(1-mu))
H_0: mu=0.5 , sigma~Uniform(0,0.5)

Note: the uniform prior of sigma under H_1 is based on [0].

Sampling from those models/hypothesis is easy and inexpensive. Then I
compute (a, b) from (mu, sigma) - see code, and finally compute
the marginal likelihoods via simple MC:

where {(a_1^k,b_1^k}_k is a sample of size K from H_1 and
{(a_0^k,b_0^k}_k is a sample of size K from H_0.

As you can see the code executes in 1-2 sec. since 10^5 iterations are
enough to get a sufficiently stable Bayes factor B_10 (~=2.5 in the example
of the code).

Any question/comment/criticism is warmly welcome.

Best,

Emanuele

[0]: Prior distributions for variance parameters in hierarchical