
Feb 14, 2011, 11:05:27 AM2/14/11

to py...@googlegroups.com

Hi everyone,

I am investigating hypothesis testing with the Bayes factor. I have

two competing models, M0 and M1. The Bayes factor B_10 =

p(observations|M1)/p(observations|M0) is the evidence in favor of M1

with respect to M0 [0]. I've implemented the models with PyMC but I am

getting unstable/puzzling results, which is why I am asking you for
feedback on the implementation as well as on the whole idea.

Here is a short abstract description of the problem: there is a coin

factory which is suspected to produce biased/loaded coins. For this

specific problem assume that the possible bias is expected to occur

towards one specific side of the coin only, say tails. The two
competing hypotheses are: H0 = the factory produces fair coins; H1 =
the factory produces coins that turn up tails too frequently, so they
are not fair.

In order to test the hypotheses we get N coins from the factory, say

N=10, and we toss each coin n times, say n=50. The observations

{h_1,...,h_N} are then the N numbers of heads obtained by tossing each

of the N coins n times. Let p_i be the (true) probability of getting heads

from the i-th coin. The two competing models/hypotheses I want to test

are basically Beta-Binomial with different priors:

M0/H0 : the number of heads h_i of each coin (i=1...N) is

binomially(n,p_i) distributed. All {p_i}_i are drawn from the same

Beta distribution. The mean 'mu' of the Beta is exactly 0.5 (ie. it is

expected that p_i=0.5, ie. the number of heads is the same as that of

tails, on average) and the variance sigma^2 is unknown. As far as I

read from Gelman 2006 [1], a uniform prior on sigma (not sigma^2)
should be a good choice for a non-informative prior on the variance.

M1/H1 : the number of heads h_i of each coin (i=1...N) is

binomially(n,p_i) distributed. All {p_i}_i are drawn from the same

Beta distribution. The mean 'mu' of the Beta lies in [0,0.5) (ie. it

is expected that p_i<0.5, ie. the number of tails is greater than that

of heads) and the variance sigma^2 is unknown. Non-informative priors

for mu and sigma are again uniform distributions.

Attached you can find an implementation of the two models and an

example with the computation of the Bayes factor. In that example the

coins turn up heads visibly less than half of the time, so the Bayes
factor B_10 should be pretty big. Unexpectedly it is low and unstable
(B_10 < 3, and sometimes below 1) despite the generous number of
sampling iterations.
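For concreteness, here is a minimal sketch (separate from the attached code) of how such observations can be simulated; the mu and sigma values below are illustrative assumptions only:

```python
import numpy as np

np.random.seed(0)
N, n = 10, 50              # number of coins, tosses per coin
mu, sigma = 0.4, 0.05      # hypothetical Beta mean/std, biased towards tails

# Convert (mu, sigma) to the standard Beta(alpha, beta) parameterization
nu = mu * (1.0 - mu) / sigma**2 - 1.0
alpha, beta = mu * nu, (1.0 - mu) * nu

p = np.random.beta(alpha, beta, size=N)   # true head probability of each coin
h = np.random.binomial(n, p)              # observed number of heads per coin
```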

Best,

Emanuele

[0]: Bayes Factors, Kass and Raftery, JASA, 1995.

[1]: Prior distributions for variance parameters in hierarchical

models, Gelman, Bayesian Analysis, 2006.

Feb 16, 2011, 12:15:08 PM2/16/11

to PyMC

Hi Emanuele,

This is a nice illustrative problem, but I don't think you are

calculating your Bayes factor properly. The BF is the posterior odds

of two models divided by the prior odds, based on this relationship,

which is given in Kass and Raftery:

posterior odds = BF x prior odds

In your calculations, I see neither prior nor posterior probabilities

of each model, which are required to calculate the BF. The posterior

mean of the likelihood values does not give you enough information to

calculate the Bayes factor. The posterior probabilities of the models

can be calculated in a few different ways, while the prior

probabilities of the models are specified by you.

My recent blog post on this might be helpful:

http://stronginference.squarespace.com/weblog/2010/12/16/estimating-bayes-factors-using-pymc.html

cf


Feb 16, 2011, 2:23:16 PM2/16/11

to py...@googlegroups.com

Hi Chris,

thanks, I haven't seen that blog post. One question about that. How

would you have calculated the BF if you had created two separate

models instead of using the Categorical?

Just FYI, there is a nice short-cut to computing the BF for hypothesis

testing of a prior belief against the posterior belief (i.e. how much

did our certainty in a particular hypothesis change by seeing the

data) called the Savage-Dickey ratio test on which EJ Wagenmakers has

a paper (Wagenmakers et al 2010). Essentially, if one wants to test if

mu==5 the computation of the corresponding BF becomes: BF =

posterior(mu=5) / prior(mu=5). I have some code if you are interested.

-Thomas


Feb 16, 2011, 3:35:03 PM2/16/11

to PyMC

On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>
wrote:

> thanks, I haven't seen that blog post. One question about that. How
> would you have calculated the BF if you had created two separate
> models instead of using the Categorical?

It gets tricky. You have to have some way of calculating the marginal
posterior probability of each model. One other way is to do reversible
jump MCMC (Green 1995), but this is not easy to implement in a general
way. Also, I suppose you could generate DIC model weights for certain
models. I think this approach is the easiest, as long as there aren't
many models.

> Just FYI, there is a nice short-cut to computing the BF for hypothesis
> testing of a prior belief against the posterior belief (i.e. how much
> did our certainty in a particular hypothesis change by seeing the
> data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
> a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
> mu==5 the computation of the corresponding BF becomes: BF =
> posterior(mu=5) / prior(mu=5). I have some code if you are interested.

Always interested. Please forward.

cf

Feb 16, 2011, 4:40:00 PM2/16/11

to PyMC

On Feb 16, 2:35 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:

> On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>

> wrote:

>

> > thanks, I haven't seen that blog post. One question about that. How

> > would you have calculated the BF if you had created two separate

> > models instead of using the Categorial?

>

> It gets tricky. You have to have some way of calculating the marginal

> posterior probability of each model. One other way is to do reversible

> jump MCMC (Green 1995), but this is not easy to implement in a general

> way. Also, I suppose you could generate DIC model weights for certain

> models. I think this approach is the easiest, as long as there aren't

> many models.

Would you not be able to set up a deterministic node or trace to track

the likelihood values under each model: P(D|M_i,theta_i)? Then the

MCMC would collect samples of the likelihood and the integration is

performed by taking an expectation.

John

Feb 16, 2011, 5:42:16 PM2/16/11

to py...@googlegroups.com

On Wed, Feb 16, 2011 at 3:35 PM, Chris Fonnesbeck <fonne...@gmail.com> wrote:

> On Feb 16, 1:23 pm, Thomas Wiecki <thomas.wie...@googlemail.com>

> wrote:

>> thanks, I haven't seen that blog post. One question about that. How

>> would you have calculated the BF if you had created two separate

>> models instead of using the Categorial?

>>

>

> It gets tricky. You have to have some way of calculating the marginal

> posterior probability of each model. One other way is to do reversible

> jump MCMC (Green 1995), but this is not easy to implement in a general

> way. Also, I suppose you could generate DIC model weights for certain

> models. I think this approach is the easiest, as long as there aren't

> many models.

Why would calculating DIC be a problem when there are many models?

Also, do you think that DIC (when creating separate models) and BF

(when creating one model) have a direct relationship? Ideally they

should, I believe.

>

>> Just FYI, there is a nice short-cut to computing the BF for hypothesis

>> testing of a prior belief against the posterior belief (i.e. how much

>> did our certainty in a particular hypothesis change by seeing the

>> data) called the Savage-Dickey ratio test on which EJ Wagenmakers has

>> a paper (Wagenmakers et al 2010). Essentially, if one wants to test if

>> mu==5 the computation of the corresponding BF becomes: BF =

>> posterior(mu=5) / prior(mu=5). I have some code if you are interested.

>

> Always interested. Please forward.

I haven't tested this in a while, and I cleaned it up a little, so I
hope it works. But it should be fairly self-explanatory:

import numpy as np
import scipy.interpolate

def interpolate_trace(x, trace, range=(-1, 1), bins=100):
    """Create a histogram over a trace and interpolate to get a
    smoothed distribution."""
    x_histo = np.linspace(range[0], range[1], bins)
    histo = np.histogram(trace, bins=bins, range=range, normed=True)[0]
    interp = scipy.interpolate.InterpolatedUnivariateSpline(x_histo, histo)(x)
    return interp

def savage_dickey(pos, post_trace, range=(-1, 1), bins=100,
                  prior_trace=None, prior_y=None):
    """Calculate the Savage-Dickey density ratio test, see Wagenmakers et
    al. 2010 at http://dx.doi.org/10.1016/j.cogpsych.2009.12.001

    Arguments:
    **********
    pos<float>: position at which to calculate the Savage-Dickey ratio
        (i.e. the specific hypothesis you want to test)
    post_trace<numpy.array>: trace of the posterior distribution

    Keyword arguments:
    ******************
    prior_trace<numpy.array>: trace of the prior distribution
    prior_y<numpy.array>: prior density at each point (must match
        range and bins)
    range<(int,int)>: range over which to interpolate and plot
    bins<int>: number of bins to compute the histogram over

    IMPORTANT: Supply either prior_trace or prior_y.
    """
    x = np.linspace(range[0], range[1], bins)
    if prior_trace is not None:
        # Prior is provided as a trace -> histogram + interpolate
        prior_pos = interpolate_trace(pos, prior_trace, range=range, bins=bins)
    elif prior_y is not None:
        # Prior is provided as a density at each point -> interpolate
        # to retrieve the density at pos
        prior_pos = scipy.interpolate.InterpolatedUnivariateSpline(x, prior_y)(pos)
    else:
        raise ValueError("Supply either the prior_trace or prior_y "
                         "keyword argument")
    # Histogram and interpolate posterior trace at SD position
    posterior_pos = interpolate_trace(pos, post_trace, range=range, bins=bins)
    # Calculate Savage-Dickey density ratio at pos
    sav_dick = posterior_pos / prior_pos
    return sav_dick

Feb 16, 2011, 6:00:00 PM2/16/11

to PyMC

> Hi Chris,

>

> thanks, I haven't seen that blog post. One question about that. How

> would you have calculated the BF if you had created two separate

> models instead of using the Categorial?

>

Chris,

Sort of an extension of Thomas's question, but would your categorical

variable approach work at all if you were comparing models that did

not share parameters? It seems to me that it would not, since your

likelihood function only depends on the parameters of the model

currently indicated by true_model.

I threw together some code to test my previous suggestion of tracking

each model's likelihood with a deterministic node. It appears to work

in pymc, but even with 50,000 samples or so, the Bayes factor is not

very repeatable. Probably this approach is ill-conditioned due to

taking the mean of likelihood values, which tend to be very small.

Maybe there is a way of working directly with the log-likelihood

values?
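One possibility (just a sketch, not tested against the models above) is the log-sum-exp trick: subtract the maximum log-likelihood before exponentiating, so that the mean is computed without underflow:

```python
import numpy as np

def log_mean_exp(loglikes):
    """Compute log(mean(exp(loglikes))) without underflow."""
    loglikes = np.asarray(loglikes, dtype=float)
    m = loglikes.max()
    return m + np.log(np.mean(np.exp(loglikes - m)))

def log_bayes_factor(loglikes_1, loglikes_0):
    """log BF from traces of per-model log-likelihood values."""
    return log_mean_exp(loglikes_1) - log_mean_exp(loglikes_0)
```

For example, log_mean_exp applied to two log-likelihoods of -1000 returns exactly -1000, whereas np.mean(np.exp(...)) would underflow to zero.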

John

Feb 16, 2011, 6:42:44 PM2/16/11

to PyMC

On Feb 16, 5:00 pm, John McFarland <mcfar...@gmail.com> wrote:

> I threw together some code to test my previous suggestion of tracking

> each model's likelihood with a deterministic node. It appears to work

> in pymc, but even with 50,000 samples or so, the Bayes factor is not

> very repeatable. Probably this approach is ill-conditioned due to

> taking the mean of likelihood values, which tend to be very small.

> Maybe there is a way of working directly with the log-likelihood

> values?

>

> John

*Correction*: Actually, the Bayes factor does converge just fine with

this approach (I had a bug where I was computing it based on the

priors for theta instead of posteriors...) However, it is

significantly different than what is obtained using Chris's original

code in which the models share a common variable. Chris reported a

Bayes factor of about 14 in favor of the geometric model. By

comparing the likelihoods directly, with separate parameters for each

model, I obtain a Bayes factor of about 5 in favor of the geometric

model.

One thing I notice is that if each model is given its own parameter,

their posterior distributions are quite different (with the mean for

the geometric model showing much more posterior variance). To me,

this suggests that using the same parameter for both models is not

appropriate...

Code is below:

from pymc import Uniform, Lambda, geometric_like, poisson_like
from numpy import exp, log, array, ones
import pymc

# Data
Y = array([0,1,2,3,8])

# Poisson mean
mu_p = Uniform('mu_p', lower=0, upper=1000, value=4)
# Geometric mean
mu_g = Uniform('mu_g', lower=0, upper=1000, value=4)
# Geometric probability
p = Lambda('p', lambda mu_g=mu_g: 1/(1+mu_g))

Y_p = pymc.Poisson('Y_p', mu_p, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

@pymc.deterministic
def GL(p=p):
    return exp(geometric_like(Y+1, p))

@pymc.deterministic
def PL(mu=mu_p):
    return exp(poisson_like(Y, mu))

S = pymc.MCMC()
S.sample(10000,500,2)

print "BF:", S.GL.stats()['mean'] / S.PL.stats()['mean']

Feb 16, 2011, 9:17:39 PM2/16/11

to PyMC

Interesting approach, trying to calculate BF without model

probabilities, but I'm not sure you are getting Bayes factors there.

In fact, I cannot replicate my result using a shared mean -- I get a

value of about 6.9, which is way too low. I'll keep poking around.


Feb 16, 2011, 9:28:05 PM2/16/11

to PyMC

One issue is that you are supposed to use the harmonic mean (see Kass
and Raftery section 4.3). If you do this:

print "BF:", scipy.stats.hmean(S.trace('GL')[:]) / scipy.stats.hmean(S.trace('PL')[:])

you get a BF of around 11 for the common mean.
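For reference, here is a standalone sketch of that harmonic mean estimator with made-up likelihood traces (the numbers are illustrative only, not traces from this model):

```python
import numpy as np
from scipy import stats

# toy posterior traces of the per-model likelihood values (made up)
GL_trace = np.array([2e-8, 3e-8, 2.5e-8])   # geometric model
PL_trace = np.array([1e-9, 2e-9, 3e-9])     # Poisson model

# harmonic-mean estimates of the two marginal likelihoods
# (Kass and Raftery, section 4.3) and the resulting Bayes factor
BF = stats.hmean(GL_trace) / stats.hmean(PL_trace)
```

Note that stats.hmean requires strictly positive values; with likelihoods this small it may be safer to work with log-likelihoods.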

cf

Feb 16, 2011, 9:37:24 PM2/16/11

to PyMC

On Feb 16, 8:28 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:

> One issue is that you are supposed to use the harmonic mean (see Kass

> and Raftery section 4.3). If you do this:

>

> print "BF:", scipy.stats.hmean(S.trace('GL')[:]) /

> stats.hmean(S.trace('PL')[:])

>

> you get a BF of around 11 for the common mean.

Second issue: you need PL and GL to include the prior for mu, since
they are included in the integral for the marginal likelihood. If you

do this, you get a BF of around 13, just like my original code. Here's

the whole model:

mu = Uniform('mu', lower=0, upper=1000, value=4)
# Geometric probability
p = Lambda('p', lambda mu=mu: 1/(1+mu))

Y_p = pymc.Poisson('Y_p', mu, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

@pymc.deterministic
def GL(p=p):
    return exp(geometric_like(Y+1, p) + uniform_like(mu, 0, 1000))

@pymc.deterministic
def PL(mu=mu):
    return exp(poisson_like(Y, mu) + uniform_like(mu, 0, 1000))

S = pymc.MCMC()
S.sample(10000,5000,1)

print "BF:", stats.hmean(S.trace('GL')[:]) / stats.hmean(S.trace('PL')[:])

Feb 16, 2011, 11:03:44 PM2/16/11

to PyMC

Chris, I don't understand what you mean by "without model
probabilities." The Bayes factor is just a ratio of marginal

likelihoods, right? (http://en.wikipedia.org/wiki/Bayes_factor)

Feb 16, 2011, 11:09:33 PM2/16/11

to PyMC

My approach is simply to
compute the Bayes factor based on its mathematical definition. The

marginal likelihood is the integration of [P(D|theta_i, M_i) *

P(theta_i|M_i)] with respect to theta_i, which is by definition the

expectation of P(D|theta_i, M_i) with respect to the distribution

P(theta_i|M_i). I am just taking P(theta_i|M_i) to be the posterior

distribution for theta_i under M_i.

Feb 16, 2011, 11:17:30 PM2/16/11

to PyMC

On Feb 16, 8:37 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:

> On Feb 16, 8:28 pm, Chris Fonnesbeck <fonnesb...@gmail.com> wrote:

>

> > One issue is that you are supposed to use the harmonic mean (see Kass

> > and Raftery section 4.3). If you do this:

>

> > print "BF:", scipy.stats.hmean(S.trace('GL')[:]) /

> > stats.hmean(S.trace('PL')[:])

>

> > you get a BF of around 11 for the common mean.

>

> Second issue: you need PL and GL to include the prior for mu, since

> they are included in the integral for the marginal likelihood. If you

> do this, you get a BF of around 13, just like my original code. Here's

> the whole model:

>

Chris, I don't agree with this. See my previous post, but
mathematically the integral of [H(x)p(x)] with respect to the
parameter x is the expectation of H(x). In this context, the
marginal likelihood, L(Y|M_i), is the expectation of the conditional
likelihood, L(Y|theta_i,M_i), over the distribution p(theta_i|M_i).
The code you have posted is instead computing the expectation (in a
"harmonic" sense, which I'm still unclear on) of
[L(Y|theta_i,M_i)*Prior(theta_i)], which is mathematically
Int[L(Y|theta_i,M_i)*Prior(theta_i)*Posterior(theta_i)]d_theta_i,
since your Monte Carlo samples are drawn from the posterior for
theta_i.

Feb 16, 2011, 11:52:06 PM2/16/11

to PyMC

On Feb 16, 10:03 pm, John McFarland <mcfar...@gmail.com> wrote:

> Chris, I don't understand what you mean "without model

> probabilities." The Bayes factor is just a ratio of marginal

> likelihoods, right? (http://en.wikipedia.org/wiki/Bayes_factor)

As I noted earlier in the thread, the BF is defined as the
posterior odds divided by the prior odds, which involves model

probabilities.

Feb 17, 2011, 12:00:12 AM2/17/11

to PyMC

On Feb 16, 10:17 pm, John McFarland <mcfar...@gmail.com> wrote:

> Chris, I don't agree with this. See my previous post, but

> mathematically the integral of [H(x)p(x)] with respect to the

> parameter x is the expectation of H(x).

Yes, you're right. Well, I'm stumped then. Calculating the integral

directly should result in the same value as dividing the posterior

odds by the prior odds, but they do not appear to.

Feb 17, 2011, 12:06:05 AM2/17/11

to PyMC

Which results are you comparing against mine? The code you originally
posted doesn't give the same Bayes factor because by using a common

mu, I believe you aren't actually calculating the correct P(theta_i|

M_i) for each model. This is what I meant when I pointed out that if

you look at the posterior for mu_p and mu_g, separately under each of

the Poisson and geometric models, those posterior distributions are

different from each other, and so also different from the mixed (?)

posterior distribution for mu you formulated in your original code, in

which you only have one mu.

Feb 17, 2011, 12:08:54 AM2/17/11

to PyMC

> Which results are you comparing against mine? The code you originally

> posted doesn't give the same Bayes factor because by using a common

> mu, I believe you aren't actually calculating the correct P(theta_i|

> M_i) for each model. This is what I meant when I pointed out that if

> you look at the posterior for mu_p and mu_g, separately under each of

> the Poisson and geometric models, those posterior distributions are

> different from each other, and so also different from the mixed (?)

> posterior distribution for mu you formulated in your original code, in

> which you only have one mu.

I'm comparing my (that is, Link and Barker's) model with a common mu

to your model modified to have a common mu. I can get it close, using

the harmonic mean, but not up to 13.

Feb 17, 2011, 9:08:34 AM2/17/11

to PyMC

Chris: I put together some code that attempts to
recreate your categorical modelling (in order to define the Bayes factor

in terms of model probabilities), but I used separate mu parameters.

My approach was to create separate likelihood functions for Y|mu_m,

Y_mu_g, and Y|M, because we want the posterior distribution for the

parameters to be computed independently of which model is selected by

true_model. I guessed that the way to have the Y_M likelihood not try

and update mu_m and mu_g was to use ".value" to grab their current

values instead of giving them as arguments of the Y_M stochastic

(correct me if I'm off base here).

Appears to work, giving a Bayes factor of about 5, which matches my

marginal likelihood code...

# Data
Y = array([0,1,2,3,8])
pi = (0.1, 0.9)

# Poisson mean
mu_p = Uniform('mu_p', lower=0, upper=1000, value=4)
# Geometric mean
mu_g = Uniform('mu_g', lower=0, upper=1000, value=4)
# Geometric probability
p = Lambda('p', lambda mu=mu_g: 1/(1+mu))

Y_p = pymc.Poisson('Y_p', mu_p, value=Y, observed=True)
Y_g = pymc.Geometric('Y_g', p, value=Y+1, observed=True)

# Index to true model
true_model = Categorical("true_model", p=pi, value=0)

@observed
def Y_M(value=Y, M=true_model):
    """Either Poisson or geometric, depending on M"""
    return (geometric_like(value+1, p.value)*(M==0) or
            poisson_like(value, mu_p.value))

S = pymc.MCMC()
S.sample(20000,500,2)

post_geometric = 1.0 - S.true_model.stats()['mean']
print "BF:", post_geometric/(1-post_geometric) / (pi[0] / pi[1])
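The last two lines just implement BF = (posterior odds) / (prior odds); as a standalone sanity check with made-up numbers:

```python
# hypothetical values: prior P(M_geo) = 0.1, posterior P(M_geo) = 0.35
prior_g, post_g = 0.1, 0.35

posterior_odds = post_g / (1.0 - post_g)
prior_odds = prior_g / (1.0 - prior_g)
BF = posterior_odds / prior_odds   # evidence in favor of the geometric model
```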

Feb 17, 2011, 9:39:21 AM2/17/11

to py...@googlegroups.com

Hi Chris,

First of all, thanks a lot for spending time on my problem. I read
your blog post with much interest and I think I mostly understand
the approach you explain for computing the Bayes factor.

My approach, correct or not, was to apply the definition of the
Bayes factor as the ratio of the data likelihoods under different
models and to estimate p(D|model_i) from traces of the MCMC sampling
of the parameters (the 'p's, as I called them in my coins problem),
either with the mean or, even better, with the harmonic mean. My idea
was analogous to the one John posted in this same thread. Please
correct me if my implementation does not reflect this idea. I am
still learning! :-) In case my approach is not wrong, do you have
any clues about the instability of B_10? My latest guess is that I
just need to sample much more.

I have a question about your blog post: why should the two models
share the mu node? I see that Thomas and John asked related
questions, but I can't clearly see the motivation for this exact
choice. Should I expect different results when using two different
'mu's?

Best,

Emanuele

Feb 17, 2011, 10:11:16 AM2/17/11

to py...@googlegroups.com

This is a very interesting thread, thanks a lot to all of you for

explanations and examples.

I attempted a simple Monte Carlo to understand what Bayes factor

to expect. I get a result between 13 and 14 (using two different 'mu's,

but even using one the result does not change).

The very few lines of code are attached. It requires ~10 sec. to

run the 1e6 iterations on my slow laptop.

Best,

Emanuele

Feb 17, 2011, 10:40:22 AM2/17/11

to PyMC


Emanuele,

Your code computes the Bayes factor under the prior distribution for

the model parameters, which is probably not what you want.

John

Feb 17, 2011, 10:52:41 AM2/17/11

to PyMC

On Feb 17, 8:39 am, Emanuele Olivetti <emanu...@relativita.com> wrote:

> Hi Chris,

>

> First of all thanks a lot for spending time on my problem. I read

> you blog post with much interest and I think I mostly understand

> the way you explain for computing the Bayes factor.

>

> My approach, correct or not, was to apply the definition of

> Bayes factor as ratio of the data likelihoods under different models

> and to estimate p(D|model_i) from traces of the MCMC sampling

> of parameters ('p's, as I called them in my coins problem) either

> with mean or even better with harmonic mean. My idea was

> analogous to that of John posted in this same thread. Please correct

> me if my implementation does not reflect this idea. I am still

> learning! :-) . In case my approach is not wrong, have you any

> clues for the instability of B_10? My latests guess is that I just

> need to sample much more.

Emanuele: Chris may have some additional insight here, but from
looking quickly at your code, I *think* one problem is that you need

to account for the likelihood of the p_i values under each Beta

distribution model. Some quick trace plots seem to indicate that the

posterior distributions for p_i are basically the same under each

model. The difference in support under M0 and M1 should come from the

fact that the p_i's you are seeing are more likely under M1 than M0.

Specifically, I think you need to add this likelihood calculation in

your function posterior_likelihood_heads.

>

> I have a question about your blog post: why the two models should

> share the mu node? I see that Thomas and John asked related

> questions but can't see clearly the motivation of this exact one.

> Do I have to expect different results when using two different 'mu's?

>

> Best,

>

> Emanuele

Chris may have a different opinion here, but I argue that the use of
the same mu node is simply incorrect. If you're unsure, I would

suggest considering more general cases: for example, what if you have

two models that can't be represented with the same parameters (for

example, a normal distribution and a triangular distribution)?

Feb 17, 2011, 12:01:42 PM2/17/11

to PyMC

On Feb 17, 9:52 am, John McFarland <mcfar...@gmail.com> wrote:

> Chris may have a different opinion here, but I argue that the use of

> the same mu node is simply incorrect. If you're unsure, I would

> suggest considering more general cases: for example, what if you have

> two models that can't be represented with the same parameters (for

> example, a normal distribution and a triangular distribution)?

I do disagree with this. It may be a better model to have separate mu

values, but there is nothing inherently wrong with specifying them to

be the same. The unknown model that generated the data has an expected

value, and the MLE of that expected value will be the mean of the

data, at least for these sorts of simple models. Your comparison with

a triangular distribution is a red herring -- just because you can't

express its parameters as an expected value does not mean that the

expected values under the normal and the triangular will not be the

same.

cf

Feb 17, 2011, 12:03:44 PM2/17/11

to py...@googlegroups.com

Why not? According to the definition the Bayes factor is the ratio

of the densities p(data|model_1)/p(data|model_2) obtained by

integrating over the parameter space:

p(data|model_k) = \int p(data|\theta_k, model_k) \pi(\theta_k|model_k) d\theta_k

where \pi(\theta_k|model_k) is its prior density (see for example Kass and

Raftery JASA 1995).

The basic way to estimate p(data|model_k) is by simple Monte Carlo

estimation:

p(data|model_k) = 1/m \sum_{i=1}^m p(data|\theta^i)

where {\theta^i}_i is a sample from the prior distribution

(see for example Section 4.2 of the paper mentioned above).

My understanding is that my implementation follows directly

from the Bayes factor definition.
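To make the estimator concrete, here is a minimal sketch (my own toy example, not Emanuele's code) that checks the simple Monte Carlo estimate against a case where the marginal likelihood is known in closed form:

```python
# Hypothetical toy check of the simple Monte Carlo estimator
# p(data|model) ~= 1/m * sum_i p(data|theta^i), theta^i ~ prior.
# Model: h ~ Binomial(n, p) with prior p ~ Beta(a, b); the exact
# marginal is the Beta-binomial pmf, so we can verify the estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, h = 50, 32          # tosses and observed heads (made-up data)
a, b = 2.0, 2.0        # Beta(a, b) prior on the head probability

# Simple MC: average the likelihood over draws from the prior.
theta = rng.beta(a, b, size=100_000)
ml_mc = stats.binom.pmf(h, n, theta).mean()

# Exact marginal likelihood for comparison.
ml_exact = stats.betabinom.pmf(h, n, a, b)
print(ml_mc, ml_exact)  # the two should agree to a few digits
```

Note the averaging is over the *prior*, not the posterior, which is exactly the point debated below.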

Best,

Emanuele

Feb 17, 2011, 12:15:28 PM2/17/11

to PyMC

Chris, the posterior distributions for mu_p and mu_g simply aren't the

same here (PyMC demonstrates this easily: plot the posteriors for mu_p

and mu_g in my example code). Estimation of mu is not independent of

the statistical model employed to describe the data.

Admittedly, my normal/triangular example is not a great one. Better

examples arise if we consider regression models. For example,

consider calculation of a Bayes factor for two models for E[Y|x] for a

covariate x: M0: alpha*x, and M1: beta*x**2. Each model has one

parameter, but the parameters simply aren't related to each other. As

the models for E[Y|x] become more complex, this becomes more and more

evident.

Feb 17, 2011, 12:23:28 PM2/17/11

to PyMC

On Feb 17, 11:03 am, Emanuele Olivetti <emanu...@relativita.com> wrote:

If I get a chance, I will try to track down this reference. However,

my understanding is that p(theta_i|model_i) is simply any distribution

for theta_i that you choose; ideally, it represents your present state

of knowledge, which in a Bayesian sense is the posterior distribution

for theta_i under model_i. If you average the likelihood with respect

to the prior, your theta_i values aren't informed by the data. This

would tell you which model better supports the data, under their prior

distributions for theta_i. However, since you would more likely apply

the models for prediction using posterior distributions for theta_i, I

would be more interested to know which model better supports the data

under the posterior distributions for theta_i.


Feb 17, 2011, 2:28:44 PM2/17/11

to PyMC

On Feb 17, 11:23 am, John McFarland <mcfar...@gmail.com> wrote:

>

> If I get a chance, I will try to track down this reference. However,

> my understanding is that p(theta_i|model_i) is simply any distribution

> for theta_i that you choose; ideally, it represents your present state

> of knowledge, which in a Bayesian sense is the posterior distribution

> for theta_i under model_i. If you average the likelihood with respect

> to the prior, your theta_i values aren't informed by the data. This

> would tell you which model better supports the data, under their prior

> distributions for theta_i. However, since you would more likely apply

> the models for prediction using posterior distributions for theta_i, I

> would be more interested to know which model better supports the data

> under the posterior distributions for theta_i.

Well, some quick literature review and I need to apologize Emanuele:

it looks like I was wrong and p(theta_i|model_i) is interpreted as a

prior distribution for theta_i. Sorry for all the confusion I'm

causing here. Conceptually it is a little difficult for me to accept

measuring the models based on prior distributions for theta, since I

have a preference for vague/noninformative/reference priors, which

appear to give rise to various problems when used in conjunction with

the calculation of Bayes factors.

See, for example, Wasserman (2000; discussion around Eq. (24)), where

he states that the use of noninformative prior distributions gives

rise to arbitrary constants in the Bayes factor definition. He gives

an approximation (Eq. 25) for the marginal likelihoods that does not

depend on the prior for theta.

Maybe we can at least agree that calculation of Bayes factors is

trickier than it first appears...

Wasserman, L., "Bayesian Model Selection and Model Averaging," Journal

of Mathematical Psychology, Vol. 44, No. 1, 2000.

Kass, R. E. and Wasserman, L., "A reference Bayesian test for nested

hypotheses and its relationship to the Schwarz criterion," Journal of

the American Statistical Association, Vol. 90, 1995, pp. 928-934.

Feb 18, 2011, 4:38:34 AM2/18/11

to py...@googlegroups.com

On 02/17/2011 08:28 PM, John McFarland wrote:

> Well, some quick literature review and I need to apologize Emanuele:

> it looks like I was wrong and p(theta_i|model_i) is interpreted as a

> prior distribution for theta_i. Sorry for all the confusion I'm

> causing here. Conceptually it is a little difficult for me to accept

> measuring the models based on prior distributions for theta, since I

> have a preference for vague/noninformative/reference priors, which

> appear to give rise to various problems when used in conjunction with

> the calculation of Bayes factors.

>

> See, for example, Wasserman (2000; discussion around Eq. (24)), where

> he states that the use of noninformative prior distributions gives

> rise to arbitrary constants in the Bayes factor definition. He gives

> an approximation (Eq. 25) for the marginal likelihoods that does not

> depend on the prior for theta.

>

> Maybe we can at least agree that calculation of Bayes factors is

> trickier than it first appears...

>

> Wasserman, L., "Bayesian Model Selection and Model Averaging," Journal

> of Mathematical Psychology, Vol. 44, No. 1, 2000.

>

> Kass, R. E. and Wasserman, L., "A reference Bayesian test for nested

> hypotheses and its relationship to the Schwarz criterion," Journal of

> the American Statistical Association, Vol. 90, 1995, pp. 928-934.


John,

No need to apologize! The discussion is extremely interesting.

Indeed the choice of noninformative prior is tricky, in the sense

that it may not be clear what prior is really noninformative. In

my initial post I mentioned this paper by Gelman

Prior distributions for variance parameters in hierarchical

models, Gelman, Bayesian Analysis, 2006.

http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

which discusses the choice of the prior for the variance of the

population in hierarchical models and provides very interesting

recommendations for noninformative priors. For example the paper

shows that commonly used priors (gamma etc.) are not appropriate.

It is my understanding that the problem of the batch of coins falls

within the scope of this paper. [The problem proposed by Chris in

his blog post skips this issue.]

Thanks for the two references by Wasserman and Kass, I did not

know about them. Since I am spending quite a bit of time on the

Bayes factor lately good references are always warmly welcome!

For whoever is interested in the topic, here you can freely download

the paper about the Bayes factor I keep mentioning:

Bayes Factors, Kass and Raftery, JASA, 1995

http://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf

which seems to be the most famous one on this topic, at least

according to google scholar.

Best,

Emanuele

Feb 18, 2011, 9:39:22 AM2/18/11

to PyMC

On Feb 17, 11:15 am, John McFarland <mcfar...@gmail.com> wrote:

> Chris, the posterior distributions for mu_p and mu_g simply aren't the

> same here (PyMC demonstrates this easily: plot the posteriors for mu_p

> and mu_g in my example code). Estimation of mu is not independent of

> the statistical model employed to describe the data.

Sorry if I sound argumentative about this. I agree that they are not

*independent*, but I am asserting that any two models that purport

to describe a particular dataset in a reasonable way should have

expected values that are similar, be that mu for a normal or alpha x

beta for a gamma, or whatever. If not, then I would expect at least

one of those models to have important lack of fit.

Mar 3, 2011, 10:37:11 AM3/3/11

to PyMC

Everyone's probably moved past this now, but having read through the

literature a bit more, it seems that the reason you cannot simply use

the MCMC samples to get at the Bayes factor is that the marginal

likelihood is integrated with respect to the *prior* distribution of

the parameters, not the posterior (as noted by Emanuele in the post

above). Hence, calculating geometric_like and poisson_like at each

iteration are not going to yield samples from the marginal likelihood

as required. Han and Carlin (2001) give a pretty good treatment of the

various approaches to calculating marginal likelihoods and BF. If

anyone is looking for a practical example, the "pines" example in

WinBUGS implements the Carlin and Chib approach described in the

paper. I'll try and get a PyMC equivalent going, for comparison.


Mar 3, 2011, 4:45:49 PM3/3/11

to py...@googlegroups.com

Hi Chris,

Thanks for the information and the reference. I did not move on from

this problem :-) but found a simple yet efficient solution to computing

the Bayes factor of the problem of the batch of coins via simple Monte Carlo

integration. I can post more details on that if someone is still interested.

I am still very interested in the many ways to estimate the marginal

likelihoods and the Bayes factor and will look for the "pines" example

you mention. Of course I'd be very happy to see the PyMC equivalent

because I know almost nothing about WinBUGS.

Best,

Emanuele

Mar 3, 2011, 5:28:03 PM3/3/11

to py...@googlegroups.com

> Hi Chris,

>

> Thanks for the information and the reference. I did not move on from

> this problem :-) but found a simple yet efficient solution to computing

> the Bayes factor of the problem of the batch of coins via simple Monte Carlo

> integration. I can post more details on that if someone is still interested.

>


Definitely still of interest.

> I am still very interested in the many ways to estimate the marginal

> likelihoods and the Bayes factor and will look for the "pines" example

> you mention. Of course I'd be very happy to see the PyMC equivalent

> because I know almost nothing about WinBUGS.

>

> Best,

>

> Emanuele

>


Mar 4, 2011, 10:04:21 AM3/4/11

to PyMC

It would also be interesting to see how the approximation given in Wasserman

(2000) compares, since it (a) doesn't depend on the prior and (b)

doesn't require integration (just finding the maximum likelihood

estimate), so it's very simple to compute.
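Assuming the approximation John refers to is the familiar Schwarz/BIC-style estimate, log p(data|M_k) ≈ log L(theta_hat_k) - (d_k/2) log n, a quick sketch for the Poisson-vs-geometric pair mentioned earlier in the thread would look like this (the toy data and variable names here are mine):

```python
# Prior-free approximation to the log marginal likelihood:
#   log p(data | M_k) ~= log L(theta_hat_k) - (d_k / 2) * log(n)
# with theta_hat_k the MLE and d_k the number of free parameters.
import numpy as np
from scipy import stats

data = np.array([2, 1, 0, 3, 1, 2, 4, 0, 1, 2, 1, 3])  # made-up counts
n = len(data)
d = 1  # one free parameter in each model

# Poisson: the MLE of the rate is the sample mean.
lam = data.mean()
loglik_pois = stats.poisson.logpmf(data, lam).sum()

# Geometric on {0, 1, 2, ...}: the MLE is p = 1 / (1 + mean).
p = 1.0 / (1.0 + data.mean())
loglik_geom = stats.geom.logpmf(data + 1, p).sum()  # shift to scipy's support {1, 2, ...}

log_ml_pois = loglik_pois - 0.5 * d * np.log(n)
log_ml_geom = loglik_geom - 0.5 * d * np.log(n)
print("approximate log Bayes factor (Poisson vs geometric):",
      log_ml_pois - log_ml_geom)
```

As advertised, no prior and no integration appear anywhere: only the two maximized log likelihoods and a complexity penalty.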

Mar 5, 2011, 5:06:13 PM3/5/11

to py...@googlegroups.com

On 03/03/2011 11:28 PM, Thomas Wiecki wrote:

>> Hi Chris,

>>

>> Thanks for the information and the reference. I did not move on from

>> this problem :-) but found a simple yet efficient solution to computing

>> the Bayes factor of the problem of the batch of coins via simple Monte Carlo

>> integration. I can post more details on that if someone is still interested.

> Definitely still of interest.

>


So here it is. I am spending time on the problem of the batch of coins

because it is a useful abstraction of a more concrete problem I am

working on. Here you can find the code that implements the computation

of the Bayes factor via simple Monte Carlo integration:

https://github.com/emanuele/Bayes-factor-multi-subject/blob/master/bayes_factor.py

There are a few little changes in the naming of the variables with

respect to the initial problem of the batch of coins I posted. So here

is a brief explanation in order to read the code:

N = number of coins

m = number of tosses of each coin

errors = it is what I called 'heads' in the example I posted, i.e. the number of

heads after m tosses of each coin. It is a 1D array of N integers.

Simple MC integration becomes viable because the true probability of

getting heads for each coin, i.e. 'p', can be integrated out if the model is:

heads~Bin(m,p)

p~Beta(a,b)

In this case the pmf of heads becomes "Beta-binomial":

http://en.wikipedia.org/wiki/Beta-binomial_distribution

(Btw I saw that PyMC has an undocumented betabin_like ;-) ).

Since the coins are assumed to be independent of each other,

the marginal likelihood becomes the product of Beta-binomials, one

for each coin.
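A quick numerical check of the marginalization step (my own, not from the thread): integrating p out of Binomial(m, p) against a Beta(a, b) prior should reproduce the Beta-binomial pmf exactly.

```python
# Verify that int_0^1 Binom(h | m, p) Beta(p | a, b) dp = Betabin(h | m, a, b)
import numpy as np
from scipy import stats
from scipy.integrate import quad

m, h = 50, 18       # tosses and a head count (arbitrary)
a, b = 3.0, 4.0     # arbitrary Beta parameters

integrand = lambda p: stats.binom.pmf(h, m, p) * stats.beta.pdf(p, a, b)
marginal, _ = quad(integrand, 0.0, 1.0)
print(marginal, stats.betabinom.pmf(h, m, a, b))  # should match closely
```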

As mentioned in my previous posts I assume these noninformative

priors for the mean (mu) and std (sigma) of the population

of coins under the two hypotheses:

H_1: mu~Uniform(0,0.5) , sigma~Uniform(0, sqrt(mu*(1-mu)))

H_0: mu=0.5 , sigma~Uniform(0,0.5)

Note: the uniform prior of sigma under H_1 is based on [0].

Sampling from those models/hypotheses is easy and inexpensive. Then I

compute (a, b) from (mu, sigma) - see code, and finally compute

the marginal likelihoods via simple MC:

p(heads|H_j) = 1/K \sum_k \prod_{i=1...N} Betabin(heads[i]|m,a_j^k,b_j^k)

where {(a_1^k,b_1^k)}_k is a sample of size K from H_1 and

{(a_0^k,b_0^k)}_k is a sample of size K from H_0.

As you can see the code executes in 1-2 sec. since 10^5 iterations are

enough to get a sufficiently stable Bayes factor B_10 (~=2.5 in the example

of the code).
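For readers who don't want to chase the github link, here is a self-contained sketch of the computation just described (the toy head counts, helper names, and seed are mine, not the actual script):

```python
# Simple Monte Carlo estimate of the batch-of-coins Bayes factor B_10.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, m = 10, 50
heads = np.array([27, 22, 25, 19, 24, 23, 21, 26, 20, 24])  # toy data
K = 100_000  # number of prior draws

def ab_from_mu_sigma(mu, sigma):
    """Convert a Beta mean/std pair into the (a, b) parameters."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0  # this is a + b
    return mu * nu, (1.0 - mu) * nu

def marginal_likelihood(mu, sigma):
    """MC estimate of p(heads|H): average, over prior draws, of the
    product of Beta-binomial pmfs (the coins are independent)."""
    a, b = ab_from_mu_sigma(mu, sigma)
    # shape (K, N): pmf of each coin's head count under each draw
    pmf = stats.betabinom.pmf(heads[None, :], m, a[:, None], b[:, None])
    return pmf.prod(axis=1).mean()

# H_1: mu ~ Uniform(0, 0.5), sigma ~ Uniform(0, sqrt(mu*(1-mu)))
mu1 = rng.uniform(0.0, 0.5, K)
sigma1 = rng.uniform(0.0, np.sqrt(mu1 * (1.0 - mu1)))
# H_0: mu = 0.5, sigma ~ Uniform(0, 0.5)
mu0 = np.full(K, 0.5)
sigma0 = rng.uniform(0.0, 0.5, K)

B_10 = marginal_likelihood(mu1, sigma1) / marginal_likelihood(mu0, sigma0)
print("B_10 =", B_10)
```

The (mu, sigma) -> (a, b) conversion uses the Beta moments mu = a/(a+b) and sigma^2 = mu(1-mu)/(a+b+1); the priors above guarantee sigma^2 < mu(1-mu), so a and b stay positive.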

Any question/comment/criticism is warmly welcome.

Best,

Emanuele

[0]: Prior distributions for variance parameters in hierarchical models, Gelman, Bayesian Analysis, 2006.
