I am investigating hypothesis testing with the Bayes factor. I have
two competing models, M0 and M1. The Bayes factor B_10 =
p(observations|M1)/p(observations|M0) quantifies the evidence in favor
of M1 relative to M0 [0]. I've implemented the models with PyMC, but I
am getting unstable, puzzling results, which is why I am asking for
feedback on the implementation as well as on the whole idea.
Here is a short abstract description of the problem: there is a coin
factory which is suspected of producing biased/loaded coins. For this
specific problem, assume that any bias is expected to occur towards
one specific side of the coin only, say tails. The two competing
hypotheses are: H0 = the factory produces fair coins; H1 = the factory
produces coins that turn up tails too frequently, so they are not
fair.
In order to test the hypotheses we get N coins from the factory, say
N=10, and we toss each coin n times, say n=50. The observations
{h_1,...,h_N} are then the N numbers of heads obtained by tossing each
of the N coins n times. Let p_i be the (true) probability of getting
heads from the i-th coin. The two competing models/hypotheses I want
to test are basically Beta-Binomial models with different priors:
M0/H0 : the number of heads h_i of each coin (i=1...N) is
Binomial(n, p_i) distributed. All {p_i}_i are drawn from the same
Beta distribution. The mean 'mu' of the Beta is exactly 0.5 (i.e. it is
expected that p_i=0.5, i.e. the number of heads equals the number of
tails, on average) and the variance sigma^2 is unknown. As far as I
read from Gelman 2006 [1], a uniform prior on sigma (not sigma^2)
should be a good choice for a non-informative prior on the variance.
M1/H1 : the number of heads h_i of each coin (i=1...N) is
Binomial(n, p_i) distributed. All {p_i}_i are drawn from the same
Beta distribution. The mean 'mu' of the Beta lies in [0, 0.5) (i.e. it
is expected that p_i<0.5, i.e. the number of tails is greater than that
of heads) and the variance sigma^2 is unknown. Non-informative priors
for mu and sigma are again uniform distributions.
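To make the two models concrete, here is a minimal sketch of their generative process in NumPy. Note a couple of assumptions of mine: the upper bound sqrt(mu*(1-mu)) on sigma under M1 is the largest standard deviation a Beta distribution with mean mu allows, and the 0.01 lower bounds are only a numerical guard, not part of the stated priors:

```python
import numpy as np

rng = np.random.default_rng(42)
N, n = 10, 50  # number of coins, tosses per coin

def sample_heads(mu, sigma, N=N, n=n, rng=rng):
    """Draw p_i ~ Beta(a, b) with the given mean/std, then h_i ~ Binomial(n, p_i)."""
    # Moment matching: recover the Beta shape parameters (a, b) from (mu, sigma).
    nu = mu * (1.0 - mu) / sigma**2 - 1.0
    a, b = mu * nu, (1.0 - mu) * nu
    p = rng.beta(a, b, size=N)
    return rng.binomial(n, p)

# M0/H0: mu fixed at 0.5, sigma drawn uniformly below its valid bound
heads_h0 = sample_heads(0.5, rng.uniform(0.01, 0.5))

# M1/H1: mu uniform on [0, 0.5), sigma uniform on (0, sqrt(mu*(1-mu)))
mu1 = rng.uniform(0.01, 0.5)
heads_h1 = sample_heads(mu1, rng.uniform(0.01, np.sqrt(mu1 * (1.0 - mu1))))
```

Under M1 the simulated head counts tend to fall below n/2, which is the pattern the Bayes factor is supposed to detect.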
Attached you can find an implementation of the two models and an
example with the computation of the Bayes factor. In that example the
coins turn up heads visibly less than half of the time, so the Bayes
factor B_10 should be quite large. Unexpectedly it is low and unstable,
like B_10 < 3 (and sometimes below 1), despite the generous number of
sampling iterations.
Best,
Emanuele
[0]: Bayes Factors, Kass and Raftery, JASA, 1995.
[1]: Prior distributions for variance parameters in hierarchical
models, Gelman, Bayesian Analysis, 2006.
Thanks, I hadn't seen that blog post. One question about it: how
would you have calculated the BF if you had created two separate
models instead of using the Categorical?
Just FYI, there is a nice short-cut to computing the BF for hypothesis
testing of a prior belief against the posterior belief (i.e. how much
did our certainty in a particular hypothesis change by seeing the
data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
mu==5 the computation of the corresponding BF becomes: BF =
posterior(mu=5) / prior(mu=5). I have some code if you are interested.
-Thomas
> --
> You received this message because you are subscribed to the Google Groups "PyMC" group.
> To post to this group, send email to py...@googlegroups.com.
> To unsubscribe from this group, send email to pymc+uns...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/pymc?hl=en.
>
>
Why would calculating the DIC be a problem when there are many models?
Also, do you think that the DIC (when creating separate models) and the
BF (when creating one model) have a direct relationship? Ideally they
should, I believe.
>
>> Just FYI, there is a nice short-cut to computing the BF for hypothesis
>> testing of a prior belief against the posterior belief (i.e. how much
>> did our certainty in a particular hypothesis change by seeing the
>> data) called the Savage-Dickey ratio test on which EJ Wagenmakers has
>> a paper (Wagenmakers et al 2010). Essentially, if one wants to test if
>> mu==5 the computation of the corresponding BF becomes: BF =
>> posterior(mu=5) / prior(mu=5). I have some code if you are interested.
>
> Always interested. Please forward.
I haven't tested this in a while, but I cleaned it up a little; I hope
it works. It should be fairly self-explanatory:
import numpy as np
import scipy.interpolate


def interpolate_trace(x, trace, range=(-1, 1), bins=100):
    """Create a histogram over a trace and interpolate to get a
    smoothed distribution."""
    x_histo = np.linspace(range[0], range[1], bins)
    histo = np.histogram(trace, bins=bins, range=range, density=True)[0]
    return scipy.interpolate.InterpolatedUnivariateSpline(x_histo, histo)(x)


def savage_dickey(pos, post_trace, range=(-1, 1), bins=100,
                  prior_trace=None, prior_y=None):
    """Calculate the Savage-Dickey density ratio test, see Wagenmakers et
    al. 2010 at http://dx.doi.org/10.1016/j.cogpsych.2009.12.001

    Arguments:
    **********
    pos<float>: position at which to calculate the Savage-Dickey ratio
        (i.e. the specific hypothesis you want to test)
    post_trace<numpy.array>: trace of the posterior distribution

    Keyword arguments:
    ******************
    prior_trace<numpy.array>: trace of the prior distribution
    prior_y<numpy.array>: prior density at each point (must match
        range and bins)
    range<(int,int)>: range over which to interpolate and plot
    bins<int>: number of bins to compute the histogram over

    IMPORTANT: supply either prior_trace or prior_y.
    """
    x = np.linspace(range[0], range[1], bins)
    if prior_trace is not None:
        # Prior is provided as a trace -> histogram + interpolate
        prior_pos = interpolate_trace(pos, prior_trace, range=range, bins=bins)
    elif prior_y is not None:
        # Prior is provided as a density at each point -> interpolate
        # to retrieve the density at pos
        prior_pos = scipy.interpolate.InterpolatedUnivariateSpline(x, prior_y)(pos)
    else:
        raise ValueError("Supply either the prior_trace or the prior_y "
                         "keyword argument")
    # Histogram and interpolate the posterior trace at the SD position
    posterior_pos = interpolate_trace(pos, post_trace, range=range, bins=bins)
    # Calculate the Savage-Dickey density ratio at pos
    return posterior_pos / prior_pos
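The Savage-Dickey idea is easy to sanity-check on a toy example where both densities are known in closed form. A minimal self-contained sketch, assuming a hypothetical N(0, 1) prior and N(1, 0.5) posterior for mu and testing mu == 0 (these values are illustration choices, not from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical illustration: prior mu ~ N(0, 1); suppose the posterior
# is (analytically) N(1, 0.5).  The Savage-Dickey BF for the point
# hypothesis mu == 0 is the posterior/prior density ratio at 0.
prior_trace = rng.normal(0.0, 1.0, size=200_000)
post_trace = rng.normal(1.0, 0.5, size=200_000)

def density_at(x, trace, lo=-4.0, hi=4.0, bins=200):
    """Histogram-based density estimate at a single point, in the same
    spirit as the interpolation helper above."""
    hist, edges = np.histogram(trace, bins=bins, range=(lo, hi), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return np.interp(x, centers, hist)

bf_01 = density_at(0.0, post_trace) / density_at(0.0, prior_trace)

# Exact ratio for comparison, from the closed-form normal densities.
exact = stats.norm.pdf(0.0, 1.0, 0.5) / stats.norm.pdf(0.0, 0.0, 1.0)
```

The histogram-based estimate converges to the exact density ratio as the traces grow, which is exactly what the spline-based version exploits.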
Hi Chris,
First of all, thanks a lot for spending time on my problem. I read
your blog post with much interest, and I think I mostly understand
the approach you explain for computing the Bayes factor.
My approach, correct or not, was to apply the definition of the
Bayes factor as the ratio of the data likelihoods under the different
models, and to estimate p(D|model_i) from the traces of the MCMC
sampling of the parameters (the 'p's, as I called them in my coins
problem), either with the mean or, even better, with the harmonic
mean. My idea is analogous to the one John posted in this same thread.
Please correct me if my implementation does not reflect this idea. I
am still learning! :-) In case my approach is not wrong, do you have
any clues about the instability of B_10? My latest guess is that I
just need to sample much more.
I have a question about your blog post: why should the two models
share the mu node? I see that Thomas and John asked related
questions, but I can't clearly see the motivation for this exact
choice. Should I expect different results when using two different
'mu's?
Best,
Emanuele
This is a very interesting thread, thanks a lot to all of you for
explanations and examples.
I attempted a simple Monte Carlo to understand what Bayes factor
to expect. I get a result between 13 and 14 (using two different 'mu's,
but even using one the result does not change).
The very few lines of code are attached. It requires ~10 sec. to
run the 1e6 iterations on my slow laptop.
Best,
Emanuele
Why not? According to the definition, the Bayes factor is the ratio
of the densities p(data|model_1)/p(data|model_2), each obtained by
integrating over the parameter space:

p(data|model_k) = \int p(data|\theta_k, model_k) \pi(\theta_k|model_k) d\theta_k

where \pi(\theta_k|model_k) is the prior density of \theta_k (see for
example Kass and Raftery, JASA, 1995).
The basic way to estimate p(data|model_k) is simple Monte Carlo
estimation:

p(data|model_k) ≈ 1/m \sum_{i=1}^m p(data|\theta_k^i, model_k)

where {\theta_k^i}_i is a sample from the prior distribution
(see for example Section 4.2 of the paper mentioned above).
My understanding is that my implementation follows directly
from the Bayes factor definition.
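This simple MC estimator is easy to sanity-check on a single coin, where the exact marginal likelihood is available in closed form as the Beta-Binomial pmf (the values a=b=2, n=50, h=18 below are arbitrary illustration choices of mine, not from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy check on a single coin: h heads in n tosses, prior p ~ Beta(a, b).
a, b, n, h = 2.0, 2.0, 50, 18

# Simple MC: average the binomial likelihood over draws from the prior.
p_samples = rng.beta(a, b, size=100_000)
ml_mc = stats.binom.pmf(h, n, p_samples).mean()

# Exact marginal likelihood: the Beta-Binomial pmf.
ml_exact = stats.betabinom.pmf(h, n, a, b)
```

With 10^5 prior draws the MC average lands within a fraction of a percent of the exact value, which is the behaviour the estimator above relies on.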
Best,
Emanuele
John,
No need to apologize! The discussion is extremely interesting.
Indeed, the choice of a noninformative prior is tricky, in the sense
that it may not be clear which prior really is noninformative. In
my initial post I mentioned this paper by Gelman
Prior distributions for variance parameters in hierarchical
models, Gelman, Bayesian Analysis, 2006.
http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf
which discusses the choice of the prior for the variance of the
population in hierarchical models and provides very interesting
recommendations for noninformative priors. For example the paper
shows that commonly used priors (gamma etc.) are not appropriate.
It is my understanding that the problem of the batch of coins falls
within the range of this paper. [The problem proposed by Chris in
his blog post skips this issue.]
Thanks for the two references by Wasserman and Kass, I did not
know about them. Since I am spending quite a bit of time on the
Bayes factor lately good references are always warmly welcome!
For whoever is interested in the topic, here you can freely download
the paper about the Bayes factor I keep mentioning:
Bayes Factors, Kass and Raftery, JASA, 1995
http://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf
which seems to be the most famous one on this topic, at least
according to Google Scholar.
Best,
Emanuele
Hi Chris,
Thanks for the information and the reference. I have not moved on from
this problem :-) but I found a simple yet efficient way to compute
the Bayes factor for the batch-of-coins problem via simple Monte Carlo
integration. I can post more details if someone is still interested.
I am still very interested in the many ways to estimate the marginal
likelihoods and the Bayes factor and will look for the "pines" example
you mention. Of course I'd be very happy to see the PyMC equivalent
because I know almost nothing about WinBUGS.
Best,
Emanuele
Definitely still of interest.
> I am still very interested in the many ways to estimate the marginal
> likelihoods and the Bayes factor and will look for the "pines" example
> you mention. Of course I'd be very happy to see the PyMC equivalent
> because I know almost nothing about WinBUGS.
>
> Best,
>
> Emanuele
>
So here it is. I am spending time on the problem of the batch of coins
because it is a useful abstraction of a more concrete problem I am
working on. Here you can find the code that implements the computation
of the Bayes factor via simple Monte Carlo integration:
https://github.com/emanuele/Bayes-factor-multi-subject/blob/master/bayes_factor.py
There are a few small changes in the naming of the variables with
respect to the initial batch-of-coins problem I posted, so here
is a brief explanation to help read the code:
N = number of coins
m = number of tosses of each coin
errors = what I called 'heads' in the example I posted, i.e. the number
of heads after m tosses of each coin; a 1D array of N integers.
Simple MC integration becomes viable because the true probability of
getting heads with each coin, i.e. 'p', can be integrated out if the
model is:
heads ~ Binomial(m, p)
p ~ Beta(a, b)
In this case the pmf of heads becomes Beta-Binomial:
http://en.wikipedia.org/wiki/Beta-binomial_distribution
(Btw, I saw that PyMC has an undocumented betabin_like ;-) ).
Since the coins are assumed to be independent of each other,
the marginal likelihood becomes a product of Beta-Binomials, one
for each coin.
As mentioned in my previous posts, I assume these noninformative
priors for the mean (mu) and std (sigma) of the population
of coins under the two hypotheses:

H_1: mu ~ Uniform(0, 0.5), sigma ~ Uniform(0, sqrt(mu*(1-mu)))
H_0: mu = 0.5, sigma ~ Uniform(0, 0.5)

Note: the uniform prior on sigma under H_1 is based on [0].
Sampling from these models/hypotheses is easy and inexpensive. Then I
compute (a, b) from (mu, sigma) - see the code - and finally compute
the marginal likelihoods via simple MC:

p(heads|H_j) = 1/K \sum_{k=1}^K \prod_{i=1}^N BetaBin(heads[i] | m, a_j^k, b_j^k)

where {(a_1^k, b_1^k)}_k is a sample of size K from the prior under H_1
and {(a_0^k, b_0^k)}_k is a sample of size K from the prior under H_0.
As you can see, the code executes in 1-2 sec., since 10^5 iterations are
enough to get a sufficiently stable Bayes factor B_10 (~2.5 in the
example in the code).
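For anyone who prefers not to open the link, here is a self-contained sketch of the procedure just described. The head counts are hypothetical placeholders (the real data is in the linked script), scipy.stats.betabinom stands in for PyMC's betabin_like, and the tiny lower bound eps on the uniform draws is a numerical guard of mine, not part of the stated priors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, m = 10, 50

# Hypothetical head counts for the N coins, visibly below m/2 (biased to tails).
heads = np.array([18, 20, 15, 22, 19, 17, 21, 16, 20, 18])

def ab_from_mu_sigma(mu, sigma):
    """Moment matching: Beta shape parameters (a, b) from mean and std."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0
    return mu * nu, (1.0 - mu) * nu

def marginal_likelihood(mu, sigma):
    """Simple MC estimate of p(heads | H): average over K prior draws of the
    product of Beta-Binomial pmfs (one factor per coin)."""
    a, b = ab_from_mu_sigma(mu, sigma)
    # logpmf has shape (K, N); sum over coins, exponentiate, average over draws.
    logp = stats.betabinom.logpmf(heads[None, :], m, a[:, None], b[:, None])
    return np.exp(logp.sum(axis=1)).mean()

K = 100_000
eps = 1e-6  # numerical guard against sigma == 0

# H_1: mu ~ U(0, 0.5), sigma ~ U(0, sqrt(mu * (1 - mu)))
mu1 = rng.uniform(eps, 0.5, size=K)
sigma1 = rng.uniform(eps, np.sqrt(mu1 * (1.0 - mu1)))
# H_0: mu = 0.5, sigma ~ U(0, 0.5)
mu0 = np.full(K, 0.5)
sigma0 = rng.uniform(eps, 0.5, size=K)

B_10 = marginal_likelihood(mu1, sigma1) / marginal_likelihood(mu0, sigma0)
```

With head counts this far below m/2, B_10 comes out well above 1, i.e. in favor of H_1; the exact value depends on the data, so it will not match the ~2.5 of the example in the linked script.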
Any question/comment/criticism is warmly welcome.
Best,
Emanuele
[0]: Prior distributions for variance parameters in hierarchical
models, Gelman, Bayesian Analysis, 2006.