[SciPy-User] Unit testing of Bayesian estimator

7 views

Skip to first unread message

Anne Archibald

unread,

Nov 6, 2009, 7:13:20 PM11/6/09

to SciPy Users List

Hi,

I have implemented a simple Bayesian regression program (it takes
events modulo one and returns a posterior probability that the data is
phase-invariant plus a posterior distribution for two parameters
(modulation fraction and phase) in case there is modulation). I'm
rather new at this, so I'd like to construct some unit tests. Does
anyone have any suggestions on how to go about this?

For a frequentist periodicity detector, the return value is a
probability that, given the null hypothesis is true, the statistic
would be this extreme. So I can construct a strong unit test by
generating a collection of data sets given the null hypothesis,
evaluating the statistic, and seeing whether the number that claim to
be significant at a 5% level is really 5%. (In fact I can use the
binomial distribution to get limits on the number of false positive.)
This gives me a unit test that is completely orthogonal to my
implementation, and that passes if and only if the code works. For a
Bayesian hypothesis testing setup, I don't really see how to do
something analogous.

I can generate non-modulated data sets and confirm that my code
returns a high probability that the data is not modulated, but how
high should I expect the probability to be? I can generate data sets
with models with known parameters and check that the best-fit
parameters are close to the known parameters - but how close? Even if
I do it many times, is the posterior mean unbiased? What about the
posterior mode or median? I can even generate models and then data
sets that are drawn from the prior distribution, but what should I
expect from the code output on such a data set? I feel sure there's
some test that verifies a statistical property of Bayesian
estimators/hypothesis testers, but I cant quite put my finger on it.

Suggestions welcome.

Thanks,
Anne
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

josef...@gmail.com

unread,

Nov 6, 2009, 10:37:44 PM11/6/09

to SciPy Users List

The Bayesian experts are at pymc, maybe you can look at there tests
for inspiration. I don't know those, since I never looked at that
part.

I never tried to test a Bayesian estimator but many properties are
still the same as in the non-Bayesian analysis. In my Bayesian past, I
essentially only used normal and t distributions, and binomial.

One of my first tests for these things is to create a huge sample and
see whether the parameter estimates converge. With Bayesian analysis
you still have the law of large numbers, (for non-dogmatic priors)
Do you have an example with a known posterior? Then, the posterior
with a large sample or the average in a Monte Carlo should still be
approximately the true one.
For symmetric distributions, the Bayesian posterior confidence
intervals and posterior mean should be roughly the same as the
frequentist estimates. With diffuse priors, in many cases the results
are exactly the same in Bayesian and MLE.
Another version I used in the past is to trace the posterior mean, as
the prior variance is reduced, in one extreme you should get the prior
back in the other extreme the MLE.

> I can even generate models and then data
> sets that are drawn from the prior distribution, but what should I
> expect from the code output on such a data set?

If you ignore the Bayesian interpretation, then this is just a
standard sampling problem, you draw prior parameters and observations,
the rest is just finding the conditional and marginal probabilities. I
think the posterior odds ratio should converge in a large Monte Carlo
to the true one, and the significance levels should correspond to the
one that has been set for the test (5%).
(In simplest case of conjugate priors, you can just interpret the
prior as a previous sample and you are back to a frequentist
explanation.)

The problem is that with an informative prior, you always have a
biased estimator in small samples and the posterior odds ratio is
affected by an informative prior. And "real" Bayesians don't care
about sampling properties.

What are your prior distributions and the likelihood function in your
case? Can you model degenerate and diffuse priors, so that an
informative prior doesn't influence you sampling results?
I'm trying to think of special cases where you could remove the effect
of the prior.

It's a bit vague because I don't see the details, and I haven't looked
at this in a while.

Tom K.

unread,

Nov 7, 2009, 10:09:27 AM11/7/09

to scipy...@scipy.org

Hi Anne, interesting question.

I'm not really sure what a Bayesian hypothesis tester is, but I expect that
it results in a random variable. For a given input prior and measurement
distribution, and choice of hypothesis (signal present or signal absent),
can you know the distribution of this random variable? If so, it could come
down to a test that this random variable - or a function of it such as mean
or probability that it is greater than some value - behaves as expected.
How do you create a unit-test that a random variable generator is working?
If the random variables were all iid normal, you could average a bunch and
then test that the mean of the sample was close to the mean of the
distribution - which is going to be impossible to guarantee, since there is
a non-zero probability that the mean is large. In practice however it is
likely that your test will pass, but you will probably have to tack down the
seeds and make sure that the probability of failure is really small so
changing seeds (and hence the underlying sequence of "random" inputs) won't
likely cause a failure.

Is there anything in scipy's stats module to test that a series of random
variables has a given distribution? Maybe scipy.stats.kstest? Who the heck
are Kolmogorov and Smirnov anyway :-)?

--
View this message in context: http://old.nabble.com/Unit-testing-of-Bayesian-estimator-tp26240654p26241135.html
Sent from the Scipy-User mailing list archive at Nabble.com.

Bruce Southey

unread,

Nov 7, 2009, 10:23:25 PM11/7/09

to SciPy Users List

On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
<aarc...@physics.mcgill.ca> wrote:
> Hi,
>
> I have implemented a simple Bayesian regression program (it takes
> events modulo one and returns a posterior probability that the data is
> phase-invariant plus a posterior distribution for two parameters
> (modulation fraction and phase) in case there is modulation).

I do not know your field, a little rusty on certain issues and I do
not consider myself a Bayesian.

Exactly what type of Bayesian did you use?
I also do not know how you implemented it especially if it is
empirical or Monte Carlo Markov Chains.

> I'm
> rather new at this, so I'd like to construct some unit tests. Does
> anyone have any suggestions on how to go about this?

Since this is a test, the theoretical 'correctness' is irrelevant. So
I would guess that you should use very informative priors and data
with a huge amount of information. That should make the posterior have
an extremely narrow range so your modal estimate is very close to the
true value within a very small range.

After that it really depends on the algorithm, the data used and what
you need to test. Basically you just have to say given this set of
inputs I get this 'result' that I consider reasonable. After all, if
the implementation of algorithm works then it is most likely the
inputs that are a problem. In statistics, problems usually enter
because the desired model can not be estimated from the provided data.
Separation of user errors from a bug in the code usually identified by
fitting simpler or alternative models.

Please do not mix Frequentist or Likelihood concepts with Bayesian.
Also you never generate data for estimation from the prior
distribution, you generate it from the posterior distribution as that
is what your estimating.

Really in Bayesian sense all this data generation is unnecessary
because you have already calculated that information in computing the
posteriors. The posterior of a parameter is a distribution not a
single number so you just compare distributions. For example, you can
compute modal values and construct Bayesian credible intervals of the
parameters. These should make very strong sense to the original values
simulated.

For Bayesian work, you must address the data and the priors. In
particular, you need to be careful about the informativeness of the
prior. You can get great results just because your prior was
sufficiently informative but you can get great results because you
data was very informative.

Depending on how it was implemented, a improper prior can be an issue
because these do not guarantee a proper posterior (but often do lead
to proper posteriors). So if your posterior is improper then you are
in a very bad situation and can lead to weird results some or all of
the time.Some times this is can easily be fixed such as by putting
bounds on flat priors. Whereas proper priors give proper posteriors.

But as a final comment, it should not matter which approach you use as
if you do not get what you simulated then either your code is wrong or
you did not simulate what your code implements. (Surprising how
frequent the latter is.)

Bruce

Anne Archibald

unread,

Nov 8, 2009, 2:14:37 AM11/8/09

to SciPy Users List

2009/11/6 <josef...@gmail.com>:

As far as getting the code roughly working, this is what I used; just
run it generating lots of photons and see that roughly the right
parameters come out. Unfortunately, this isn't really very sensitive
to all the things that are supposed to make a Bayesian estimator
better than (say) a maximum-likelihood estimator; I could have the
probability estimation pretty badly wrong, but there are so many
photons that anything but the right parameters are such a horrible fit
even a somewhat wrong algorithm won't select them.

Maybe you meant something different: I could also try fixing some
model parameters and generating just a handful of photons, so I get a
crummy estimate, but then repeating the photon generation and fit many
times, to see if the average value of the best-fit parameters comes
out close to the true parameters. But this is a test for unbiasedness
of the estimator, and it's not clear that this estimator should be
unbiased even if correct.

> Do you have an example with a known posterior? Then, the posterior
> with a large sample or the average in a Monte Carlo should still be
> approximately the true one.
> For symmetric distributions, the Bayesian posterior confidence
> intervals and posterior mean should be roughly the same as the
> frequentist estimates. With diffuse priors, in many cases the results
> are exactly the same in Bayesian and MLE.
> Another version I used in the past is to trace the posterior mean, as
> the prior variance is reduced, in one extreme you should get the prior
> back in the other extreme the MLE.

My priors are flat, on (0,1) in both phase and pulsed fraction. It
seems a bit peculiar to use anything else in phase, but I can imagine
some sort of logarithmic prior for pulsed fraction (making 0.01-0.1
equally likely to 0.1-1). I haven't experimented with introducing
localized priors, but it seems like that too wouldn't be very
sensitive to whether the Bayesian calculation is right; if the prior
insists that the values are both 0.5, then any remotely sane algorithm
will come up with posteriors that are also 0.5.

>> I can even generate models and then data
>> sets that are drawn from the prior distribution, but what should I
>> expect from the code output on such a data set?
>
> If you ignore the Bayesian interpretation, then this is just a
> standard sampling problem, you draw prior parameters and observations,
> the rest is just finding the conditional and marginal probabilities. I
> think the posterior odds ratio should converge in a large Monte Carlo
> to the true one, and the significance levels should correspond to the
> one that has been set for the test (5%).
> (In simplest case of conjugate priors, you can just interpret the
> prior as a previous sample and you are back to a frequentist
> explanation.)

This sounds like what I was trying for - draw a model according to the
priors, then generate a data set according to the model. I then get
some numbers out: the simplest is a probability that the model was
pulsed, but I can also get a credible interval or an estimated CDF for
the model parameters. But I'm trying to figure out what test I should
apply to those values to see if they make sense.

For a credible interval, I suppose I could take (say) a 95% credible
interval, then 95 times out of a hundred the model parameters I used
to generate the data set should be in the credible interval. And I
should be able to use the binomial distribution to put limits on how
close to 95% I should get in M trials. This seems to work, but I'm not
sure I understand why. The credible region is obtained from a
probability distribution for the model parameters, but I am turning
things around and testing the distribution of credible regions.

In any case, that seems to work, so now I just need to figure out a
similar test for the probability of being pulsed.

> The problem is that with an informative prior, you always have a
> biased estimator in small samples and the posterior odds ratio is
> affected by an informative prior. And "real" Bayesians don't care
> about sampling properties.
>
> What are your prior distributions and the likelihood function in your
> case? Can you model degenerate and diffuse priors, so that an
> informative prior doesn't influence you sampling results?
> I'm trying to think of special cases where you could remove the effect
> of the prior.

I've put the code on github in case that helps make this any clearer:
http://github.com/aarchiba/bayespf

The model I'm using is either: completely uniform mod 1 (probability
0.5) or: the PDF is a cosine plus a constant, where the two parameters
are the fraction of area under the cosine (as opposed to under the
constant) and the phase offset of the cosine. The likelihood is just
(modulo logs for range issues) the product over all observed phases x
of PDF(fraction, phase, x). So the mode of the posterior is exactly
the maximum-likelihood estimate (whether or not I got the math right,
more or less).

> It's a bit vague because I don't see the details, and I haven't looked
> at this in a while.

As is probably obvious, I'm pretty vague on Bayesian statistics in
general. But I'm working on it.

Anne

Anne Archibald

unread,

Nov 8, 2009, 2:25:04 AM11/8/09

to SciPy Users List

2009/11/7 Tom K. <t...@kraussfamily.org>:

>
> Hi Anne, interesting question.
>
> I'm not really sure what a Bayesian hypothesis tester is, but I expect that
> it results in a random variable. For a given input prior and measurement
> distribution, and choice of hypothesis (signal present or signal absent),
> can you know the distribution of this random variable? If so, it could come
> down to a test that this random variable - or a function of it such as mean
> or probability that it is greater than some value - behaves as expected.
> How do you create a unit-test that a random variable generator is working?
> If the random variables were all iid normal, you could average a bunch and
> then test that the mean of the sample was close to the mean of the
> distribution - which is going to be impossible to guarantee, since there is
> a non-zero probability that the mean is large. In practice however it is
> likely that your test will pass, but you will probably have to tack down the
> seeds and make sure that the probability of failure is really small so
> changing seeds (and hence the underlying sequence of "random" inputs) won't
> likely cause a failure.
>
> Is there anything in scipy's stats module to test that a series of random
> variables has a given distribution? Maybe scipy.stats.kstest? Who the heck
> are Kolmogorov and Smirnov anyway :-)?

Focusing on the hypothesis testing part, what my code does is take a
collection of photons and return the probability that they are drawn
from a pulsed distribution. My prior has two alternatives: not pulsed
(p=0.5) and pulsed (p=0.5, parameters randomly chosen). I feed this to
a Bayesian gizmo and get back a probability that the photons were
drawn from the former case.

In terms of testing, the very crude tests pass: if I give it a zillion
photons, it can correctly distinguish pulsed from unpulsed. But what
I'd like to test is whether the probability it returns is correct.
What I'd really like is some statistical test I can do on the
procedure to check whether the returned numbers are correct. Of
course, if I knew what distribution they were supposed to have, I
could just feed them to the K-S test. But I don't.

Part of the problem is that the data quality affects the result: if
feed in a zillion unpulsed photons, I get a variety of probabilities
that are close to zero - but how close is correct? I have no idea. If
I use pulsed photons, it is even more complicated: for a large pulsed
fraction, I'll get a variety of probabilities that are close to one.
But if I either reduce the number of photons or the pulsed fraction,
it gets harder to distinguish pulsed from unpulsed and the
probabilities start to drop. But I have no real idea how much.

Anne

Anne Archibald

unread,

Nov 8, 2009, 2:47:04 AM11/8/09

to SciPy Users List

2009/11/7 Bruce Southey <bsou...@gmail.com>:

> On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
> <aarc...@physics.mcgill.ca> wrote:
>> Hi,
>>
>> I have implemented a simple Bayesian regression program (it takes
>> events modulo one and returns a posterior probability that the data is
>> phase-invariant plus a posterior distribution for two parameters
>> (modulation fraction and phase) in case there is modulation).
>
> I do not know your field, a little rusty on certain issues and I do
> not consider myself a Bayesian.
>
> Exactly what type of Bayesian did you use?
> I also do not know how you implemented it especially if it is
> empirical or Monte Carlo Markov Chains.

It's an ultra-simple toy problem, really: I did the numerical
integration in the absolute simplest way possible, by evaluating the
quantity to be evaluated on a grid and averaging. See github for
details:
http://github.com/aarchiba/bayespf

I can certainly improve on this, but I'd rather get my testing issues
sorted out first, so that I can test the tests, as it were, on an
implementation I'm reasonably confident is correct, before changing it
to a mathematically more subtle one.

>> I'm
>> rather new at this, so I'd like to construct some unit tests. Does
>> anyone have any suggestions on how to go about this?
>
> Since this is a test, the theoretical 'correctness' is irrelevant. So
> I would guess that you should use very informative priors and data
> with a huge amount of information. That should make the posterior have
> an extremely narrow range so your modal estimate is very close to the
> true value within a very small range.

This doesn't really test whether the estimator is doing a good job,
since if I throw mountains of information at it, even a rather badly
wrong implementation will eventually converge to the right answer.
(This is painful experience speaking.)

I disagree on the issue of theoretical correctness, though. The best
tests do exactly that: test the theoretical correctness of the routine
in question, ideally without any reference to the implementation. To
test the SVD, for example, you just test that the two matrices are
both orthogonal, and you test that multiplying them together with the
singular values between gives you your original matrix. If your
implementation passes this test, it is computing the SVD just fine, no
matter what it looks like inside.

With the frequentist signal-detection statistics I'm more familiar
with, I can write exactly this sort of test. I talk a little more
about it here:
http://lighthouseinthesky.blogspot.com/2009/11/testing-statistical-tests.html

This works too well, it turns out, to apply to scipy's K-S test or my
own Kuiper test, since their p-values are calculated rather
approximately, so they fail.

> After that it really depends on the algorithm, the data used and what
> you need to test. Basically you just have to say given this set of
> inputs I get this 'result' that I consider reasonable. After all, if
> the implementation of algorithm works then it is most likely the
> inputs that are a problem. In statistics, problems usually enter
> because the desired model can not be estimated from the provided data.
> Separation of user errors from a bug in the code usually identified by
> fitting simpler or alternative models.

It's exactly the implementation I don't trust, here. I can scrutinize
the implementation all I like, but I'd really like an independent
check on my calculations, and staring at the code won't get me that.

Um. I would be picking models from the prior distribution, not data.
However I find the models, I have a well-defined way to generate data
from the model.

Why do you say it's a bad idea to mix Bayesian and frequentist
approaches? It seems to me that as I use them to try to answer similar
questions, it makes sense to compare them; and since I know how to
test frequentist estimators, it's worth seeing whether I can cast
Bayesian estimators in frequentist terms, at least for testing
purposes.

> Really in Bayesian sense all this data generation is unnecessary
> because you have already calculated that information in computing the
> posteriors. The posterior of a parameter is a distribution not a
> single number so you just compare distributions. For example, you can
> compute modal values and construct Bayesian credible intervals of the
> parameters. These should make very strong sense to the original values
> simulated.

I take this to mean that I don't need to do simulations to get
credible intervals (while I normally would have to to get confidence
intervals), which I agree with. But this is a different question: I'm
talking about constructing a test by simulating the whole Bayesian
process and seeing whether it behaves as it should. The problem is
coming up with a sufficiently clear mathematical definition of
"should".

> For Bayesian work, you must address the data and the priors. In
> particular, you need to be careful about the informativeness of the
> prior. You can get great results just because your prior was
> sufficiently informative but you can get great results because you
> data was very informative.
>
> Depending on how it was implemented, a improper prior can be an issue
> because these do not guarantee a proper posterior (but often do lead
> to proper posteriors). So if your posterior is improper then you are
> in a very bad situation and can lead to weird results some or all of
> the time.Some times this is can easily be fixed such as by putting
> bounds on flat priors. Whereas proper priors give proper posteriors.

Indeed. I think my priors are pretty safe: 50% chance it's pulsed,
flat priors in phase and pulsed fraction. In the long run I might want
a slightly smarter prior on pulsed fraction, but for the moment I
think it's fine.

> But as a final comment, it should not matter which approach you use as
> if you do not get what you simulated then either your code is wrong or
> you did not simulate what your code implements. (Surprising how
> frequent the latter is.)

This is a bit misleading. If I use a (fairly) small number of photons,
and/or a fairly small pulsed fraction, I should be astonished if I got
back the model parameters exactly. I know already that the data leave
a lot of room for slop, so what I am trying to test is how well this
Bayesian gizmo quantifies that slop.

Anne

josef...@gmail.com

unread,

Nov 8, 2009, 5:14:58 PM11/8/09

to SciPy Users List

When I do a Monte Carlo for point estimates, I usually check bias,
variance, mean squared error,
and mean absolute and median absolute error (which is a more
robust to outliers, e.g. because for some cases the estimator produces
numerical nonsense because of non-convergence or other numerical
problems). MSE captures better cases of biased estimators that are
better in MSE sense.

I ran your test, test_bayes.py for M = 50, 500 and 1000 adding "return
in_interval_f"
and inside = test_credible_interval()

If my reading is correct inside should be 80% of M, and you are pretty close.
(M=1000 is pretty slow on my notebook)

>>> inside
39
>>> 39/50.
0.78000000000000003
>>>
>>> inside
410
>>> inside/500.
0.81999999999999995
>>>
>>> inside/1000.
0.81499999999999995

I haven't looked enough on the details yet, but I think this way you could
test more quantiles of the distribution, to see whether the posterior
distribution is roughly the same as the sampling distribution in the
MonteCarlo.

In each iteration of the Monte Carlo you get a full posterior distribution,
after a large number of iterations you have a sampling distribution,
and it should be possible to compare this distribution with the
posterior distributions. I'm still not sure how.

two questions to your algorithm

Isn't np.random.shuffle(r) redundant?
I didn't see anywhere were the sequence of observation in r would matter.

Why do you subtract mx in the loglikelihood function?
mx = np.amax(lpdf)
p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))

>
>> Do you have an example with a known posterior? Then, the posterior
>> with a large sample or the average in a Monte Carlo should still be
>> approximately the true one.
>> For symmetric distributions, the Bayesian posterior confidence
>> intervals and posterior mean should be roughly the same as the
>> frequentist estimates. With diffuse priors, in many cases the results
>> are exactly the same in Bayesian and MLE.
>> Another version I used in the past is to trace the posterior mean, as
>> the prior variance is reduced, in one extreme you should get the prior
>> back in the other extreme the MLE.
>
> My priors are flat, on (0,1) in both phase and pulsed fraction. It
> seems a bit peculiar to use anything else in phase, but I can imagine
> some sort of logarithmic prior for pulsed fraction (making 0.01-0.1
> equally likely to 0.1-1). I haven't experimented with introducing
> localized priors, but it seems like that too wouldn't be very
> sensitive to whether the Bayesian calculation is right; if the prior
> insists that the values are both 0.5, then any remotely sane algorithm
> will come up with posteriors that are also 0.5.

In the last sentence, you better hope that, if the true fraction is 0.1
than the posterior should be concentrated around 0.1 and not around
0.5. Right now you don't have an explicit prior, but once you use
one, you might want to test the effects of an informative prior.

For binomial (fraction) the natural prior is the beta distribution,
if I remember correctly. But I don't know if the marginal
posterior in this case would also be beta.

If you ignore the Bayesian belief interpretation, then it's just a
problem of Probability Theory, and you are just checking the
small and large sample behavior of an estimator and a test,
whether it has a Bayesian origin or not.

>
> In any case, that seems to work, so now I just need to figure out a
> similar test for the probability of being pulsed.

"probability of being pulsed"
I'm not sure what test you have in mind.
There are two interpretations:
In your current example, fraction is the fraction of observations that
are pulsed and fraction=0 is a zero probability event. So you cannot
really test fraction==0 versus fraction >0.

In the other interpretation you would have a prior probability (mass)
that your star is a pulsar with fraction >0 or a non-pulsing unit
with fraction=0.

The probabilities in both cases would be similar, but the interpretation
of the test would be different, and differ between frequentists and
Bayesians.

Overall your results look almost too "nice", with 8000 observations
you get a very narrow posterior in the plot.

Josef

Anne Archibald

unread,

Nov 8, 2009, 5:51:08 PM11/8/09

to SciPy Users List

2009/11/8 <josef...@gmail.com>:

> When I do a Monte Carlo for point estimates, I usually check bias,
> variance, mean squared error,
> and mean absolute and median absolute error (which is a more
> robust to outliers, e.g. because for some cases the estimator produces
> numerical nonsense because of non-convergence or other numerical
> problems). MSE captures better cases of biased estimators that are
> better in MSE sense.

I can certainly compute all these quantities from a collection of
Monte Carlo runs, but I don't have any idea what values would indicate
correctness, apart from "not too big".

> I ran your test, test_bayes.py for M = 50, 500 and 1000 adding "return
> in_interval_f"
> and inside = test_credible_interval()
>
> If my reading is correct inside should be 80% of M, and you are pretty close.
> (M=1000 is pretty slow on my notebook)

Yeah, that's the problem with using the world's simplest numerical
integration scheme.

>>>> inside
> 39
>>>> 39/50.
> 0.78000000000000003
>>>>
>>>> inside
> 410
>>>> inside/500.
> 0.81999999999999995
>>>>
>>>> inside/1000.
> 0.81499999999999995
>
> I haven't looked enough on the details yet, but I think this way you could
> test more quantiles of the distribution, to see whether the posterior
> distribution is roughly the same as the sampling distribution in the
> MonteCarlo.

I could test more quantiles, but I'm very distrustful of testing more
than one quantile per randomly-generated sample: they should be
covariant (if the 90% mark is too high, the 95% mark will almost
certainly be too high as well) and I don't know how to take that into
account. And running the test is currently so slow I'm inclined to
spend my CPU time on a stricter test of a single quantile. Though
unfortunately to increase the strictness I also need to improve the
sampling in phase and fraction.

> In each iteration of the Monte Carlo you get a full posterior distribution,
> after a large number of iterations you have a sampling distribution,
> and it should be possible to compare this distribution with the
> posterior distributions. I'm still not sure how.

I don't understand what you mean here. I do get a full posterior
distribution out of every simulation. But how would I combine these
different distributions, and what would the combined distribution
mean?

> two questions to your algorithm
>
> Isn't np.random.shuffle(r) redundant?
> I didn't see anywhere were the sequence of observation in r would matter.

It is technically redundant. But since the point of all this is that I
don't trust my code to be right, I want to make sure there's no way it
can "cheat" by taking advantage of the order. And in any case, the
slow part is my far-too-simple numerical integration scheme. I'm
pretty sure the phase integration, at least, could be done
analytically.

> Why do you subtract mx in the loglikelihood function?
> mx = np.amax(lpdf)
> p = np.exp(lpdf - mx)/np.average(np.exp(lpdf-mx))

This is to avoid overflows. I could just use logsumexp/logaddexp, but
that's not yet in numpy on any of the machines I regularly use. It has
no effect on the value, since it's subtracted from top and bottom
both, but it ensures that the largest value exponentiated is exactly
zero.

Indeed. But with frequentist tests, I have a clear statement of what
they're telling me that I can test against: "If you feed this test
pure noise you'll get a result this high with probability p". I
haven't figured out how to turn the p-value returned by this test into
something I can test against.

>> In any case, that seems to work, so now I just need to figure out a
>> similar test for the probability of being pulsed.
>
> "probability of being pulsed"
> I'm not sure what test you have in mind.
> There are two interpretations:
> In your current example, fraction is the fraction of observations that
> are pulsed and fraction=0 is a zero probability event. So you cannot
> really test fraction==0 versus fraction >0.
>
> In the other interpretation you would have a prior probability (mass)
> that your star is a pulsar with fraction >0 or a non-pulsing unit
> with fraction=0.

This is what the code currently implements: I begin with a 50% chance
the signal is unpulsed and a 50% chance the signal is pulsed with some
fraction >= 0.

> The probabilities in both cases would be similar, but the interpretation
> of the test would be different, and differ between frequentists and
> Bayesians.
>
> Overall your results look almost too "nice", with 8000 observations
> you get a very narrow posterior in the plot.

If you supply a fairly high pulsed fraction, it's indeed easy to tell
that it's pulsed with 8000 photons; the difficulty comes when you're
looking for a 10% pulsed fraction; it's much harder than 800 photons
with a 100% pulsed fraction. If I were really interested in the
many-photons case I'd want to think about a prior that made more sense
for really small fractions. But I'm keeping things simple for now.

josef...@gmail.com

unread,

Nov 8, 2009, 9:35:18 PM11/8/09

to SciPy Users List

On Sun, Nov 8, 2009 at 5:51 PM, Anne Archibald
<peridot...@gmail.com> wrote:
> 2009/11/8 <josef...@gmail.com>:
>
>> When I do a Monte Carlo for point estimates, I usually check bias,
>> variance, mean squared error,
>> and mean absolute and median absolute error (which is a more
>> robust to outliers, e.g. because for some cases the estimator produces
>> numerical nonsense because of non-convergence or other numerical
>> problems). MSE captures better cases of biased estimators that are
>> better in MSE sense.
>
> I can certainly compute all these quantities from a collection of
> Monte Carlo runs, but I don't have any idea what values would indicate
> correctness, apart from "not too big".

I consider them mainly as an absolute standard to see how well
an estimator works (or what the size and power of a test is) or
to compare them to other estimators, which is a common case for
publishing Monte Carlo studies for new estimators.

Adding additional quantiles might be relatively cheap, mainly the
call to searchsorted. One or two quantiles could be consistent
with many different distributions or e.g with fatter tails, so I usually
check more points.

>
>> In each iteration of the Monte Carlo you get a full posterior distribution,
>> after a large number of iterations you have a sampling distribution,
>> and it should be possible to compare this distribution with the
>> posterior distributions. I'm still not sure how.
>
> I don't understand what you mean here. I do get a full posterior
> distribution out of every simulation. But how would I combine these
> different distributions, and what would the combined distribution
> mean?

I'm still trying to think how this can be done. Checking more quantiles
as discussed above might be doing it to some extend.
(I also wonder whether it might be useful to fix the observations
during the monte carlo and only vary the sampling of the parameters ?)

What exactly are the null and the alternative hypothesis that you
want to test? This is still not clear to me, see also below.

>
>>> In any case, that seems to work, so now I just need to figure out a
>>> similar test for the probability of being pulsed.
>>
>> "probability of being pulsed"
>> I'm not sure what test you have in mind.
>> There are two interpretations:
>> In your current example, fraction is the fraction of observations that
>> are pulsed and fraction=0 is a zero probability event. So you cannot
>> really test fraction==0 versus fraction >0.
>>
>> In the other interpretation you would have a prior probability (mass)
>> that your star is a pulsar with fraction >0 or a non-pulsing unit
>> with fraction=0.
>
> This is what the code currently implements: I begin with a 50% chance
> the signal is unpulsed and a 50% chance the signal is pulsed with some
> fraction >= 0.

I don't see this

in generate you have m = np.random.binomial(n, fraction)
where m is the number of pulsed observations.

the probability of observing no pulsed observations is very
small
>>> stats.binom.pmf(0,100,0.05)
0.0059205292203340009

your likelihood function pdf_data_given_model
also treats each observation with equal fraction to be pulsed or not.

I would have expected something like
case1: pulsed according to binomial fraction, same as now
case2: no observations are pulsed
prior Prob(case1)=0.5, Prob(case2)=0.5

Or am I missing something?

Where is there a test? You have a good posterior distribution
for the fraction, which imply a point estimate and confidence interval,
which look good from the tests.
But I don't see a test hypothesis, (especially a Bayesian statement)

Josef

Anne Archibald

unread,

Nov 8, 2009, 10:22:36 PM11/8/09

to SciPy Users List

2009/11/8 <josef...@gmail.com>:

> On Sun, Nov 8, 2009 at 5:51 PM, Anne Archibald
> <peridot...@gmail.com> wrote:
>> 2009/11/8 <josef...@gmail.com>:
>>
>>> When I do a Monte Carlo for point estimates, I usually check bias,
>>> variance, mean squared error,
>>> and mean absolute and median absolute error (which is a more
>>> robust to outliers, e.g. because for some cases the estimator produces
>>> numerical nonsense because of non-convergence or other numerical
>>> problems). MSE captures better cases of biased estimators that are
>>> better in MSE sense.
>>
>> I can certainly compute all these quantities from a collection of
>> Monte Carlo runs, but I don't have any idea what values would indicate
>> correctness, apart from "not too big".
>
> I consider them mainly as an absolute standard to see how well
> an estimator works (or what the size and power of a test is) or
> to compare them to other estimators, which is a common case for
> publishing Monte Carlo studies for new estimators.

Ah. Yes, I will definitely be wanting to compute these at some point.
But first I'd like to make sure this estimator is doing what I want it
to.

>> I could test more quantiles, but I'm very distrustful of testing more
>> than one quantile per randomly-generated sample: they should be
>> covariant (if the 90% mark is too high, the 95% mark will almost
>> certainly be too high as well) and I don't know how to take that into
>> account. And running the test is currently so slow I'm inclined to
>> spend my CPU time on a stricter test of a single quantile. Though
>> unfortunately to increase the strictness I also need to improve the
>> sampling in phase and fraction.
>
> Adding additional quantiles might be relatively cheap, mainly the
> call to searchsorted. One or two quantiles could be consistent
> with many different distributions or e.g with fatter tails, so I usually
> check more points.

As I said, I'm concerned about using more than one credible interval
per simulation run, since the credible intervals for different
quantiles will be different sizes.

>>> In each iteration of the Monte Carlo you get a full posterior distribution,
>>> after a large number of iterations you have a sampling distribution,
>>> and it should be possible to compare this distribution with the
>>> posterior distributions. I'm still not sure how.
>>
>> I don't understand what you mean here. I do get a full posterior
>> distribution out of every simulation. But how would I combine these
>> different distributions, and what would the combined distribution
>> mean?
>
> I'm still trying to think how this can be done. Checking more quantiles
> as discussed above might be doing it to some extend.
> (I also wonder whether it might be useful to fix the observations
> during the monte carlo and only vary the sampling of the parameters ?)

I can see that fixing the data would be sort of nice, but it's not at
all clear to me what it would even mean to vary the model while
keeping the data constant - after all, the estimator has no access to
the model, only the data, so varying the model would have no effect on
the result returned.

>> Indeed. But with frequentist tests, I have a clear statement of what
>> they're telling me that I can test against: "If you feed this test
>> pure noise you'll get a result this high with probability p". I
>> haven't figured out how to turn the p-value returned by this test into
>> something I can test against.
>
> What exactly are the null and the alternative hypothesis that you
> want to test? This is still not clear to me, see also below.

Null hypothesis: no pulsations, all photons are drawn from a uniform
distributions.

Alternative: photons are drawn from a distribution with pulsed
fraction f and phase p.

>> This is what the code currently implements: I begin with a 50% chance
>> the signal is unpulsed and a 50% chance the signal is pulsed with some
>> fraction >= 0.
>
> I don't see this
>
> in generate you have m = np.random.binomial(n, fraction)
> where m is the number of pulsed observations.

Generate can be used to generate photons from a uniform distribution
by calling it with fraction set to zero. I don't actually do this
while testing the credible intervals, because (as I understand it)
the presence of this hypothesis does not affect the credible
intervals. That is, the credible intervals I'm testing are the
credible intervals assuming that the pulsations are real. I'm not at
all sure how to incorporate the alternative hypothesis into my
testing.

> the probability of observing no pulsed observations is very
> small
>>>> stats.binom.pmf(0,100,0.05)
> 0.0059205292203340009
>
> your likelihood function pdf_data_given_model
> also treats each observation with equal fraction to be pulsed or not.

pdf_data_given_model computes the PDF given a set of model parameters.
If you want to use it to get a likelihood with fraction=0, you can
call it with fraction=0. But this likelihood is always zero.

> I would have expected something like
> case1: pulsed according to binomial fraction, same as now
> case2: no observations are pulsed
> prior Prob(case1)=0.5, Prob(case2)=0.5
>
> Or am I missing something?
>
> Where is there a test? You have a good posterior distribution
> for the fraction, which imply a point estimate and confidence interval,
> which look good from the tests.
> But I don't see a test hypothesis, (especially a Bayesian statement)

When I came to implement it, the only place I actually needed to
mention the null hypothesis was in the calculation of the pulsed
probability, which is the last value returned by the inference
routine. I did make the somewhat peculiar choice to have the model PDF
returned normalized so that its total probability was one, rather than
scaling all points by the probability that the system is pulsed at
all.

It turns out that the inference code just computes the total
normalization S over all parameters; then the probability that the
signal is pulsed is S/(S+1).

josef...@gmail.com

unread,

Nov 9, 2009, 12:07:53 AM11/9/09

to SciPy Users List

On Sun, Nov 8, 2009 at 10:22 PM, Anne Archibald

Ok, I think I'm starting to see how this works, since you drop the
prior probabilities (0.5, 0.5) and the likelihood under the uniform
distribution is just 1, everything is pretty reduced form.

>From the posterior probability S/(S+1), you could construct
a decision rule similar to a classical test, e.g. accept null
if S/(S+1) < 0.95, and then construct a MonteCarlo
with samples drawn form either the uniform or the pulsed
distribution in the same way as for a classical test, and
verify that the decision mistakes, alpha and beta errors, in the
sample are close to the posterior probabilities.
The posterior probability would be similar to the p-value
in a classical test. If you want to balance alpha and
beta errors, a threshold S/(S+1)<0.5 would be more
appropriate, but for the unit tests it wouldn't matter.

Running the example a few times, it looks like that the power
is relatively low for distinguishing uniform distribution from
a pulsed distribution with fraction/binomial parameter 0.05
and sample size <1000.
If you have strong beliefs that the fraction is really this low
than an informative prior for the fraction, might improve the
results.

Josef

Bruce Southey

unread,

Nov 9, 2009, 12:18:41 PM11/9/09

to scipy...@scipy.org

On 11/08/2009 01:47 AM, Anne Archibald wrote:
> 2009/11/7 Bruce Southey<bsou...@gmail.com>:
>
>> On Fri, Nov 6, 2009 at 6:13 PM, Anne Archibald
>> <aarc...@physics.mcgill.ca> wrote:
>>
>>> Hi,
>>>
>>> I have implemented a simple Bayesian regression program (it takes
>>> events modulo one and returns a posterior probability that the data is
>>> phase-invariant plus a posterior distribution for two parameters
>>> (modulation fraction and phase) in case there is modulation).
>>>
>> I do not know your field, a little rusty on certain issues and I do
>> not consider myself a Bayesian.
>>
>> Exactly what type of Bayesian did you use?
>> I also do not know how you implemented it especially if it is
>> empirical or Monte Carlo Markov Chains.
>>
> It's an ultra-simple toy problem, really: I did the numerical
> integration in the absolute simplest way possible, by evaluating the
> quantity to be evaluated on a grid and averaging. See github for
> details:
> http://github.com/aarchiba/bayespf
>
> I can certainly improve on this, but I'd rather get my testing issues
> sorted out first, so that I can test the tests, as it were, on an
> implementation I'm reasonably confident is correct, before changing it
> to a mathematically more subtle one.
>

I do not know what you are trying to do with the code as it is not my
area. But you are using some empirical Bayesian estimator
(http://en.wikipedia.org/wiki/Empirical_Bayes_method) and thus you lose
much of the value of Bayesian as you are only dealing with modal
estimates. Really you should be obtaining the distribution of
"Probability the signal is pulsed" not just the modal estimate.

>>> I'm
>>> rather new at this, so I'd like to construct some unit tests. Does
>>> anyone have any suggestions on how to go about this?
>>>
>> Since this is a test, the theoretical 'correctness' is irrelevant. So
>> I would guess that you should use very informative priors and data
>> with a huge amount of information. That should make the posterior have
>> an extremely narrow range so your modal estimate is very close to the
>> true value within a very small range.
>>
> This doesn't really test whether the estimator is doing a good job,
> since if I throw mountains of information at it, even a rather badly
> wrong implementation will eventually converge to the right answer.
> (This is painful experience speaking.)
>

Are you testing the code or the method?
My understanding of unit tests is that they test the code not the
method. Unit tests tell me that my code is working correctly but do not
necessary tell me if the method is right always. For example, if I need
to iterate to get a solution, my test could stop after 1 or 2 rounds
before convergence because I know that rest will be correct if the first
rounds are correct.

Testing the algorithm is relatively easy because you just have to use
sensitivity analysis. Basically just use multiple data sets that vary in
the number of observations and parameters to see how well these work.
The hard part is making sense of the numbers.

Also note that you have some explicit assumptions involved like the type
of prior distribution. These tend to limit what you can do because if
these assume a uniform prior then you can not use a non-uniform data
set. Well you can but unless the data dominates the prior you will most
likely get a weird answer.

> I disagree on the issue of theoretical correctness, though. The best
> tests do exactly that: test the theoretical correctness of the routine
> in question, ideally without any reference to the implementation. To
> test the SVD, for example, you just test that the two matrices are
> both orthogonal, and you test that multiplying them together with the
> singular values between gives you your original matrix. If your
> implementation passes this test, it is computing the SVD just fine, no
> matter what it looks like inside.
>

I agree that code must provide theoretical correctness. But I disagree
that the code should always give the correct answer because the code is
only as good as the algorithm.

> With the frequentist signal-detection statistics I'm more familiar
> with, I can write exactly this sort of test. I talk a little more
> about it here:
> http://lighthouseinthesky.blogspot.com/2009/11/testing-statistical-tests.html
>
> This works too well, it turns out, to apply to scipy's K-S test or my
> own Kuiper test, since their p-values are calculated rather
> approximately, so they fail.
>

Again, this is a failure of the algorithm not the code. Often
statistical tests rely on large sample sizes or rely on the central
limit theorem so these break down when the data is not well approximated
by the normal distribution and when the sample size is small. In the
blog, your usage of the chi-squared approximation is an example of this
- it will be inappropriate for small sample size as well as when the
true probability is very extreme (usually consider it valid between
about 0.2 and 0.8 obviously depending on sample size).

>> After that it really depends on the algorithm, the data used and what
>> you need to test. Basically you just have to say given this set of
>> inputs I get this 'result' that I consider reasonable. After all, if
>> the implementation of algorithm works then it is most likely the
>> inputs that are a problem. In statistics, problems usually enter
>> because the desired model can not be estimated from the provided data.
>> Separation of user errors from a bug in the code usually identified by
>> fitting simpler or alternative models.
>>
> It's exactly the implementation I don't trust, here. I can scrutinize
> the implementation all I like, but I'd really like an independent
> check on my calculations, and staring at the code won't get me that.
>

I think that you mean is the algorithm and I do agree that looking at
the code will only tell you that you have implemented the algorithm and
will not tell you if the algorithm can be trusted.

This is the fundamental difference between Bayesian and Frequentist
approaches. In Bayesian, the posterior provides everything that you know
about a parameter because it is a distribution. However, the modal
parameter estimates should agree between both approaches.

>> Really in Bayesian sense all this data generation is unnecessary
>> because you have already calculated that information in computing the
>> posteriors. The posterior of a parameter is a distribution not a
>> single number so you just compare distributions. For example, you can
>> compute modal values and construct Bayesian credible intervals of the
>> parameters. These should make very strong sense to the original values
>> simulated.
>>
> I take this to mean that I don't need to do simulations to get
> credible intervals (while I normally would have to to get confidence
> intervals), which I agree with. But this is a different question: I'm
> talking about constructing a test by simulating the whole Bayesian
> process and seeing whether it behaves as it should. The problem is
> coming up with a sufficiently clear mathematical definition of
> "should".
>

In Bayesian, you should have the posterior distribution of the parameter
which is far more than just the mode, mean and variance. So if the
posterior is normal, then I know what the mean and variance should be
and thus what the confidence interval should be.

>> For Bayesian work, you must address the data and the priors. In
>> particular, you need to be careful about the informativeness of the
>> prior. You can get great results just because your prior was
>> sufficiently informative but you can get great results because you
>> data was very informative.
>>
>> Depending on how it was implemented, a improper prior can be an issue
>> because these do not guarantee a proper posterior (but often do lead
>> to proper posteriors). So if your posterior is improper then you are
>> in a very bad situation and can lead to weird results some or all of
>> the time.Some times this is can easily be fixed such as by putting
>> bounds on flat priors. Whereas proper priors give proper posteriors.
>>
> Indeed. I think my priors are pretty safe: 50% chance it's pulsed,
> flat priors in phase and pulsed fraction. In the long run I might want
> a slightly smarter prior on pulsed fraction, but for the moment I
> think it's fine.
>

It is not about whether or not the prior are 'safe'. Rather it is the
relative amount of information given.

Also, from other areas, I tend to distrust anything that assumes a 50:50
split because these often lead to special results that do not always
occur. So you think great, my code is working as everything looks fine.
Then everything crashes when you deviate from that assumption because
the 50% probability 'hides' something.
An example in my area, the formula (for genetic effect) is of the form:
alpha=2*p*q*(a + d(q-p))
where alpha, a and d are parameters and p and q are frequencies that add
to one.
We could mistakenly assume that a=alpha but that is only true if d=0 or
when p=q=0.5. So we may start to get incorrect results when p is not
equal to q and d is not zero.

>> But as a final comment, it should not matter which approach you use as
>> if you do not get what you simulated then either your code is wrong or
>> you did not simulate what your code implements. (Surprising how
>> frequent the latter is.)
>>
> This is a bit misleading. If I use a (fairly) small number of photons,
> and/or a fairly small pulsed fraction, I should be astonished if I got
> back the model parameters exactly. I know already that the data leave
> a lot of room for slop, so what I am trying to test is how well this
> Bayesian gizmo quantifies that slop.
>
> Anne
>
>

You should not be astonished to get the prior as that is exactly what
should happen if the data contains no information.

For your simplistic case, I would implore you to generate the posterior
distribution of the probability that the signal is pulsed (yes, easier
to say than do). If you can do that then you will not only you get the
modal value but you can compute the area under that distribution that is
say less than 0.5 or whatever threshold you want to use.

josef...@gmail.com

unread,

Nov 9, 2009, 1:02:38 PM11/9/09

to SciPy Users List

With the limitation of using only a single prior at the current stage, I think
Anne has already implemented the standard Bayesian estimation including
a nice graph of the joint posterior distribution of fraction and phase.

Whether the signal is pulsed or not is just a binary variable, and the posterior
belief is just the probability that it is pulsed. (I think that's correct, since
there is no additional level where the "population" distribution of pulsing
versus non-pulsing signals is a random variable.)

I think it is very useful to look at the sampling distribution of a Bayesian
estimator or test. For an individual Bayesian everything is summarized
in the posterior distribution, but how well are a 1000 Bayesian
econometricians/statisticians doing that look at a 1000 different stars,
especially if I'm not sure they programmed their code correctly?
At the end, it's only interesting if they have a lower MSE or a higher
power in the test, otherwise I just use a different algorithm.

I usually check and test the statistical properties of an algorithm
and not just whether it is correctly implemented. And I think Anne
is doing it in a similar way, checking whether the size of a
statistical test is ok. Of course, for applications where only
incorrectly sized (biased) tests are available, it is difficult to find
a good tightness of the unit tests.

Josef

Anne Archibald

unread,

Nov 9, 2009, 1:06:15 PM11/9/09

to SciPy Users List

2009/11/9 Bruce Southey <bsou...@gmail.com>:

> I do not know what you are trying to do with the code as it is not my
> area. But you are using some empirical Bayesian estimator
> (http://en.wikipedia.org/wiki/Empirical_Bayes_method) and thus you lose
> much of the value of Bayesian as you are only dealing with modal
> estimates. Really you should be obtaining the distribution of
> "Probability the signal is pulsed" not just the modal estimate.

Um. Given a data set and a prior, I just do Bayesian hypothesis
comparison. This gives me a single probability that the signal is
pulsed. You seem to be imagining a probability distribution for this
probability - but what would the independent variables be? The
unpulsed distribution does not depend on any parameters, and I have
integrated over all possible values for the pulsed distribution. So
what I get should really be the probability, given the data, that the
signal is pulsed. I'm not using an empirical Bayesian estimator; I'm
doing the numerical integrations directly (and inefficiently).

>> This doesn't really test whether the estimator is doing a good job,
>> since if I throw mountains of information at it, even a rather badly
>> wrong implementation will eventually converge to the right answer.
>> (This is painful experience speaking.)
>>
> Are you testing the code or the method?
> My understanding of unit tests is that they test the code not the
> method. Unit tests tell me that my code is working correctly but do not
> necessary tell me if the method is right always. For example, if I need
> to iterate to get a solution, my test could stop after 1 or 2 rounds
> before convergence because I know that rest will be correct if the first
> rounds are correct.

Unit tests can be used to do either. Since what I'm trying to do here
is make sure I understand Bayesian inference, I'm most worried about
the algorithm.

> Testing the algorithm is relatively easy because you just have to use
> sensitivity analysis. Basically just use multiple data sets that vary in
> the number of observations and parameters to see how well these work.
> The hard part is making sense of the numbers.

It is exactly how to make sense of the numbers that I'm asking about.

> Also note that you have some explicit assumptions involved like the type
> of prior distribution. These tend to limit what you can do because if
> these assume a uniform prior then you can not use a non-uniform data
> set. Well you can but unless the data dominates the prior you will most
> likely get a weird answer.

I don't understand what you mean by a "non-uniform data set".
Individual data sets are drawn from models, one of which is uniform.
The priors define the distribution of models; the priors I use give a
50% chance the model is uniform and a 50% chance the model is pulsed.

Anne

Anne Archibald

unread,

Nov 9, 2009, 1:14:36 PM11/9/09

to SciPy Users List

2009/11/9 <josef...@gmail.com>:

> >From the posterior probability S/(S+1), you could construct
> a decision rule similar to a classical test, e.g. accept null
> if S/(S+1) < 0.95, and then construct a MonteCarlo
> with samples drawn form either the uniform or the pulsed
> distribution in the same way as for a classical test, and
> verify that the decision mistakes, alpha and beta errors, in the
> sample are close to the posterior probabilities.
> The posterior probability would be similar to the p-value
> in a classical test. If you want to balance alpha and
> beta errors, a threshold S/(S+1)<0.5 would be more
> appropriate, but for the unit tests it wouldn't matter.

Unfortunately this doesn't work. Think of it this way: if my data size
is 10000 photons, and I'm looking at the fraction of
uniformly-distributed data sets that have a probability > 0.95 that
they are pulsed, this won't happen with 5% of my fake data sets - it
will almost never happen, since 10000 photons are enough to give a
very solid answer (experiment confirms this). So I can't interpret my
Bayesian probability as a frequentist probability of alpha error.

> Running the example a few times, it looks like that the power
> is relatively low for distinguishing uniform distribution from
> a pulsed distribution with fraction/binomial parameter 0.05
> and sample size <1000.
> If you have strong beliefs that the fraction is really this low
> than an informative prior for the fraction, might improve the
> results.

I really don't want to encourage my code to return reports of
pulsations. To be believed in this nest of frequentists I work with, I
need a solid detection in spite of very conservative priors.

josef...@gmail.com

unread,

Nov 9, 2009, 1:44:50 PM11/9/09

to SciPy Users List

On Mon, Nov 9, 2009 at 1:14 PM, Anne Archibald
<peridot...@gmail.com> wrote:
> 2009/11/9 <josef...@gmail.com>:
>
>> >From the posterior probability S/(S+1), you could construct
>> a decision rule similar to a classical test, e.g. accept null
>> if S/(S+1) < 0.95, and then construct a MonteCarlo
>> with samples drawn form either the uniform or the pulsed
>> distribution in the same way as for a classical test, and
>> verify that the decision mistakes, alpha and beta errors, in the
>> sample are close to the posterior probabilities.
>> The posterior probability would be similar to the p-value
>> in a classical test. If you want to balance alpha and
>> beta errors, a threshold S/(S+1)<0.5 would be more
>> appropriate, but for the unit tests it wouldn't matter.
>
> Unfortunately this doesn't work. Think of it this way: if my data size
> is 10000 photons, and I'm looking at the fraction of
> uniformly-distributed data sets that have a probability > 0.95 that
> they are pulsed, this won't happen with 5% of my fake data sets - it
> will almost never happen, since 10000 photons are enough to give a
> very solid answer (experiment confirms this). So I can't interpret my
> Bayesian probability as a frequentist probability of alpha error.

Doesn't this mean that the Bayesian posterior doesn't have the
correct tail probabilities? If my posterior beliefs are that the
probability that I make a mistake is 5% and I have the correct
model, but the real probability that I make a mistake is only 0.1%,
then my updating should have correctly taken into account that
the signal is so informative and tightened my posterior distribution.

With 8000 in your example, I get
Probability the signal is pulsed: 0.999960
This makes it pretty obvious if the signal is pulsed or not.

Do the tail probabilities work better for cases that are not so
easy to distinguish?

Josef

>
>> Running the example a few times, it looks like that the power
>> is relatively low for distinguishing uniform distribution from
>> a pulsed distribution with fraction/binomial parameter 0.05
>> and sample size <1000.
>> If you have strong beliefs that the fraction is really this low
>> than an informative prior for the fraction, might improve the
>> results.
>
> I really don't want to encourage my code to return reports of
> pulsations. To be believed in this nest of frequentists I work with, I
> need a solid detection in spite of very conservative priors.

When you have everything working, then you could check the
sensitivity to the priors. For parameter estimation, I found it
interesting to see which parameters change a lot when I
varied the prior variance, and it helps in the defense against
frequentists.

Josef

Bruce Southey

unread,

Nov 9, 2009, 2:47:10 PM11/9/09

to scipy...@scipy.org

On 11/09/2009 12:06 PM, Anne Archibald wrote:
> 2009/11/9 Bruce Southey<bsou...@gmail.com>:
>
>
>> I do not know what you are trying to do with the code as it is not my
>> area. But you are using some empirical Bayesian estimator
>> (http://en.wikipedia.org/wiki/Empirical_Bayes_method) and thus you lose
>> much of the value of Bayesian as you are only dealing with modal
>> estimates. Really you should be obtaining the distribution of
>> "Probability the signal is pulsed" not just the modal estimate.
>>
> Um. Given a data set and a prior, I just do Bayesian hypothesis
> comparison. This gives me a single probability that the signal is
> pulsed. You seem to be imagining a probability distribution for this
> probability - but what would the independent variables be? The
> unpulsed distribution does not depend on any parameters, and I have
> integrated over all possible values for the pulsed distribution. So
> what I get should really be the probability, given the data, that the
> signal is pulsed. I'm not using an empirical Bayesian estimator; I'm
> doing the numerical integrations directly (and inefficiently).
>

Here are two links on what I mean with reference to the binomial case:
http://lingpipe-blog.com/2009/09/11/batting-averages-bayesian-vs-mle-estimate/

TEACHING OF BAYESIAN ESTIMATION OF “P” PROBABILITY
IN A BERNOULLI PROCESS:
http://www.stat.auckland.ac.nz/~iase/publications/17/C439.pdf

I do not know your area but you should be able to do something similar.

Bruce

Anne Archibald

unread,

Nov 9, 2009, 4:28:19 PM11/9/09

to SciPy Users List

2009/11/9 Bruce Southey <bsou...@gmail.com>:

They are doing something essentially different from what I am doing.
They have a single (parameterized) hypothesis, so they don't compute a
probability of it being the case rather than some other hypothesis.
Perhaps you are being misled by the fact that the system they are
reasoning about is a binomial system, in which the parameter is
"probability of occurrence". In my case, I am not working with a
binomial system; the closest analog in my system to their p is my
fraction parameter, and I seem to have a usable way to test the
posterior distribution of this parameter. It is the hypothesis testing
that I am trying to test at the moment.

Anne

Johann Cohen-Tanugi

unread,

Nov 15, 2009, 4:02:38 PM11/15/09

to SciPy Users List

Anne, do you know of a python implementation of Lomb-Scargle?
Johann

Anne Archibald

unread,

Nov 15, 2009, 9:49:23 PM11/15/09

to co...@lpta.in2p3.fr, SciPy Users List

2009/11/15 Johann Cohen-Tanugi <co...@lpta.in2p3.fr>:

> Anne, do you know of a python implementation of Lomb-Scargle?

I don't, unfortunately. But as there seems to be no clever FFT-like
trick to it, it can probably be written in a few lines of numpy code -
simply take an array of frequencies and an array of times, broadcast
them together, and apply the formulas. If you have a lot of events or
a lot of frequencies, a loop over the smaller array will save a big
intermediate array, but beyond that I don't think there's much
cleverness to be put in.

Anne Archibald

unread,

Nov 15, 2009, 10:22:28 PM11/15/09

to SciPy Users List

Thank you everyone for all your comments. I have managed to pull
together a more-or-less satisfactory solution. If you're curious, I
have written up the problem at:
http://lighthouseinthesky.blogspot.com/2009/11/curve-fitting-part-3-bayesian-fitting.html
and my solution so far at:
http://lighthouseinthesky.blogspot.com/2009/11/curve-fitting-part-4-validating.html