I have a probability on several parameters, and one of them is a nuisance parameter. I want to marginalize over it
by simply integrating the probability over the whole range of values that the nuisance parameter is allowed to take.
Is doing it that way what people call using a flat prior? Is it what other people call a uniform prior? And, finally, is this the same thing as what some others call "no prior"?
Thanks in advance,
Ruth Lazkoz
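The marginalization described in the question can be sketched numerically. This is only an illustration, not the poster's actual model: `joint_density` is a hypothetical stand-in for the joint probability, and the integration limits are assumed bounds on the nuisance parameter. Integrating with a constant weight over a bounded range corresponds to a uniform prior on that range, up to a normalising constant.

```python
import math

def joint_density(theta, nu):
    """Hypothetical unnormalised joint density of the parameter of interest
    (theta) and the nuisance parameter (nu); replace with the real one."""
    return math.exp(-0.5 * (theta - nu) ** 2) * math.exp(-0.5 * nu ** 2)

def marginalise(theta, nu_lo, nu_hi, steps=2000):
    """Integrate the joint density over nu on [nu_lo, nu_hi] with the
    trapezoidal rule, i.e. with a constant (flat) weight on nu."""
    h = (nu_hi - nu_lo) / steps
    total = 0.5 * (joint_density(theta, nu_lo) + joint_density(theta, nu_hi))
    for i in range(1, steps):
        total += joint_density(theta, nu_lo + i * h)
    return total * h

# Unnormalised marginal of theta at two points (assumed range for nu: -10 to 10).
marginal_at_zero = marginalise(0.0, -10.0, 10.0)
marginal_at_one = marginalise(1.0, -10.0, 10.0)
print(marginal_at_zero, marginal_at_one)
```

For this toy density the integral has a closed form, sqrt(pi) * exp(-theta^2 / 4), which gives a way to sanity-check the numerical routine before using it on a real model.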
> I have a probability on several parameters and one of them is
> nuisance one. I want to margizalize over it by simply
> integrating the probability along the whole range of values
> that such nuisance parameter is allowed to take.
Okay good. You want to "integrate out a nuisance parameter."
> Doing it that way is what people call using a flat prior?
I think possibly you're combining two different issues here.
A "prior distribution" is a Bayesian term, and means, basically,
what you think the distribution of a parameter is before
(prior to) some additional data or evidence leads you to
modify your prior beliefs (i.e., produce an updated, or
posterior estimate of the distribution).
So I think we should just dispense with the word "prior"
here. It seems like all you really want to know is: given
that I don't know the shape of my nuisance parameter
distribution, what's a good guess?
Usually one looks to theory for this. Often one just assumes
a normal distribution. That follows when, for example,
the parameter reflects the joint influence of many different factors,
some positive and some negative, such that a bell-shape curve results.
I could be wrong, but I believe a "flat distribution" and a "uniform
distribution" would mean the same thing.
If you could give us some idea of the nuisance parameter, we might be able
to make suggestions concerning its plausible shape.
Hope this helps.
--
John Uebersax PhD
That is an EXCELLENT description of what a Bayesian's
prior distribution is. A quantitative description and
summary of a Bayesian's personal opinion/belief of what
he KNOWS about a parameter, to the best of his
knowledge. That is why a TRUE Bayesian is very
SERIOUS about the assessment of his OWN prior, and
does not willy-nilly claim ignorance simply because the
description is mathematically intractable or he does not
know how to elicit his own prior and describe
it in the form of a probability density function.
That is why Robert Schlaifer wrote an entire BOOK
discussing only the computer programs and routines to
help a Bayesian analyze a ONE parameter problem.
>
> So I think we should just dispense with the word "prior"
> here. It seems like all you really want to know is: given
> that I don't know the shape of my nuisance parameter
> distribution, what's a good guess?
That is a completely NON-Bayesian attitude.
>
> Usually one looks to theory for this. Often one just assumes
> a normal distribution. That follows when, for example,
> the parameter reflects the joint influence of many different factors,
> some positive and some negative, such that a bell-shape curve results.
That is not even a good NON-Bayesian approach. You don't
simply ASSUME anything has a normal distribution or any other
distribution. You use data to VALIDATE whether that is even
a reasonable assumption.
>
>
> I could be wrong, but I believe a "flat distribution" and a "uniform
> distribution" would mean the same thing.
You are wrong in your entire understanding of what Bayesian
Statistics is about. A flat distribution as used by pseudo-
Bayesians is NOT necessarily a uniform distribution. A uniform
distribution cannot have infinite endpoints, for one thing.
>
> If you could give us some idea of the nuisance parameter, we might be
> able
> to make suggestions concerning its plausible shape.
>
> Hope this helps.
> --
> John Uebersax PhD
The OP asks a Bayesian question about uniform priors, flat priors,
uninformative priors, etc. and those are Bayesian concepts.
What you gave is a completely UNEDUCATED answer as well as
a completely inappropriate answer even as a NON-Bayesian.
It is times like this that discussants in sci.stat.math should keep
their mouths SHUT. Your advice to the OP (especially the
bit about "dispense with the word 'prior'") is at best advice
TO malpractice Bayesian statistics, because you don't know
anything about Bayesian statistics yourself.
-- Reef Fish Bob.
Bob
--
Bob O'Hara
Department of Mathematics and Statistics
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FIN-00014 University of Helsinki
Finland
Telephone: +358-9-191 51479
Mobile: +358 50 599 0540
Fax: +358-9-191 51400
WWW: http://www.RNI.Helsinki.FI/~boh/
Journal of Negative Results - EEB: www.jnr-eeb.org
You are ALMOST correct. That's why I said Uebersax is NOT a
Bayesian. We already know Bob O'Hara isn't one. :-)
The posterior distribution is the likelihood function if the prior is
"diffuse" (which is NOT the same as a "uniform" or "flat" prior).
For Bayesian Inference on the parameter p of a Binomial distribution
or a Bernoulli Process, the beta distribution is a member of the
conjugate prior family -- meaning both the prior AND posterior
belong to the same distribution family -- Beta.
The uniform distribution on (0,1) is a Beta distribution with
parameters (1,1) and is an INFORMATIVE prior.
Beta(1/2, 3) is reverse J-shaped.
Beta(1/2, 1/2) is U-shaped, symmetric around 1/2.
Beta(2, 2) is symmetric unimodal; Beta(2,3) is also unimodal, but skewed to the right.
Beta(2,1) is the triangular distribution on (0,1)
Beta(3,2) is unimodal, skewed to the left.
Beta(3,1) is J-shaped, so is Beta(2, 1/3).
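The shapes in this list follow from the exponents of the Beta kernel p^(a-1) (1-p)^(b-1). A rough sketch of that classification is below; the function name and the shape labels are illustrative choices, not standard terminology from any library.

```python
def beta_shape(a, b):
    """Describe the shape of a Beta(a, b) density on (0, 1) from a and b alone."""
    if a == 1 and b == 1:
        return "uniform"
    if a < 1 and b < 1:
        return "U-shaped"              # density is unbounded at both ends
    if a > 1 and b > 1:
        if a == b:
            return "unimodal, symmetric"
        # The mode is (a - 1) / (a + b - 2); the longer tail is on the other side.
        return "unimodal, skewed left" if a > b else "unimodal, skewed right"
    # Remaining cases are monotone on (0, 1).
    return "increasing (J-shaped)" if a > b else "decreasing (reverse J-shaped)"

for a, b in [(0.5, 3), (0.5, 0.5), (2, 2), (2, 3), (2, 1), (3, 2), (3, 1), (2, 1 / 3)]:
    print(f"Beta({a}, {b}): {beta_shape(a, b)}")
```

Beta(2, 1) comes out as "increasing" here because its density is the straight line 2p, which is the triangular case mentioned in the list.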
As you can see, the Beta family CAN represent a wide
variety of opinion about p and hence is a reasonably
good APPROXIMATE prior distribution for one to choose
that best-reflects one's opinion, without having to do any
work on integration of the product of the prior and likelihood,
because the posterior distribution form can be written
IMMEDIATELY given the sample information.
That's the usefulness of a CONJUGATE prior on certain
problems (for Bayesians). However, even the family of
conjugate priors are grossly inadequate for a true Bayesian
for expressing his opinion about a particular p of a
Binomial. That's why Robert Schlaifer spent a large
amount of time providing numerical assessment software
and numerical integration software for just that ONE
problem (and other univariate parameter problems) and
wrote a book about it.
-- Reef Fish Bob.
Sorry about Bob (Reef Fish). He's an embarrassment to the
newsgroup--and, as is probably obvious should you look at his history
of posts, has the opposite of constructive motives.
What I said in my previous post is correct.
Cheers,
--
John Uebersax PhD
I was very explicit about what John Uebersax said was wrong about
Bayesian statistics:
JU> So I think we should just dispense with the word "prior"
JU> here.
RF> That is a completely NON-Bayesian attitude.
JU> Usually one looks to theory for this. Often one just assumes
JU> a normal distribution. That follows when, for example,
JU> the parameter reflects the joint influence of many different factors,
JU> some positive and some negative, such that a bell-shape curve results.
RF> That is not even a good NON-Bayesian approach. You don't
RF> simply ASSUME anything has a normal distribution or any other
RF> distribution. You use data to VALIDATE whether that is even
RF> a reasonable assumption.
JU> I could be wrong, but I believe a "flat distribution" and a "uniform
JU> distribution" would mean the same thing.
You WERE wrong! This is one of the most basic ideas in Bayesian
prior distribution that John Uebersax did not even know!!
RF> You are wrong in your entire understanding of what Bayesian
RF> Statistics is about. A flat distribution as used by pseudo-
RF> Bayesians is NOT necessarily a uniform distribution. A uniform
RF> distribution cannot have infinite endpoints, for one thing.
JU> What I said in my previous post is correct.
What you said was completely WRONG, and you gave the OP the
worst statistical advice anyone could possibly have given.
Even Anon Bob O'Hara pointed out your error, while he made
his usual error himself which I corrected and provided the OP
the INFO about Bayesian priors relative to the UNIFORM prior:
RF> You are ALMOST correct. That's why I said Uebersax is NOT a
RF> Bayesian. We already know Bob O'Hara isn't one. :-)
RF> The posterior distribution is the likelihood function if the prior is
RF> "diffuse" (which is NOT the same as a "uniform" or "flat" prior).
which pointed out BOTH John Uebersax and Anon O'Hara were wrong.
Then I proceeded to explain some related concepts and info about
the use of the UNIFORM and other prior distributions in that post:
==== begin excerpt
==== end excerpt
You, John Uebersax, who hardly EVER had anything correct
or useful to post in sci.stat.math had the GALL to make this
unsubstantiated ad hominem statement about me, in this
particular instance of my correction of your ERRORS and
ill advice? :
> Sorry about Bob (Reef Fish). He's an embarassment to the
> newsgroup--and, as is probably obvious should you look at his history
> of posts, has the opposite of constructive motives.
Look at your OWN (John Uebersax') posting history:
Leaving The Catholic Church Is Not A Solution alt.atheism 3 hours ago
Leaving The Catholic Church Is Not A Solution alt.atheism 3 hours ago
different priors (flat, uniform, etc) sci.stat.math 3 hours ago
different priors (flat, uniform, etc) sci.stat.math 23 hours ago
Leaving The Catholic Church Is Not A Solution alt.atheism 23 hours ago
Orthodoxy, postmodernity and the Emerging Church soc.culture.south-africa 23 hours ago
Origen and reincarnation (followup) soc.history.ancient 25 hours ago
The (infamous) regression and correlation discussion -- Summary sci.stat.edu 3 days ago
{MEDSTATS} Correlation from disturbed data MedStats Oct 14
The current post was the one 3 hours ago. (nothing but ad hominem)
The post 23 hours ago was the one with the ERRORS and bad advice
The (infamous) regression 3 days ago was one of vacuous content
You should stick to your atheism and soc. culture and soc history
newsgroups, and leave out your POLLUTION of statistics groups!
People who live in glass houses shouldn't throw stones, John Uebersax!
JU> look at his history of posts,
This is the MOST RECENT history of MY posts (counted by Google)
511 messages sci.stat.math
61 messages sci.stat.edu
26 messages sci.math
14 messages sci.stat.consult
14 messages alt.sci.math.probability
Principal Component Analysis using R sci.stat.edu 27 minutes ago
My 300th post for the month of October sci.stat.math 7 hours ago
Testing for normality sci.stat.math 9 hours ago
Assessing credibility of a q-q plot by presence of outliers sci.stat.math 9 hours ago
Principal Component Analysis using R sci.stat.edu 10 hours ago
Experienced Statistician to help decide whether a regression is legitim sci.stat.math 10 hours ago
Experienced Statistician to help decide whether a regression is legitim sci.stat.math 11 hours ago
(1 Typo correction) Re: Testing for normality sci.stat.math 15 hours ago
different priors (flat, uniform, etc) sci.stat.math 19 hours ago
Read those for STATISTICAL content and substance.
John Uebersax, you are just ANOTHER one of those who post in
sci.stat.math who are ignorant in statistics and have nothing but
NOISE to post.
-- Reef Fish Bob.
As an example of why you're wrong, look at inference of the mean of a
normal distribution, with known variance, from a single data point. The
log-likelihood is
l(mu | x) = K - 0.5*tau*(x - mu)^2
where mu is the mean, x is the datum, and tau is the precision
(=1/variance), and K is a constant. This is a normal distribution with
mean x and precision tau.
As a diffuse prior, we could use a normal with mean mu_p and precision
tau_p, so that the log of the posterior is:
log(P(mu | x)) = K_p - 0.5*(tau*(x - mu)^2 + tau_p*(mu - mu_p)^2)
After a bit of manipulation, you find that the posterior is normal with
mean (tau*x + tau_p*mu_p)/(tau+tau_p) and precision tau+tau_p. If
tau_p>0, this is not the same as the likelihood, so even if tau_p is
small (and hence the prior is diffuse), as long as it's positive, it
contradicts what you wrote. If tau_p=0, the posterior and the
likelihood are the same, but now the prior is flat, so is not "diffuse".
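The normal example above is easy to check numerically. The sketch below uses the standard precision-weighted update for a normal mean with known precision; the numbers for x, tau and mu_p are made up for illustration and do not come from the thread.

```python
def normal_posterior(x, tau, mu_p, tau_p):
    """Posterior mean and precision for a normal mean with known data precision
    tau, a single observation x, and a normal prior with mean mu_p and
    precision tau_p (tau_p = 0 is the flat, improper limit)."""
    post_precision = tau + tau_p
    post_mean = (tau * x + tau_p * mu_p) / post_precision
    return post_mean, post_precision

x, tau, mu_p = 3.0, 1.0, 0.0   # illustrative values only
for tau_p in (1.0, 0.1, 0.001, 0.0):
    mean, precision = normal_posterior(x, tau, mu_p, tau_p)
    print(f"tau_p={tau_p}: posterior mean={mean:.4f}, precision={precision:.4f}")
```

As tau_p shrinks, the posterior mean moves toward x and the precision toward tau, and only at tau_p = 0 do they coincide exactly with the likelihood's mean and precision.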
I don't make mistakes like yours, Anon Bob.
You snipped all of my beta prior examples. The UNIFORM prior
is beta(1,1).
You obviously DON'T KNOW the posterior for the Binomial p
inference with a beta prior.
Else you wouldn't have made your present round of NOISE.
> As an example of why you're wrong, look at inference of the mean of a
> normal distribution, with known variance, from a single data point. The
> log-likelihood is
>
> l(mu | x) = K - 0.5*tau*(x - mu)^2
That is an entirely DIFFERENT problem from the Binomial p problem.
It has NOTHING to do with the Binomial p problem in which I showed
the UNIFORM prior being non-diffuse and INFORMATIVE.
You're copying the wrong recipe for the wrong problem.
> Bob O'Hara
Your FREE tuition had been withdrawn LONG ago.
-- Reef Fish Bob.
> For Bayesian Inference on the parameter p of a Binomial distribution
> or a Bernoulli Process, the beta distribution is a member of the
> conjugate prior family -- meaning both the prior AND posterior
> belongs to the same distribution family -- Beta.
>
> The uniform distribution on (0,1) is a Beta distribution with
> parameters (1,1) and is an INFORMATIVE prior.
Can we hear a bit more about how Beta(1,1) is an informative prior for a
binomial problem?
--
Dwinsemius
It CHANGES the likelihood function to form the posterior distribution.
It's in every Freshman textbook in Bayesian statistics.
-- Reef Fish Bob.
Some of us will never be Freshmen again. How does a uniform prior on
(0,1) change L(.)?
Or perhaps an equivalent question: What would be an uninformative Beta
prior for B(.)?
--
David Winsemius
How true. It's too late for you to ask this kind of question about
Bayesian statistics. Go pick up ANY First Course textbook in
Bayesian statistics, and you'll find the answer there.
What you're asking is like someone who walks into this newsgroup
and asks, "what is a sample mean?"
>
> Or perhaps an equivalent question: What would be an uninformative Beta
> prior for B(.)?
Don't you remember that you said LONG ago that you were glad you didn't
have to listen to my free lectures any more?
Perhaps you, Bob O'Hara, and John Uebersax can form a little
study group and learn Bayesian inference about a Binomial p,
like a NON-Bayesian student learns what a Binomial distribution
is.
That's how LOW the LEVEL of you three is in Bayesian statistics.
Go to some other topic in which you know a little SOMETHING.
I don't have the patience to teach kindergarten students in any
aspect of statistics.
-- Reef Fish Bob.
Both "flat" and "uniform" are used, with vague meanings.
Often priors are chosen because they are conjugate, or
because they are invariant.
Any real prior must come from the user. IF, and it is
a big if, the statistician can prove that it does not
make much difference over a wide range which includes
the user's prior, a simple prior, or even a non-Bayesian
procedure, can be a good choice. In most cases, this
is likely to require good mathematics.
The term "no prior" is meaningless.
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
>>> I have a probability on several parameters and one of them is
>>> nuisance one. I want to margizalize over it by simply
>>> integrating the probability along the whole range of values
>>> that such nuisance parameter is allowed to take.
>> Okay good. You want to "integrate out a nuisance parameter."
>>> Doing it that way is what people call using a flat prior?
>> I think possibly you're combining two different issues here.
>> A "prior distribution" is a Bayesian term, and means, basically,
>> what you think the distribution of a parameter is before
>> (prior to) some additional data or evidence leads you to
>> modify your prior beliefs (i.e., produce an updated, or
>> posterior estimate of the distribution).
>> So I think we should just dispense with the word "prior"
>> here. It seems like all you really want to know is: given
>> that I don't know the shape of my nuisance parameter
>> distribution, what's a good guess?
>Shouldn't that just come from the likelihood? i.e. it should be in the
>model already.
>Bob
In a very vague sense yes. Choosing the form of the
likelihood corresponds to putting prior probability
one on that form of the distribution. The question
in general is whether the robustness of the resulting
procedure is adequate to cover the deviation.
In many cases, there is good robustness for the
procedure, which need not be a true Bayes procedure.
This IS possible; the Gauss-Markov Theorem tells us
that using normality for the errors in a regression
is not likely to cost a great deal. The prior and
the loss come from the user, not the statistician,
and my foundational paper questions whether they
can be separated; computationally only the product
matters at all.
>Sorry about Bob (Reef Fish). He's an embarassment to the
>newsgroup--and, as is probably obvious should you look at his history
>of posts, has the opposite of constructive motives.
>What I said in my previous post is correct.
However, what he stated in his post on this topic is
VERY good and accurate. The one thing I take objection
to is his claim that conjugate priors are likely to
give good approximations; from the history of batting
averages, neither the prior distribution for a player
selected at random, nor for a particular player over
time, is likely to resemble a beta distribution.
There is no such thing as an UNinformative prior.
To stick to this case, the two priors which have been
suggested as uninformative for the binomial are
Beta(.5, .5) and Beta(0,0). The improper prior
Beta(0,0) rarely gives any problem, and even the
theorems go through if the prior expected risk is
finite; it is neither necessary nor sufficient that
the prior measure be finite.
The problem with this interpretation is that any prior will have the
same effect, so there would be no such thing as a non-informative prior.
As non-informative priors are defined and discussed in the
literature, they do exist.
Non-informative priors are generally defined as priors which only add a
small amount of information, as compared to the likelihood. How does
the beta(1,1) shape up?
For the binomial with n successes in N trials, the likelihood (up to a normalising constant) is:
L(p| r) = p^n (1-p)^(N-n)
The pdf of a beta distribution is:
P(p) = K p^(alpha-1) (1-p)^(beta-1)
(where K is a normalising constant) so the posterior is
P(p|r) = K_p p^(n+alpha-1) (1-p)^(N-n+beta-1)
For a beta(1,1), this becomes:
P(p|r) = K_p p^(n) (1-p)^(N-n)
i.e. algebraically the same as the likelihood. In other words, it
doesn't add any information to the likelihood. This is pretty much
definitive of a "non-informative prior".
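The algebra above can be checked numerically. The sketch below assumes the standard parametrization, in which Beta(a, b) has kernel p^(a-1) (1-p)^(b-1), and uses made-up data (n = 3 successes in N = 10 trials). It compares the Beta(n+1, N-n+1) density, which is the posterior under a beta(1,1) prior, with the raw likelihood on a grid of p values.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, computed on the log scale via lgamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def likelihood(p, n, N):
    """Binomial likelihood kernel for n successes in N trials."""
    return p ** n * (1 - p) ** (N - n)

n, N = 3, 10                                  # illustrative data only
grid = [i / 100 for i in range(1, 100)]

# Posterior under a beta(1,1) prior is Beta(n + 1, N - n + 1).
ratios = [beta_pdf(p, n + 1, N - n + 1) / likelihood(p, n, N) for p in grid]
spread = max(ratios) - min(ratios)
print(ratios[0], spread)
```

A ratio that is constant in p means the posterior is just the likelihood rescaled to integrate to one; the constant is the normalising factor 1/B(n+1, N-n+1).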
Intriguingly, Reef Fish also made this comment on this thread:
RF> The posterior distribution is the likelihood function if the prior
RF> is "diffuse" (which is NOT the same as a "uniform" or "flat" prior).
So, apparently the beta(1,1), which is also the uniform distribution, is
"diffuse" but not "uniform".
> It's in every Freshman textbook in Bayesian statistics.
>
Indeed.
Edwards, Lindman and Savage's "Bayesian Statistical Inference for
Psychological Research" has a Bernoullian example in which they use a
constant alternative prior (which from their context I am taking to be
an "uninformative" alternative to their informative null prior with
Beta(r+1,N-r+1)).
> Didn't remember you said LONG ago that you were glad you don't
> have to listen to my free lecture any more?
My prior on that hypothesis has probability for truth less than 1/2.
Could I have said something like that after you tried to tell us that
the p-value for the null when the observed was X=0 in any discrete
sample space with X=0 as the most extreme outcome was going to equal
zero? I'm not saying it's impossible, but my memory was that you kicked
me out of the lecture hall for asking too many questions and failing to
be a good little "student."
--
David Winsemius
Thanks for your endorsement, Herman. Herman was familiar
with the various schools of Bayesian statistics long before I learned
them from Savage and other Bayesians.
>The one thing I take objection
> to is his claim that conjugate priors are likely to
> give good approximations;
I think you overlooked my keyword "grossly inadequate" (see below).
I have never made such a statement. Perhaps you read TOO much
into my explanation of the beta prior for the binomial p that because
of the versatile shapes the beta family can represent that they are
often a reasonable approximation of one's prior for p. But I said
immediately after the beta shape-description section,
RF> That's the usefulness of a CONJUGATE prior on certain
RF> problems (for Bayesians). However, even the family of
RF> conjugate priors are grossly inadequate for a true Bayesian
RF> for expressing his opinion about a particular p of a
RF> Binomial.
-- Reef Fish Bob.
> In article <Xns986A780C6...@216.196.97.136>,
> David Winsemius <doe_...@comcast.n0T> wrote:
>>"Reef Fish" <large_nass...@yahoo.com> wrote in
>>news:1161965644.2...@i3g2000cwc.googlegroups.com:
>
>>> For Bayesian Inference on the parameter p of a Binomial distribution
>>> or a Bernoulli Process, the beta distribution is a member of the
>>> conjugate prior family -- meaning both the prior AND posterior
>>> belongs to the same distribution family -- Beta.
>
>>> The uniform distribution on (0,1) is a Beta distribution with
>>> parameters (1,1) and is an INFORMATIVE prior.
>
>>Can we hear a bit more about how is Beta(1,1) is an informative prior
>>for a binomial problem?
>
> There is no such thing as an UNinformative prior.
>
> To stick to this case, the two priors which have been
> suggested as uninformative for the binomial are
> Beta(.5, .5) and Beta(0,0). The improper prior
> Beta(0,0) rarely gives any problem, and even the
> theorems go through if the prior expected risk is
> finite; it is neither necessary nor sufficient that
> the prior measure be finite.
I have seen B(0.5,0.5) offered as a (relatively) uninformative prior.
Throwing in Beta(0,0) was disturbing, since it doesn't exist. Calling it
an "improper prior" did provide a useful search target. Here is a
discussion I found helpful.
Zhu, M, Lu A. "The Counter-intuitive Non-informative Prior for the
Bernoulli Family" Journal of Statistics Education V 12, No.2 (2004),
<http://www.amstat.org/publications/jse/v12n2/zhu.pdf>
I had wondered whether the prior should be weighted proportional to
1/var(data|distribution) which would be 1/npq for the binomial. The
Haldane prior mentioned by Zhu does just this, and Zhu & Lu suggest it is
pretty close to the limiting form of Beta(e,e).
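To see how the candidate "uninformative" Beta priors differ in practice, one can compare posterior means under the usual conjugate update, Beta(a, b) to Beta(a + r, b + n - r). The sketch below assumes that standard parametrization and uses made-up data (r = 3 successes in n = 10 trials). As e goes to 0, the Beta(e, e) kernel p^(e-1) (1-p)^(e-1) approaches 1/(p(1-p)), the Haldane form.

```python
def posterior_mean(r, n, a, b):
    """Mean of the Beta(a + r, b + n - r) posterior for r successes in n trials."""
    return (a + r) / (a + b + n)

r, n = 3, 10   # illustrative data only
for label, e in [("Beta(1,1), uniform", 1.0),
                 ("Beta(0.5,0.5)", 0.5),
                 ("Beta(0.01,0.01), near Haldane", 0.01)]:
    print(f"{label}: posterior mean = {posterior_mean(r, n, e, e):.4f}")
print(f"sample proportion r/n = {r / n:.4f}")
```

The posterior mean approaches the sample proportion r/n as e shrinks, which is one sense in which the limiting prior adds the least to the data.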
--
David Winsemius
DeGroot's comment on Shafer's paper "Lindley's paradox" criticized the
idea that "diffuse" should mean equal probability for all parameter
values, and argued that in the normal(m,s) case "diffuse" more
appropriately implies that, for example, m^2 might be large - that is, the
variance is large. Similarly, in the beta prior case, beta(1,1) is
uniform, but may not be diffuse enough, because as you let both a,b
starting from beta(a=1,b=1) go to zero, the variance increases.
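The variance claim can be checked with the closed-form variance of a Beta(a, b) distribution, ab / ((a+b)^2 (a+b+1)). A small sketch, with the particular values of a chosen only for illustration:

```python
def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

for a in (1.0, 0.5, 0.1, 0.01):
    print(f"Beta({a}, {a}): variance = {beta_variance(a, a):.4f}")
```

For a = b the formula reduces to 1 / (4(2a + 1)), so the variance rises from 1/12 at beta(1,1) toward the upper bound 1/4 as a goes to zero.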
My one-line response turned out to be more succinct and penetrating
than I had thought, because it is the KEY to any PROPER prior
that is informative!
> [...]
> > Intriguingly, Reef Fish also made this comment on this thread:
> > RF> The posterior distribution is the likelihood function if the prior
> > RF> is "diffuse" (which is NOT the same as a "uniform" or "flat" prior).
> >
> > So, apparently the beta(1,1), which is also the uniform distribution, is
> > "diffuse" but not "uniform".
No, the uniform distribution is hardly diffuse. It is uniform AND
informative,
as I had said before.
I was out of town on the weekend. I sort of took advantage of that to
see what Bayesians I could flush out of the woodwork, to show, without
any doubt, that John Uebersax and Anon Bob O'Hara were definitely
NON-Bayesians and that they were completely wrong, as I had indicated
with the few hints I gave.
The first one to surface was Herman Rubin, who mentioned some points
others followed up on, but Herman misunderstood the statement I made
about "conjugate priors" (which I corrected this morning).
David Winsemius indicated he made SOME efforts to READ what's
relevant. When he showed that he read the Edwards, Lindman, and
Savage paper, I was TEMPTED to explain to him what the score
was, since he wasn't being confrontational even though his original
post right after Anon Bob (even after my explanation) seemed to
indicate that he never read a Freshman's BOOK about Bayesian
inference, and he STILL hasn't, or else he would have solved the
mystery himself. So, I'll reveal the Da Vinci Code to him and
all when I get to his post in the afternoon, following up on Herman
Rubin's comments on his questions which I didn't answer.
Then DZ emerged. I think that pretty much exhausted ALL the
educated Bayesians in sci.stat.math, from what I can gather in
my reading this group for 1 1/2 years.
> DeGroot's comment on Shafer's paper "Lindley's paradox" criticized the
> idea that "diffuse" should mean equal probability for all parameter
> values and that in the normal(m,s) case, "diffuse" implies, more
> appropriatly, that for example m^2 might be large - that is, the
> variance is large.
That is one of the meanings of the term "diffuse", and the normal
example (with a normal likelihood) is a GOOD example to say that
you CANNOT have a uniform distribution over the entire real line!
But it says more than that. It's related to Savage's "principle of
stable estimation" which gave a very quantifiable meaning to the
meaning of diffuse in the sense of "locally uniform" over a
likelihood function that is very sharp.
I had used a slightly altered and simplified meaning of "diffuse"
prior: one that leaves the posterior exactly the
same as the normalized likelihood function, so that the non-Bayesian
MLE becomes the maximum point of the posterior for a Bayesian,
since the likelihood function and the unnormalized posterior coincide.
> Similarly, in the beta prior case, beta(1,1) is
> uniform, but may not be diffuse enough, because as you let both a,b
> starting from beta(a=1,b=1) go to zero, the variance increases.
That is one way to look at it. But THIS was what I was pointing at, for
the Freshman textbook nobody seemed to have found for the Da
Vinci Code of the conjugate prior beta for the binomial p.
The CONJUGATE part means both the prior and the posterior
are members of the beta family. If the prior distribution of the
binomial p is beta( alpha, beta ), and r and (n-r) are the
powers of p and (1-p) in the likelihood function, then the
posterior parameters will be changed to (alpha + r) and
(beta + n - r), in the beta family!
This needs one more step of explanation to show why Anon
Bob O'Hara was looking at BOTH the beta(1,1) prior and the
likelihood function and STILL missed it! That was the proof
that Bob O'Hara had never seen that Freshman book either
or any book, on how to make a Bayesian inference of the
parameter p of a Bernoulli process or a Binomial distribution.
I hope SOMEONE can manage to find a Bayesian book
(the more elementary the better) and show us what happens
when a uniform prior Beta(1,1) is applied to the Binomial
problem of p given r successes and f failures, r + f = n.
Meanwhile, I'll take a short break before explaining it in my
reply to David Winsemius's latest post of Sun, Oct 29 2006
1:53 pm, which contains both Herman Rubin's reply
yesterday, and a very relevant webpage provided by David.
Stay tuned.
-- Reef Fish Bob.
With fear of pouring gasoline on the fire, I'll mention that
_Statistical Inference and Prediction in Climatology: A
Bayesian Approach_ by Edward Epstein is as close as
I can come in my library. Chapter 3 treats Bernoulli
processes, and beta distributions as conjugate priors.
Interestingly (if I'm reading it correctly) he suggests using
r=n=0 as "vague" prior parameters. He acknowledges
that this gives a prior beta density that is undefined, but
writes, "Nevertheless, if we ignore this deficiency and
apply Eq. (3.5) using r'=n'=0 as prior parameters, then
the posterior parameters become r''=r and n''=n. The
posterior density, unlike the prior, is proper (its integral
converges) if r!=0 and r!=n. In other words, if we feign
"total ignorance" and then obtain a set of data with at
least one success and one failure, then the resulting
posterior density is a mathematically proper form...".
However, he thinks that generally a more informative
prior is almost always available to the knowledgeable
analyst.
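The propriety condition quoted from Epstein can be illustrated with a short sketch. It assumes (this is an assumption about his parametrization, inferred from the quoted condition rather than checked against the book) that a beta density with parameters (r, n) has kernel p^(r-1) (1-p)^(n-r-1), and that the update is r'' = r' + r, n'' = n' + n. The function names are hypothetical.

```python
def update(r_prior, n_prior, r, n):
    """Conjugate update in (r, n) notation: add sample successes and trials."""
    return r_prior + r, n_prior + n

def is_proper(r2, n2):
    """Assumed kernel p**(r2 - 1) * (1 - p)**(n2 - r2 - 1); its integral over
    (0, 1) converges only if both exponents are greater than -1."""
    return r2 > 0 and (n2 - r2) > 0

# "Vague" prior r' = n' = 0 combined with a few illustrative samples.
for r, n in [(3, 10), (0, 10), (10, 10)]:
    r2, n2 = update(0, 0, r, n)
    print(f"r={r}, n={n}: posterior (r'', n'') = ({r2}, {n2}), proper = {is_proper(r2, n2)}")
```

Under these assumptions the posterior is proper only when the sample contains at least one success and at least one failure, which matches the quoted condition r != 0 and r != n.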
Epstein then goes on to work out some examples with
more informative priors, but none specifically with a
Beta(1,1) prior. But if any of the readers are interested,
that's a reference on the subject, FWIW.
>
> Meanwhile, I'll take a short break before explaining it in my
> reply to David Wisenmius's latest post of Sun, Oct 29 2006
> 1:53 pm, which contain both Herman Rubin's reply
> yesterday, and a very relevant webpage provided by David.
>
> Stay tuned.
>
> -- Reef Fish Bob.
Cheers,
Russell
What gasoline? What fire? ;-) The only things HOT are those
that came out of the mouths of our NOISIEST posters.
I don't know who Epstein is, but as I said, ANY elementary
book will do, and I think you delivered.
> Chapter 3 treats Bernoulli
> processes, and beta distributions as conjugate priors.
> Interestingly (if I'm reading it correctly) he suggests using
> r=n=0 as "vague" prior parameters. He acknowledges
> that this gives a prior beta density that is undefined, but
> writes, "Nevertheless, if we ignore this deficiency and
> apply Eq, (3.5) using r'=n'=0 as prior parameters, then
> the posterior parameters become r''=r and n''=n.
He even had the primes according to the usual cookbook
conventions. The r and n denote the SAMPLE r and n,
those in the likelihood function. r' and n' denote the
parameters in the prior distribution Beta(r',n') rather than
alpha and beta. That's because then you have the
"no brainer" of using the Beta as the conjugate prior,
because the posterior is given by (double primes)
r" = r + r' and n" = n + n'.
Right there is your Da Vinci Code for this simple result!
That's why in order to get the non-informative prior so
that the posterior is the same as the likelihood function,
r' and n' must both be zero. The improper Beta(0,0) as
Herman and David mentioned.
> The
> posterior density, unlike the prior, is proper (its integral
> converges) if r!=0 and r!=n. In other words, if we feign
> "total ignorance" and then obtain a set of data with at
> least one success and one failure, then the resulting
> posterior density is a mathematically proper form...".
> However, he thinks that generally a more informative
> prior is almost always available to the knowledgeable
> analyst.
Of course. Even if one feels any p is as likely as another,
you have the UNIFORM, which is informative!
Your posterior will have parameters (r + 1) and (n + 1).
>
> Epstein then goes on to work out some examples with
> more informative priors, but none specifically with a
> Beta(1,1) prior. But if any of the readers are interested,
> that's a reference on the subject, FWIW.
>
> >
> > Meanwhile, I'll take a short break before explaining it in my
> > reply to David Wisenmius's latest post of Sun, Oct 29 2006
> > 1:53 pm, which contain both Herman Rubin's reply
> > yesterday, and a very relevant webpage provided by David.
Before I go there, as I had expected, any textbook would have
answered the question that the uniform is NOT noninformative.
The one little catch that tripped O'Hara, was that in order for
the POSTERIOR distribution to remain the original likelihood
function (for the NON-Bayesians), the original likelihood
function MUST be written as if it were a Beta so that when it
is combined with Beta(0,0), it'll still be a Beta which is the
"conjugate" part.
Now watch carefully. :-)
The NON-Bayesian likelihood for r successes out of n trials
is proportional to
p^r (1 - p)^(n-r)
Even O'Hara knew that, with a slight change of notation:
BO> L(p| r) = p^n (1-p)^(N-n)
But the L(p|r) is NOT in form of a Beta density! The kernel
of the Beta density has (alpha -1) and (beta - 1) in the
exponents!
That's why, in the form of a Beta density, the r and n of
the likelihood function must be parametrized as (r+1) and (n+1).
The posterior BETA, from the Beta prior (alpha, beta), will
be Beta(r+1+alpha, n+1+beta), which is why alpha and
beta need to be both ZERO for the posterior to be
Beta(r+1, n+1) which is the original likelihood function
p^r (1 - p)^(n-r)
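As a numerical check on writing the likelihood in Beta form, here is a minimal sketch in the standard (alpha, beta) convention, where the Beta kernel is p^(alpha-1) (1-p)^(beta-1). In that convention p^r (1-p)^(n-r) is the kernel of Beta(r+1, n-r+1), so its integral over (0,1) should equal the Beta function B(r+1, n-r+1). The data values and function names are made up for illustration.

```python
import math


def beta_function(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), computed via log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def integrate_likelihood(r, n, steps=100_000):
    """Midpoint-rule integral of p**r * (1 - p)**(n - r) over (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += p**r * (1.0 - p) ** (n - r)
    return total * h


r, n = 3, 10  # illustrative data: 3 successes in 10 trials
numeric = integrate_likelihood(r, n)
exact = beta_function(r + 1, n - r + 1)
print(numeric, exact)
```

Dividing the likelihood by this constant turns it into a proper density in p, which is the step that lets it be compared with a prior or posterior Beta.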
Now you can go to my post requesting a Bayesian textbook
and pick out exactly where Bob O'Hara erred. Since he is
neither a Bayesian nor trained in Bayesian statistics, he had
trouble relating a likelihood function (which is NOT a density
in the PARAMETER) to a Bayesian distribution for prior and
posterior alike, that is a distribution of the PARAMETER of
interest, p in the Bernoulli case.
-- Reef Fish Bob.
David Winsemius wrote:
> hru...@odds.stat.purdue.edu (Herman Rubin) wrote in
> news:ei10f2$4j...@odds.stat.purdue.edu:
>
> > In article <Xns986A780C6...@216.196.97.136>,
> > David Winsemius <doe_...@comcast.n0T> wrote:
> >>"Reef Fish" <large_nass...@yahoo.com> wrote in
> >>news:1161965644.2...@i3g2000cwc.googlegroups.com:
> >
RF> For Bayesian Inference on the parameter p of a Binomial distribution
RF> or a Bernoulli Process, the beta distribution is a member of the
RF> conjugate prior family -- meaning both the prior AND posterior
RF> belongs to the same distribution family -- Beta.
>
RF> The uniform distribution on (0,1) is a Beta distribution with
RF> parameters (1,1) and is an INFORMATIVE prior.
DW> Can we hear a bit more about how Beta(1,1) is an informative prior
DW> for a binomial problem?
HR> There is no such thing as an UNinformative prior.
That is overstating the case slightly by Herman, even though in the
strictest Bayesian sense, no one can be "completely ignorant" about
any parameter, or anything, even though I've used the term
"completely ignorant" about some posters myself. :-)
Beta(1,1) is the uniform that is conjugate and changes the LIKELIHOOD
to the posterior! It would need Beta(0,0) to keep the likelihood function
unchanged! Here is the part that O'Hara missed.
Now we've seen how it changes and how EASY the change is, without
doing the integration or other mathematical work that might otherwise
be involved:
Likelihood (sample): ro = r + 1, no = n + 1, where r is the number
of successes in n; ro and no are the parameters in a Beta(ro, no).
Prior: Beta(r', n')
Posterior: Beta(r" = ro + r', n" = no + n')
HR> To stick to this case, the two priors which have been
HR> suggested as uninformative for the binomial are
HR> Beta(.5, .5) and Beta(0,0). The improper prior
HR> Beta(0,0) rarely gives any problem, and even the
HR> theorems go through if the prior expected risk is
HR> finite; it is neither necessary nor sufficient that
HR> the prior measure be finite.
Herman is correct, of course. We had already seen why the
Beta(0,0), improper as it is, is NEEDED to preserve the
likelihood function as the posterior distribution.
I don't know why Zellner (1996) and Haldane got the names attached
to that improper prior. Beta(0,0) is also called the 1/(p(1-p)) prior
of Zellner (and Haldane).
DW> I have seen B(0.5,0.5) offered as a (relatively) uninformative prior.
That's the prior proportional to 1/sqrt(p(1-p)), sometimes called
Jeffreys' uninformative rule for the binomial p, in
http://www.amstat.org/publications/jse/v12n2/zhu.pdf citing Gelman (1995).
DW> Zhu, M, Lu A. "The Counter-intuitive Non-informative Prior for the
DW> Bernoulli Family" Journal of Statistics Education V 12, No.2 (2004),
DW> <http://www.amstat.org/publications/jse/v12n2/zhu.pdf>
DW> Throwing in Beta(0,0) was disturbing, since it doesn't exist. Calling it
DW> an "improper prior" did provide a useful search target. Here is a
DW> discussion I found helpful. <URL above>
> --
> David Winsemius
Now that we have covered the "uniform prior", the "uninformative prior",
and the "conjugate prior", that leaves the "diffuse prior", which is
actually close to the OP's term "no prior". In other words, that is
a prior that you don't need to DO anything with: you simply use the
likelihood function as if it were the posterior distribution -- as an
APPROXIMATION, when the likelihood function is so sharply peaked in
a small region that it is unnecessary to work hard at
assessing one's real prior.
Savage called this the "diffuse" prior, in conjunction with his
"principle of stable estimation". This concept is discussed in
great detail in Edwards, Lindman, and Savage (1963), "Bayesian
Statistical Inference for Psychological Research", Psychological
Review, 193-242.
A "must read" for every serious student of Applied Bayesian
Statistics.
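The approximation described above can be illustrated numerically. The sketch below is not taken from Edwards, Lindman, and Savage; it assumes a binomial likelihood and a hypothetical "real" prior with kernel p^3 (1-p), and measures the L1 distance between the normalized likelihood and the actual posterior on a grid. All names and data values are made up.

```python
import math


def normalized(log_f, grid):
    """Evaluate exp(log_f) on the grid and scale it to sum to 1."""
    logs = [log_f(p) for p in grid]
    top = max(logs)                       # subtract the max to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def l1_gap(r, n, log_prior, cells=2000):
    """L1 distance between the normalized likelihood and the posterior."""
    grid = [(i + 0.5) / cells for i in range(cells)]

    def log_lik(p):
        return r * math.log(p) + (n - r) * math.log(1.0 - p)

    lik = normalized(log_lik, grid)
    post = normalized(lambda p: log_lik(p) + log_prior(p), grid)
    return sum(abs(a - b) for a, b in zip(lik, post))


def log_prior(p):
    """Hypothetical 'real' prior with kernel p**3 * (1 - p), i.e. Beta(4, 2)."""
    return 3.0 * math.log(p) + math.log(1.0 - p)


gap_small = l1_gap(3, 10, log_prior)       # 3 successes in 10 trials
gap_large = l1_gap(300, 1000, log_prior)   # same proportion, 100 times the data
print(gap_small, gap_large)
```

The gap shrinks as the likelihood sharpens, which is the sense in which the prior becomes "diffuse" relative to the data.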
-- Reef Fish Bob.
> However, what he stated in his post on this topic is
> VERY good and accurate.
While what he said may be accurate, it doesn't address the OP's main
question, which is how to integrate out a nuisance parameter.
We've now had 25 replies, and, as far as I can see, mine was the only
one that addressed the main question. The poster didn't want a
dissertation on priors. They want to know how to handle their specific
problem.
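For the mechanics of integrating out a nuisance parameter, a minimal numerical sketch follows. The OP did not describe the model, so everything here is assumed for illustration: a normal likelihood with mean mu (the parameter of interest), standard deviation sigma as the nuisance parameter, an allowed range of [0.5, 3.0] for sigma, and three made-up observations. It also compares a plain integral ("flat" weight) with one weighted by the uniform density 1/(hi - lo).

```python
import math


def likelihood(mu, sigma, data):
    """Normal likelihood of the data for mean mu and standard deviation sigma."""
    out = 1.0
    for x in data:
        z = (x - mu) / sigma
        out *= math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return out


def marginalize(mu, data, lo, hi, steps=400, uniform_prior=False):
    """Integrate the likelihood over sigma in [lo, hi] with the midpoint rule.

    With uniform_prior=True the integrand is multiplied by the uniform
    density 1 / (hi - lo); otherwise the weight is simply 1 ("flat").
    """
    h = (hi - lo) / steps
    total = sum(likelihood(mu, lo + (i + 0.5) * h, data) for i in range(steps)) * h
    return total / (hi - lo) if uniform_prior else total


def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    s = sum(values)
    return [v / s for v in values]


data = [1.0, 2.0, 3.0]                      # made-up observations
mu_grid = [i * 0.05 for i in range(81)]     # mu from 0 to 4
flat = normalize([marginalize(mu, data, 0.5, 3.0) for mu in mu_grid])
unif = normalize([marginalize(mu, data, 0.5, 3.0, uniform_prior=True) for mu in mu_grid])

max_diff = max(abs(a - b) for a, b in zip(flat, unif))
best_mu = mu_grid[max(range(len(mu_grid)), key=lambda i: flat[i])]
print(max_diff, best_mu)
```

On a bounded range the two weightings differ only by the constant 1/(hi - lo), which cancels once the marginal is normalized, so in this toy setting "flat" and "uniform" give the same answer.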
Herman, you've been a valuable contributor to these newsgroups for
years. However, if you can't see the issue with Bob's provocative
attitude, admit the problems it's causing, or take a stand to criticize
him, then I'm really surprised. We already have people talking about
leaving the groups (and others probably already have).
--
John Uebersax PhD
Two further points:
1. This is not a Bayesian problem. The OP only wants to integrate out
a nuisance parameter. This isn't a Bayesian problem, in part, because
there is no posterior distribution. There is no updating or revising.
If it's a Bayesian problem, then, using this notation for Bayes'
theorem:
P(A|B) = P(A) P(B|A) / P(B)
would you be so kind as to explain what B or P(B) are here.
2. Nearly all discussion has proceeded from the assumption that the
parameter is over the interval (0-1). Nothing in the OP suggested
that.
Again, Ruth, if you're following the discussion--sorry about all this.
Bob (Reef Fish) Ling is a notorious troller/flamer:
http://en.wikipedia.org/wiki/Internet_troll
and so lacking in personal insight that he seems to think trashing out
a newsgroup with 522 posts is a virtue.
--
John Uebersax PhD
You did not cite a single word of Herman Rubin. I was going to let
him or others comment on your point. He may still do so, but meanwhile
allow me to comment on your comments (even though they were addressed
to Herman Rubin).
>
> Two further points:
>
> 1. This is not a Bayesian problem. The OP only wants to integrate out
> a nuisance parameter. This isn't a Bayesian problem, in part, because
> there is no posterior distribution. There is no updating or revising.
But it IS a Bayesian problem, and it certainly has a posterior
distribution!
Even an uninformative prior has a posterior distribution, for a
Bayesian.
For a uniform prior, no matter what the Bayesian problem is, there
WILL be updating and revising, as in the case of a Bernoulli p.
You're simply re-confirming your absence of knowledge about
Bayesian statistics. Perhaps that's why Professor Rubin did not
bother to comment further, after he had already said that everything
I posted (which John Uebersax said was wrong) was "VERY good
and accurate".
>
>
> If it's a Bayesian problem, then, using this notation for Bayes'
> theorem:
>
> P(A|B) = P(A) P(B|A) / P(B)
>
> would you be so kind as to explain what B or P(B) are here.
That is Bayes' Theorem. Your QUESTION is unworthy even
for a nonBayesian college Freshman.
>
> 2. Nearly all discussion has proceeded from the assumption that the
> parameter is over the interval (0-1). Nothing in the OP suggested
> that.
It was used as an ILLUSTRATION that the "uniform" distribution is
"informative". It doesn't have to be over the interval (0-1). If
a Bayesian uses a uniform distribution as a prior over ANY interval,
that prior distribution WILL be informative and WILL update the
likelihood to a posterior distribution.
>
> Again, Ruth, if you're following the discussion--sorry about all this.
> Bob (Reef Fish) Ling is a notorious troller/flamer:
> John Uebersax PhD
I had already commented on your libelous statements under a
separate subject.
-- Reef Fish Bob.
>>>> For Bayesian Inference on the parameter p of a Binomial distribution
>>>> or a Bernoulli Process, the beta distribution is a member of the
>>>> conjugate prior family -- meaning both the prior AND posterior
>>>> belongs to the same distribution family -- Beta.
>>>> The uniform distribution on (0,1) is a Beta distribution with
>>>> parameters (1,1) and is an INFORMATIVE prior.
>>> Can we hear a bit more about how Beta(1,1) is an informative prior for a
>>> binomial problem?
>> It CHANGES the likelihood function to form the posterior distr.
>But what does this mean? I guess you could mean something similar to
>the way Fisher treated likelihood: he waved his Fiducial wand, and the
>conditioning magically reversed. Of course, the Bayesian version does
>this formally.
>The problem with this interpretation is that any prior will have the
>same effect, so there would be no such thing as a non-informative prior.
> As non-informative priors do exist, and are discussed in the
>literature, they do exist.
Is there such a thing as a non-informative prior? I see no
justification for such, and good reasons not to use such.
For some problems, invariant priors are used, with the best
invariant prior being the right invariant Haar measure for
the transformation group. Priors should be looked upon as
weight functions, rather than belief, and hence can have an
infinite integral. The usual argument given for invariant
priors is that if one has a location problem, it matters not
where the origin is located, or if one has a scale problem,
the units do not matter.
Now it is correct that the same results should be obtained
if the units are inches or meters, but this does not mean
that the inference should be the same if the numbers given
are the same. There are invariant problems in which there
are priors giving uniformly better results than invariant
priors, and these are not "unusual"; estimating the
covariance matrix of a multivariate normal is there already.
>Non-informative priors are generally defined as priors which only add a
>small amount of information, as compared to the likelihood. How does
>the beta(1,1) shape up?
What does this mean? If the sample size is large enough,
and the dimension is small enough, and the prior is "smooth",
it makes essentially no difference.
>For the binomial, the likelihood (up to a normalising constant) is:
>L(p| r) = p^n (1-p)^(N-n)
>The pdf of a beta distribution is:
>P(p) = K p^(alpha-1) (1-p)^(beta-1)
>(where K is a normalising constant) so the posterior is
>P(p|r) = K_p p^(n+alpha-1) (1-p)^(N-n+beta-1)
>For a beta(1,1), this becomes:
>P(p|r) = K_p p^(n) (1-p)^(N-n)
>i.e. algebraically the same as the likelihood. In other words, it
>doesn't add any information to the likelihood. This is pretty much
>definitive of a "non-informative prior".
So should one use a beta(1,1) or a beta(.5,.5) or a beta(0,0)?
This latter would use the density 1/(p - p^2), which is the
Fisher information for a single trial. This and its square root have
been suggested, and in the case of an invariant problem, will
automatically give an invariant procedure, which may be quite
bad throughout the parameter space.
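To see how much the three candidate priors differ in practice, here is a small sketch using the standard (alpha, beta) convention, in which a Beta(a, b) prior and r successes in n trials give a Beta(a + r, b + n - r) posterior. The data values are made up, and the helper name `posterior_mean` is only illustrative.

```python
def posterior_mean(a, b, r, n):
    """Mean of the Beta(a + r, b + n - r) posterior (standard convention)."""
    return (a + r) / (a + b + n)


candidates = {
    "beta(0,0)": (0.0, 0.0),      # improper, density 1/(p(1-p))
    "beta(.5,.5)": (0.5, 0.5),    # density proportional to 1/sqrt(p(1-p))
    "beta(1,1)": (1.0, 1.0),      # uniform on (0,1)
}

r, n = 3, 10  # illustrative data: 3 successes in 10 trials
means = {name: posterior_mean(a, b, r, n) for name, (a, b) in candidates.items()}
for name, m in means.items():
    print(f"{name:12s} posterior mean = {m:.4f}")
```

With beta(0,0) the posterior mean is the sample proportion r/n; the other two pull it slightly toward 1/2. The beta(0,0) posterior is only proper here because the assumed data contain at least one success and one failure.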
>Intriguingly, Reef Fish also made this comment on this thread:
>RF> The posterior distribution is the likelihood function if the prior
>RF> is "diffuse" (which is NOT the same as a "uniform" or "flat" prior).
>So, apparently the beta(1,1), which is also the uniform distribution, is
>"diffuse" but not "uniform".
Diffuse is a term used where the procedure is not affected.
Given a reasonable sized sample, there are lots of diffuse
priors. If the observations are normal with variance 1 and
the prior density exp(-theta) is used, the estimator is
decreased by the quantity 1/n, which is "negligible" with
respect to the standard error if n is at all large.
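The size of that shift can be checked on a grid. For n observations from N(theta, 1) with sample mean xbar and prior density proportional to exp(-theta), completing the square gives a N(xbar - 1/n, 1/n) posterior, so the posterior mean sits 1/n below xbar. The sketch below verifies this numerically; the sample size, sample mean, and function name are made up for illustration.

```python
import math


def posterior_mean(xbar, n, cells=40_000, half_width=2.0):
    """Grid estimate of the posterior mean of theta.

    Model: n observations from N(theta, 1) with sample mean xbar,
    prior density proportional to exp(-theta).
    """
    h = 2.0 * half_width / cells
    num = 0.0
    den = 0.0
    for i in range(cells):
        theta = xbar - half_width + (i + 0.5) * h
        w = math.exp(-0.5 * n * (theta - xbar) ** 2 - theta)
        num += theta * w
        den += w
    return num / den


n, xbar = 25, 0.4  # made-up sample size and sample mean
shift = posterior_mean(xbar, n) - xbar
standard_error = 1.0 / math.sqrt(n)
print(shift, abs(shift) / standard_error)
```

Relative to the standard error 1/sqrt(n), the shift is 1/sqrt(n) of a standard error, which is why it becomes negligible as n grows.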
I have agreed with Bob on THIS issue; I have disagreed with
him on many others.
As for the choice of priors, I am afraid that a dissertation
is needed, as it is for any statistical problem. The user
needs to understand that HE (or she) must make assumptions;
the statistician should question the user on this, and may point out
which assumptions are important. I have been consistent in
pointing this out. A nuisance parameter may be easy to
integrate out, or it may not; I cannot tell in this case
from the information supplied.
Statistics is not a collection of algorithms which came
down from some Divine source, and the user just has to
pick an algorithm which looks like it will work for the
problem. The user needs to formulate the probability
model, and supply the loss-prior combination; functionally,
they cannot be separated, and if you read my foundational
approach, one cannot easily come up with a reason to
separate them.
>Two further points:
>1. This is not a Bayesian problem. The OP only wants to integrate out
>a nuisance parameter. This isn't a Bayesian problem, in part, because
>there is no posterior distribution. There is no updating or revising.
From the foundational viewpoint, everything is Bayesian.
>If it's a Bayesian problem, then, using this notation for Bayes'
>theorem:
>P(A|B) = P(A) P(B|A) / P(B)
>would you be so kind as to explain what B or P(B) are here.
One does not have to use only this elementary view of Bayes
Theorem. B is the event that the data occur; it is generally
of probability 0. However, one can construct a conditional
distribution P(A|H)(B), where H is the sigma-field of possible
outcomes for the data, and this is how it is normally used by
Bayesians. It is not precisely defined, but can be changed on
a set of measure 0; this is unavoidable.
Another thing which makes this reasonable is that the above
conditional probability distribution CAN be looked upon as
the limit of the conditional distributions given discrete
restrictions on H, where the formula you have given holds;
just spread it out as much as possible.
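One way to picture the limiting argument is a toy example, sketched here under assumptions that are not in the thread: A is the hypothesis that a normal mean is 0 (against the alternative that it is 1), the prior probability of A is 1/2, and the observed value 0.8 is replaced by the event B that the observation falls in a small interval around 0.8. The elementary formula applies to B, and the answer approaches the density-based posterior as the interval shrinks.

```python
import math


def normal_cdf(x, mean):
    """CDF of N(mean, 1) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def discretized_bayes(x_obs, width, prior_a=0.5):
    """Elementary Bayes with B = {X within width/2 of x_obs}.

    A is the hypothesis 'mean = 0'; its complement is 'mean = 1'.
    Returns (P(A|B), P(B)).
    """
    lo, hi = x_obs - width / 2.0, x_obs + width / 2.0
    p_b_given_a = normal_cdf(hi, 0.0) - normal_cdf(lo, 0.0)
    p_b_given_not_a = normal_cdf(hi, 1.0) - normal_cdf(lo, 1.0)
    p_b = prior_a * p_b_given_a + (1.0 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / p_b, p_b


def density_bayes(x_obs, prior_a=0.5):
    """Limiting form: densities replace the probabilities of B."""
    fa, fb = normal_pdf(x_obs, 0.0), normal_pdf(x_obs, 1.0)
    return prior_a * fa / (prior_a * fa + (1.0 - prior_a) * fb)


x_obs = 0.8  # made-up observation
limit = density_bayes(x_obs)
results = {w: discretized_bayes(x_obs, w) for w in (1.0, 0.1, 0.001)}
for w, (post, p_b) in results.items():
    print(f"width={w:<6} P(B)={p_b:.6f} P(A|B)={post:.6f} limit={limit:.6f}")
```

As the width shrinks, P(B) goes to 0 while P(A|B) settles on the density-based value, which is the behaviour described above.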
> For some problems, invariant priors are used, with the best
> invariant prior being the right invariant Haar measure for
> the transformation group. Priors should be looked upon as
> weight functions, rather than belief, and hence can have an
> infinite integral. The usual argument given for invariant
> priors is that if one has a location problem, it matters not
> where the origin is located, or if one has a scale problem,
> the units do not matter.
>
> Now it is correct that the same results should be obtained
> if the units are inches or meters, but this does not mean
> that the inference should be the same if the numbers given
> are the same. There are invariant problems in which there
> are priors giving uniformly better results than invariant
> priors, and these are not "unusual"; estimating the
> covariance matrix of a multivariate normal is there already.
>
>> Non-informative priors are generally defined as priors which only add a
>> small amount of information, as compared to the likelihood. How does
>> the beta(1,1) shape up?
>
> What does this mean? If the sample size is large enough,
> and the dimension is small enough, and the prior is "smooth",
> it makes essentially no difference.
>
Indeed: but of course that isn't always the case, and I was trying to
pin down a specific comment by Reef Fish.
>> For the binomial, the likelihood (up to a normalising constant) is:
>
>> L(p| r) = p^n (1-p)^(N-n)
>
>> The pdf of a beta distribution is:
>
>> P(p) = K p^(alpha-1) (1-p)^(beta-1)
>
>> (where K is a normalising constant) so the posterior is
>
>> P(p|r) = K_p p^(n+alpha-1) (1-p)^(N-n+beta-1)
>
>> For a beta(1,1), this becomes:
>
>> P(p|r) = K_p p^(n) (1-p)^(N-n)
>
>> i.e. algebraically the same as the likelihood. In other words, it
>> doesn't add any information to the likelihood. This is pretty much
>> definitive of a "non-informative prior".
>
> So should one use a beta(1,1) or a beta(.5,.5) or a beta(0,0)?
> This latter would use the density 1/(p - p^2), which is the
> Fisher information for a single trial. This and its square root have
> been suggested, and in the case of an invariant problem, will
> automatically give an invariant procedure, which may be quite
> bad throughout the parameter space.
>
So, the invariant approach may not be the best in all cases. I guess
almost any "non-informative", "vague", "objective" approach to
developing priors will break down in some circumstances.