
Maximum Entropy Imputation


James A. Bowery

Mar 28, 2002, 10:33:24 PM
I'm interested in locating fundamental work in maximum entropy imputation
for simple data tables.

I've done a Google search for papers on imputation via maximum entropy but
found virtually nothing except a paper from Russia which seems to be
concerned primarily with longitudinal statistics or process data.
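
To make the question concrete: for a two-way table where only the row and
column totals are trusted, the maximum-entropy joint distribution subject to
those marginal constraints is the independence (outer-product) fill, and
missing cells can be imputed from it. A rough Python sketch, with made-up
totals standing in for real data:

import numpy as np

row_totals = np.array([30.0, 70.0])        # assumed known row marginals
col_totals = np.array([20.0, 50.0, 30.0])  # assumed known column marginals
n = row_totals.sum()                       # grand total (matches col_totals.sum())

p_row = row_totals / n
p_col = col_totals / n

# The entropy-maximizing joint distribution consistent with both marginals
# is the product of the marginals (the independence fill).
p_joint = np.outer(p_row, p_col)
imputed_counts = n * p_joint               # expected cell counts under that fill

print(imputed_counts)
print(-(p_joint * np.log(p_joint)).sum())  # entropy of the imputed table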


John Bailey

Mar 29, 2002, 7:51:29 AM

Maximum entropy is implied for any technique using Bayesian Inference,
nicht wahr?

I somewhat casually selected these by screening a search using the
keyword *imputation*

Missing Data, Censored Data, and Multiple Imputation
http://cm.bell-labs.com/cm/ms/departments/sia/project/mi/index.html

Bayesian Statistics
http://cm.bell-labs.com/cm/ms/departments/sia/project/bayes/index.html

Multiple Imputation
http://www.stat.ucla.edu/~mhu/impute.html

"Multiple Imputation for Missing Data: Concepts and New Development"
http://www.sas.com/rnd/app/papers/multipleimputation.pdf

Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys,
New York: John Wiley & Sons, Inc.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, New
York: Chapman and Hall

http://www.sas.com/rnd/app/da/new/pdf/dami.pdf

Multiple Imputation References
http://www.statsol.ie/solas/sorefer.htm
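
For anyone who wants to see the core idea behind those multiple-imputation
references in code, here is a toy sketch (my own, not taken from any of them):
impute each missing value several times from a simple predictive model fitted
to the observed values, analyse each completed data set, and pool with Rubin's
rules. The data and the plain normal imputation model are made-up assumptions;
a real application would also propagate parameter uncertainty.

import numpy as np

rng = np.random.default_rng(1)
y = np.array([2.1, 1.9, np.nan, 2.4, np.nan, 2.0, 2.2, np.nan])
obs = y[~np.isnan(y)]
m = 20                                          # number of imputations

estimates, variances = [], []
for _ in range(m):
    yi = y.copy()
    # impute by drawing from a normal fitted to the observed values
    yi[np.isnan(yi)] = rng.normal(obs.mean(), obs.std(ddof=1), np.isnan(y).sum())
    estimates.append(yi.mean())                 # per-imputation estimate of the mean
    variances.append(yi.var(ddof=1) / len(yi))  # its estimated sampling variance

qbar = np.mean(estimates)                       # pooled point estimate
w = np.mean(variances)                          # within-imputation variance
b = np.var(estimates, ddof=1)                   # between-imputation variance
total = w + (1 + 1 / m) * b                     # Rubin's combined variance

print(round(float(qbar), 3), round(float(total), 4))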

John

Robert Ehrlich

Mar 29, 2002, 10:14:36 AM
Perhaps something called the "Berg" method of maximum entropy might fill the
bill. It is used in signal processing.

Michael J Hardy

Mar 29, 2002, 3:27:04 PM
John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,
> nicht wahr?


No, I don't think so. Why would you say that? -- Mike Hardy

John Bailey

Mar 29, 2002, 6:18:08 PM
On 29 Mar 2002 20:27:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>
>> Maximum entropy is implied for any technique using Bayesian Inference,


>> nicht wahr?
>
>
> No, I don't think so. Why would you say that? -- Mike Hardy

If not for the "No, I don't think so," I would have thought you were
being sarcastic, implying that I had stated the obvious.

Let me say it a different way. Are there statistical techniques for
obtaining maximum entropy estimates which are not Bayesian?

Are they sufficiently well known as to be suitable in a general search
of the web for references?

John

Henry

Mar 29, 2002, 6:46:33 PM
On Fri, 29 Mar 2002 23:18:08 GMT, jmb...@frontiernet.net (John Bailey)
wrote:

>Let me say it a different way. Are there statistical techniques for
>obtaining maximum entropy estimates which are not Bayesian?
>
>Are they sufficiently well known as to be suitable in a general search
>of the web for references?

http://www-2.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
barely mentions Bayesian techniques, except for a throwaway line that
Bayesians use "fuzzy maximum entropy".

I don't fully follow how maximum entropy works, but might it be
possible to use maximum likelihood techniques in some cases?

John Bailey

Mar 29, 2002, 9:33:09 PM
On Fri, 29 Mar 2002 23:46:33 +0000 (UTC), se...@btinternet.com (Henry)
wrote:

from http://www.aas.org/publications/baas/v32n4/aas197/352.htm
A Maximum-Entropy Approach to Hypothesis Testing: An
Alternative to the p-Value Approach

P.A. Sturrock (Stanford University)
(quoting)
In problems of the Bernoulli type, an experiment or observation yields
a count of the number of occurrences of an event, and this count is
compared with what is to be expected on the basis of a specified and
unremarkable hypothesis. The goal is to determine whether the results
support the specified hypothesis, or whether they indicate that some
extraordinary process is at work. This evaluation is often based on
the ``p-value" test according to which one calculates, on the basis of
the specific hypothesis, the probability of obtaining the actual
result or a ``more extreme" result. Textbooks caution that the p-value
does not give the probability that the specific hypothesis is true,
and one recent textbook asserts ``Although that might be a more
interesting question to answer, there is no way to answer it."

The Bayesian approach does make it possible to answer this question.
As in any Bayesian analysis, it requires that we consider not just one
hypothesis but a complete set of hypotheses. This may be achieved very
simply by supplementing the specific hypothesis with the
maximum-entropy hypothesis that covers all other possibilities in
a way that is maximally non-committal. This procedure yields an
estimate of the probability that the specific hypothesis is true. This
estimate is found to be more conservative than that which one might
infer from the p-value test.
(end quote)
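
A rough numerical version of that comparison (mine, not Sturrock's; the counts
and the null value are made up): weigh a specific Bernoulli hypothesis p = p0
against a maximally non-committal alternative with p uniform on [0, 1], and set
the resulting posterior probability beside the usual two-sided p-value.

from scipy.stats import binom

n, k, p0 = 100, 60, 0.5            # trials, observed successes, null value

like_h0 = binom.pmf(k, n, p0)      # likelihood of the data under H0: p = p0
like_h1 = 1.0 / (n + 1)            # marginal likelihood under a uniform prior on p
                                   # (the integral of Binom(k; n, p) over p is 1/(n+1))

post_h0 = like_h0 / (like_h0 + like_h1)   # equal prior odds on the two hypotheses

# Two-sided p-value for comparison: probability of a result at least as extreme.
p_value = min(1.0, 2 * min(binom.cdf(k, n, p0), binom.sf(k - 1, n, p0)))

# With these numbers the p-value lands near 0.05 while P(H0 | data) stays
# around one half, illustrating the "more conservative" point in the abstract.
print(f"P(H0 | data) ~ {post_h0:.3f}, two-sided p-value ~ {p_value:.3f}")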

Michael J Hardy

Mar 30, 2002, 5:19:02 PM
John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,
> nit wahr?

I answered:

> No, I don't think so. Why would you say that? -- Mike Hardy

He replied:

> Except for the No, I don't think so, I would have thought you were
> being sarcastic, that I had stated the obvious.

I think you're very confused.

> Let me say it a different way. Are there statistical techniques
> for obtaining maximum entropy estimates which are not Bayesian?

Oh God. *First* you said Maximum entropy is "implied for" any
technique using Bayesian inference; now you seem to be saying the
exact converse of that --- the other way around. I seldom see anyone
write less clearly.

Bayesianism is the belief in, or use of, a degree-of-belief
interpretation of probability. Your *first* posting seemed to say
for any technique using Bayesian inference, i.e., using a degree-of-
belief interpretation of probability, "maximum entropy is implied."
In fact it's perfectly routine to do Bayesian inference without ever
thinking about entropy at all. Your *next* posting seems to say,
*not* that Bayesian inference implies maximum entropy, but the
reverse: that maximum entropy implies Bayesian inference. Which
did you mean? Or did you mean both? Why can't you be clear about
that?

To answer your question: algorithms for obtaining estimates do
not require degree-of-belief interpretations of probability; they
don't require frequency interpretations; they don't require any
interpretations at all. So they don't have to be Bayesian. And I
doubt that any of them are in any way Bayesian, even if they are
relied on in doing Bayesian inference.

Mike Hardy

Radford Neal

Mar 30, 2002, 5:24:15 PM
> John Bailey (jmb...@frontiernet.net) wrote:
>
>> Maximum entropy is implied for any technique using Bayesian Inference,
>> nicht wahr?

No. Bayesian inference and maximum entropy methods as originally
defined are in fact incompatible. This is hardly surprising - if
you're setting probabilities by maximum entropy, it would be a big
coincidence if they turned out to be the same as what one would get by
a completely different method.
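
To fix ideas, "setting probabilities by maximum entropy" in the original sense
means something like the following sketch (an illustration with the classic
constrained die, assumed here rather than taken from anywhere in this thread):
choose the distribution of largest entropy subject to an expectation
constraint. The solution is exponential in the constrained quantity, with the
multiplier found numerically.

import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5                       # the only constraint: E[face] = 4.5

def mean_given_lam(lam):
    w = np.exp(lam * faces)             # max-ent solution has the form p_i ~ exp(lam * i)
    p = w / w.sum()
    return p @ faces

lam = brentq(lambda l: mean_given_lam(l) - target_mean, -5.0, 5.0)
w = np.exp(lam * faces)
p = w / w.sum()

print(np.round(p, 4))                            # maximum-entropy probabilities for the six faces
print(round(float(p @ faces), 3))                # check: the mean constraint is met
print(round(float(-(p * np.log(p)).sum()), 3))   # the entropy attained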

You're probably confused by the tendency of many of the old maximum
entropy advocates to claim that they were Bayesians. This just shows
that words and reality don't necessarily match.

More recently, the maximum entropy folks have pretty much abandoned
the old version of maximum entropy in favour of Bayesian methods using
priors that are defined in terms of entropy functions. This is
incompatible with the old maximum entropy methods. These priors may
be useful now and then, but there's no reason to limit yourself to them.

Radford Neal

John Bailey

Mar 30, 2002, 8:30:15 PM
On 30 Mar 2002 22:19:02 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>> Let me say it a different way. Are there statistical techniques
>> for obtaining maximum entropy estimates which are not Bayesian?
> Oh God. *First* you said Maximum entropy is "implied for" any
>technique using Bayesian inference; now you seem to be saying the
>exact converse of that --- the other way around. I seldom see anyone
>write less clearly.

(that coming from someone whose first post was so obscure everyone
missed his point.)
(snipped)


> To answer your question: algorithms for obtaining estimates do
>not require degree-of-belief interpretations of probability; they
>don't require frequency interpretations; they don't require any
>interpretations at all. So they don't have to be Bayesian. And I
>doubt that any of them are in any way Bayesian, even if they are
>relied on in doing Bayesian inference.

I think the cause of the confusion is that my focus was the pragmatic
challenge of finding appropriate key words or phrases for an effective
web search and you are hung up on the religious aspects of Bayesian
vs frequentists theology.

Just in case anyone listening in has doubts about this aspect of a
perfectly innocent estimation technique I commend the following
presentation:
http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933
Too bad it's in PowerPoint but I recommend it anyway.
Here is an excerpt:
Bayesian/Frequentist results approach mathematical identity only if:
BPT uses priors with high degree of ignorance,
there are sufficient statistics, and
FPT distribution depends only on the sufficient statistic, and that it
is randomly distributed about the true value.
This convergence is seen as coincidental. (end quote)

John

Michael J Hardy

Mar 31, 2002, 3:18:04 PM
John Bailey (jmb...@frontiernet.net) wrote:

> I think the cause of the confusion is that my focus was the pragmatic
> challenge of finding appropriate key words or phrases for an effective
> web search and you are hung up on the religious aspects of Bayesian
> vs frequentists theology.
>
> Just in case anyone listening in has doubts about this aspect of a
> perfectly innocent estimation technique I commend the following


I don't think anyone in this thread questioned any aspects of
any estimation technique.

Look: Most practitioners of Bayesian inference probably do not
know what entropy is. That appears to contradict what you said in
your posting that I first answered. Can you dispute that?


> presentation:
> http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933
> Too bad it's in PowerPoint but I recommend it anyway.
> Here is an excerpt:
> Bayesian/Frequentist results approach mathematical identity only if:
> BPT uses priors with high degree of ignorance,
> there are sufficient statistics, and
> FPT distribution depends only on the sufficient statistic, and that it
> is randomly distributed about the true value.
> This convergence is seen as coincidental. (end quote)


I have no idea what kind of software would be needed to read this
document, so at this point it's entirely illegible to me. What do you
mean by "BPT" and "FTP"?

Mike Hardy

John Bailey

Mar 31, 2002, 5:32:51 PM
On 31 Mar 2002 20:18:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>
>> I think the cause of the confusion is that my focus was the pragmatic
>> challenge of finding appropriate key words or phrases for an effective
>> web search and you are hung up on the religious aspects of Bayesian
>> vs frequentists theology.

> Look: Most practitioners of Bayesian inference probably do not
>know what entropy is. That appears to contradict what you said in
>your posting that I first answered. Can you dispute that?
>

I will definitely dispute the first part. My first professional use
of Bayesian methodology was in 1960 using seminal work of C. K. Chow,
where it was indispensable for the final design of an Optical
Character Reader for RCA. My understanding of theory was updated in
the 80s by working with Myron Tribus, of Dartmouth fame and needing to
assimilate his use of maximum entropy methods as defined in his book
Rational Descriptions, Decisions and Designs. In that period we made
extensive use of Bayesian statistics in test design and interpretation
for high end Xerox reprographic machines. Ron Howard and Howard
Raiffa of Stanford were big guns who kept us on track in our
application of theory. I suppose there may be *practitioners of
Bayesian inference who are weak on the concept of entropy* but it is
clearly and unambiguously a part of the theory of its use.

Another worthwhile web reference I uncovered recently is:
http://xyz.lanl.gov/abs/hep-ph/9512295
Probability and Measurement Uncertainty in Physics - a Bayesian
Primer by G. D'Agostini (quoting from the abstract:)
The approach, although little known and usually misunderstood among
the High Energy Physics community, has become the standard way of
reasoning in several fields of research and has recently been adopted
by the international metrology organizations in their recommendations
for assessing measurement uncertainty. (end quote)

>>
http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933


> I have no idea what kind of software would be needed to read this
>document, so at this point it's entirely illegible to me. What do you

>mean by "BPT" and "FPT"?

The document was posted from Microsoft Office presentation software
called PowerPoint. It's unfortunate that the document is not available
in a more neutral format, but I am sending you a print version
rendered by processing his presentation through Adobe Acrobat, in PDF
format.

BPT is the author's shorthand for Bayesian Probability Theory and FPT
is shorthand for Frequentist Probability Theory.

John

Michael J Hardy

Apr 1, 2002, 1:21:18 PM
> > Look: Most practitioners of Bayesian inference probably do not
> >know what entropy is. That appears to contradict what you said in
> >your posting that I first answered. Can you dispute that?
>
>
> I will definitely dispute the first part. My first professional use
> of Bayesian methodology was in 1960 using seminal work of C. K. Chow,
> where it was indispensable for the final design of an Optical Character
> Reader for RCA. My understanding of theory was updated in the 80s by
> working with Myron Tribus, of Dartmouth fame and needing to assimilate
> his use of maximum entropy methods as defined in his book Rational
> Descriptions, Decisions and Designs. In that period we made extensive
> use of Bayesian statistics in test design and interpretation for high
> end Xerox reprographic machines. Ron Howard and Howard Raiffa of
> Stanford were big guns who kept us on track in our application of
> theory. I suppose there may be *practitioners of Bayesian inference
> who are weak on the concept of entropy* but it is clearly and
> unambiguously a part of the theory of its use.


I don't doubt that people you worked with are familiar with
entropy, nor that some people who do Bayesian inference use entropy,
but it is perfectly obvious that such familiarity is not needed in
order to do Bayesian inference. Why do you call it "clearly and
unambiguously a part of the theory of its use"?

Mike Hardy

Robert Ehrlich

Apr 1, 2002, 7:04:58 PM
Sorry. In a recent post on this subject I mentioned "Berg's maximum entropy
method". I was incorrect; it is "Burg's maximum entropy method". This makes a
difference, in that Berg is involved in the entropy / Bayes arguments but Burg
is not. Burg's insight concerns estimation of the amplitude of poorly
sampled low-frequency phenomena and is used a lot in signal processing. It
has turned out in practice to be reasonably useful and robust even though the
assumptions are merely "plausible" rather than proven to be necessary and
sufficient. I have not kept up with the evolution of Burg's insights over the
past decade and would appreciate some comments on where it has all led.
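
For anyone who wants to see the mechanics, here is a compact sketch of Burg's
recursion (written from the standard textbook description, so treat it as an
illustration rather than a reference implementation): it fits an autoregressive
model by minimizing the combined forward and backward prediction error order by
order, and the fitted coefficients define the maximum-entropy spectrum
sigma^2 / |1 + sum_k a_k exp(-i k w)|^2.

import numpy as np

def burg_ar(x, order):
    # Burg recursion: fit x[t] + a[1]*x[t-1] + ... + a[p]*x[t-p] = e[t]
    # by minimizing forward + backward prediction error at each order.
    x = np.asarray(x, dtype=float)
    f = x[1:].copy()            # forward prediction errors
    b = x[:-1].copy()           # backward prediction errors
    a = np.zeros(0)             # AR coefficients found so far
    e = np.dot(x, x) / len(x)   # prediction error power
    for _ in range(order):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))   # reflection coefficient
        a = np.concatenate((a + k * a[::-1], [k]))                # Levinson-style update
        f, b = (f + k * b)[1:], (b + k * f)[:-1]                  # errors at the next order
        e *= 1.0 - k * k
    return a, e

# Made-up test: an AR(2) process x[t] = 1.3 x[t-1] - 0.75 x[t-2] + noise,
# so the coefficients in the convention above should come out near [-1.3, 0.75].
rng = np.random.default_rng(0)
x = np.zeros(3000)
for t in range(2, len(x)):
    x[t] = 1.3 * x[t - 1] - 0.75 * x[t - 2] + rng.standard_normal()

a, e = burg_ar(x, order=2)
print(np.round(a, 3), round(float(e), 3))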

John Bailey

Apr 1, 2002, 7:14:35 PM
On 01 Apr 2002 18:21:18 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

>> > Look: Most practitioners of Bayesian inference probably do not
>> >know what entropy is. That appears to contradict what you said in
>> >your posting that I first answered. Can you dispute that?

In an earlier post, John Bailey's response to Hardy's statement was:


>> I will definitely dispute the first part.

>> I suppose there may be *practitioners of Bayesian inference
>> who are weak on the concept of entropy* but it is clearly and
>> unambiguously a part of the theory of its use.
>

Mike Hardy then replied:


> I don't doubt that people you worked with are familiar with
>entropy, nor that some people who do Bayesian inference use entropy,
>but it is perfectly obvious that such familiarity is not needed in
>order to do Bayesian inference. Why do you call it "clearly and
>unambiguously a part of the theory of its use"?

All of my exposures to Bayesian methodology have included a discussion
of how to determine a neutral Bayesian prior and of the use of maximum
entropy as a means to that end.

John


James Beck

Apr 1, 2002, 10:53:55 PM
mike:

since this thread seemed unusually aggressive and defensive, and since i am
a practitioner of bayesian inference who had never heard "entropy" associated
with that practice, i found it sufficiently interesting to do a little
checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.
it was, at least, a relief to know that i had not simply slept through a key
topic.

however, since that absence seemed strange--for a line of inquiry that could
be described by someone else as "clearly and unambiguously . . ."--i checked
a little more and found dozens of references to entropy in some of
my--regrettably ill-used--books on digital signals processing. entropy seems
particularly well-associated with optical signals compression,
decompression, reading, and reproduction specifically because there is a
high value assigned to maximum loss. for example, if i didn't have to
compress everything, i could potentially save a lot. there would be an
associated cost at decompression. that sounds like a field where one might
find some bayesians.

then i stopped to think, "none of my textbooks are called anything like
Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i
was just thinking in the wrong part of the box. unfortunately, the book is
out of print, and sells used at amazon for $176. (makes me wonder what the
original price was. maybe i'll buy it anyway. i don't know of many used
textbooks that appreciate in price.)

it's hard to be sure, but i suspect that if you think in terms of rational
decision making, you will realize that there was a lot of merit, albeit
sensitive to context, in the other position. you may also find that you are
the perfect person to write the next "bridge" text on the use of bayesian
inference in decision making.


Michael J Hardy <mjh...@mit.edu> wrote in message
news:3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu...

John Bailey

Apr 2, 2002, 8:58:43 AM
On Tue, 02 Apr 2002 03:53:55 GMT, "James Beck"
<james....@verizon.net> wrote:

>mike:
>
>since this thread seemed unusually aggressive and defensive, and since i am
>a practitioner of bayesian inference who had never heard "entropy" associated
>with that practice, i found it sufficiently interesting to do a little
>checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.
>it was, at least, a relief to know that i had not simply slept through a key
>topic.

It's chapter 11 of Jaynes' book.
http://omega.albany.edu:8008/ETJ-PS/cc11g.ps

>then i stopped to think, "none of my textbooks are called anything like
>Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i
>was just thinking in the wrong part of the box. unfortunately, the book is
>out of print, and sells used at amazon for $176. (makes me wonder what the
>original price was. maybe i'll buy it anyway. i don't know of many used
>textbooks that appreciate in price.)
>

The price of Tribus' text, published in 1969 (!) may be an indication
of how far ahead his thinking was or how little work went on in the
field until recently.

>it's hard to be sure, but i suspect that if you think in terms of rational
>decision making, you will realize that there was a lot of merit, albeit
>sensitive to context, in the other position. you may also find that you are
>the perfect person to write the next "bridge" text on the use of bayesian
>inference in decision making.

It does appear there is an information gap here. Information
arbitrage required?

Between Tribus' book (my copy of which I went to some lengths to
acquire after my first copy was borrowed and never returned), Ron
Howard's book (Dynamic Programming and Markov Processes) and Howard
Raiffa's book(Decision Analysis) it would be a lot of work to push
ahead into anything new. A quick review of
http://www-zeus.roma1.infn.it/~agostini/prob+stat.html including some
of his reprints at:
http://lanl.arXiv.org/find/physics/1/au:+DAgostini_G/0/1/0/all/0/1
suggests that D'Agostini might be a good author for such a book.

Finally, I need to credit Carlos Rodriguez <car...@math.albany.edu>
for his
Maximum Entropy Online Resources
http://omega.albany.edu:8008/maxent.html

John
http://www.frontiernet.net/~jmb184

Carlos C. Rodriguez

Apr 3, 2002, 10:17:53 AM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Mar30.1...@jarvis.cs.toronto.edu>...

Let me add some more heat, uncertainty, entropy and time to this
discussion...

I can easily envision myself wasting a google amount of time fighting
windmills over the meaning of probability and entropy... so I'll be
brief.
Please go ahead, make my day and click me!....
http://omega.albany.edu:8008/

I know that Radford is a wff (well-(in)formed-fellow): Just look at
his 93 review of MCMC (e.g. http://omega.albany.edu:8008/neal.pdf).
BUT I TOTALLY disagree with his last paragraph:

> More recently, the maximum entropy folks have pretty much abandoned
> the old version of maximum entropy in favour of Bayesian methods using
> priors that are defined in terms of entropy functions. This is
> incompatible with the old maximum entropy methods. These priors may
> be useful now and then, but there's no reason to limit yourself to them.
>
> Radford Neal

By ME folks, he means it literally. By "pretty much
abandoned...functions.", he means
http://omega.albany.edu:8008/0201016.pdf

This is NOT incompatible with the old maximum entropy methods,
(just take alpha LARGE and maximum a posteriori becomes maximum entropy
the old-fashioned way).
Entropic priors are not only Re-volutionary, they are E-volutionary!

By "These priors may... to them". He means,

I want to be free to continue using my convenience priors, so I will
continue ignoring the fact that entropic priors are maximally
non-committal with respect to missing information (thanks Ed!), but
just in case I'm missing something and
entropic priors are really as cool as you claim they are, I'll keep
them around.

As Jaynes discovered:
"First they'll say that it is wrong. Then they'll say that it is not
wrong but irrelevant. And finally they'll say that it is right and
useful but that they knew it a long time ago."
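
The "take alpha LARGE" remark above can be illustrated schematically (this is
a toy version of the general idea, not the entropic-prior construction of the
linked paper): maximize log-likelihood plus alpha times the entropy of a
multinomial parameter. With alpha = 0 the maximizer is the observed
frequencies; as alpha grows it slides toward the uniform, i.e. the
unconstrained maximum-entropy distribution.

import numpy as np
from scipy.optimize import minimize

counts = np.array([50.0, 30.0, 15.0, 5.0])      # made-up multinomial counts

def map_estimate(alpha):
    def neg_objective(z):                       # softmax keeps p on the simplex
        p = np.exp(z - z.max())
        p /= p.sum()
        loglik = np.sum(counts * np.log(p))
        entropy = -np.sum(p * np.log(p))
        return -(loglik + alpha * entropy)
    res = minimize(neg_objective, np.zeros(len(counts)), method="Nelder-Mead")
    p = np.exp(res.x - res.x.max())
    return p / p.sum()

for alpha in [0.0, 10.0, 1e4]:
    print(alpha, np.round(map_estimate(alpha), 3))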

Hiu Chung Law

Apr 3, 2002, 11:54:29 AM
There are several ways to design uninformed priors, and maximum entropy
prior is one of them. So is maximum entropy prior superior to all other
kinds of uninformed priors in all applications?

Actually I know very little about maximum entropy. I have only glanced
through one book on the maximum entropy method, and my former boss, whom I
regard as a Bayesian, never talked to me about maximum entropy.... and
I learned the Bayesian paradigm from him.

It would be nice if you can post some pointers (of tutorial type)
on ME method. Thank you.

Myron Tribus

Apr 3, 2002, 11:57:53 AM
"James A. Bowery" <jim_b...@hotmail.com> wrote in message news:<ua7oas4...@corp.supernews.com>...

The book, "Rational Descriptions, Decisions and Designs" describes how
the principle of maximum entropy should be used in connection with
Bayes' Equation for a variety of problems in several fields. This
book was originally published in 1969 by Pergamon Press. It went out
of print some time after. A couple of years ago Expira of Sweden
issued a reprint. Amazon indicates that a used version may be
purchased for $175. Expira sells the reprinted version for less than
$50. Write to Hakan Sodersved <in...@expira.se> for detailed
information.
Myron Tribus mtr...@earthlink.net
350 Britto Terrace, Fremont, CA 94539
Ph: (510) 651 3641 Fax: (510) 656 9875
The establishment always rejects new ideas for it is
composed of people who, having found some of the truth yesterday
believe they possess all of it today. (E. T. Jaynes)

Herman Rubin

Apr 3, 2002, 12:58:23 PM
In article <3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu>,

> Mike Hardy

I agree with Mike. I consider the use of maximum entropy
to be an attempt to remove the prior from consideration,
and as such, it is only good if the results it gives are
similar.

Like other such methods as "non-informative" priors, etc.,
it is really anti-Bayesian. That something uses a formal
measure as a prior probability for reasons other than that
that measure is the user's prior, or gives good results for
the user's prior and loss function, does not justify it as
being a reasonable procedure.

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Herman Rubin

Apr 3, 2002, 1:32:09 PM
In article <3ca8f7b5...@news.frontiernet.net>,

>John

Statistics is not methodology. Treating it as such causes
people to use totally inappropriate procedures.

The first thing is to state the problem, and stating a
mathematically convenient formulation can be worse than
useless. Bayesian reasoning requires that the USER be
the provider of the loss-prior combination. Now one
might want to use something simpler if it can be proved
to be reasonably good.

So we can use least squares without normality, as the
Gauss-Markov Theorem tells us that the results are just
about as good without normality as with. This is not
true for using mathematically convenient but inappropriate
priors. Also, it is not how well the prior is approximated,
but how well the solution is.

Bayesian priors should not be "neutral", unless it can
be shown that not much is lost by using such a prior.
Conjugate priors, "uninformative" priors, maximum entropy
priors, as such are unjustified computational copouts.

Radford Neal

Apr 3, 2002, 5:49:58 PM
In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>> More recently, the maximum entropy folks have pretty much abandoned
>> the old version of maximum entropy in favour of Bayesian methods using
>> priors that are defined in terms of entropy functions. This is
>> incompatible with the old maximum entropy methods. These priors may
>> be useful now and then, but there's no reason to limit yourself to them.
>>
>> Radford Neal

>By ME folks, he means it literally. By "pretty much
>abandoned...functions.", he means
>http://omega.albany.edu:8008/0201016.pdf

I've had a glance at this, though I can't say I've absorbed it all.
It does seem, however, that the example application to mixtures of
Gaussians produces rather strange results. According to equation
(78), the prior for the mean of a mixture component is more spread out
for rare components than for common components. Why would one want
this? Presumably, there's a problem somewhere where it's just the
right thing to do, but I don't think it's the right thing for most
problems. The argument that one should use this prior despite its
peculiar features because it is "maximally non-committal" in some sense
does not seem to me to be persuasive.

>This is NOT incompatible with the old maximum entropy methods,
>(just take alpha LARGE and maximum a posteriori becomes maximum entropy
>the old-fashioned way).

If I understand correctly, letting alpha go to infinity results in the
prior for the parameter being concentrated at a point. It was of
course always the case that if you found the maximum entropy
distribution and then specified your prior to be a point mass on this
distribution, then the methods were trivially "compatible". Once you
get into the details of how old "maximum entropy" methods actually
worked, however - such as how constraints on expectations were
obtained from sample means - it's clear that the way they produced a
result from the observed data is not compatible with the way a
Bayesian would produce a result by starting with a prior and
conditioning on observations.

Radford Neal

Carlos C. Rodriguez

Apr 4, 2002, 8:50:45 AM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr3.1...@jarvis.cs.toronto.edu>...

> In article <c54f89f.02040...@posting.google.com>,
> Carlos C. Rodriguez <car...@math.albany.edu> wrote:
>
> >> More recently, the maximum entropy folks have pretty much abandoned
> >> the old version of maximum entropy in favour of Bayesian methods using
> >> priors that are defined in terms of entropy functions. This is
> >> incompatible with the old maximum entropy methods. These priors may
> >> be useful now and then, but there's no reason to limit yourself to them.
> >>
> >> Radford Neal
>
> >By ME folks, he means it literally. By "pretty much
> >abandoned...functions.", he means
> >http://omega.albany.edu:8008/0201016.pdf
>
> I've had a glance at this, though I can't say I've absorbed it all.
> It does seem, however, that the example application to mixtures of
> Gaussians produces rather strange results. According to equation
> (78), the prior for the mean of a mixture component is more spread out
> for rare components than for common components. Why would one want
> this? Presumably, there's a problem somewhere where it's just the
> right thing to do, but I don't think it's the right thing for most
> problems. The argument that one should use this prior despite its
> peculiar features because it is "maximally non-committal" in some sense

> does not seem to me to be persuasive.
>

Radford, that's not a bug. That's a feature!
Unlike Microsoft, I can prove it (Theorem 1).
In fact whatever property this prior has, it is a product of your own
ignorance. I don't mean that pejoratively; I mean it logically. It is
that way because it is the prior most difficult to discriminate from a
model in which the data and the parameters are independent. It is as blind
to the data as it can possibly be. It lets the data values speak for
themselves as much as is mathematically possible.

When you say: "Presumably… for most problems" you are changing the
state of ignorance. If you realize that you do have more precise
information for the mean of the rare components in your particular
problem THEN you also realize that you FORGOT to include that
information either into h() or as a side condition. NOW, in the
absence of that information your best guess is to do as the entropic
prior says. That is a tautology yes. Maximum Entropy is tautological,
yes. But that again is not a bug. That's a feature not only of MaxEnt
but of mathematics in general.
By the way, what I am saying is not new. Ed Jaynes lost his voice
screaming at the windmills about it. Don't you agree, Myron?
I know it sounds like religion and snake oil, like getting something
(The prior) from nothing (Ignorance) and for that reason many,
otherwise fine minds, have rejected the whole thing as they reject
biblical fundamentalists. As a friend of mine says: If you don't see
it, I cannot explain it to you!

> >This is NOT incompatible with the old maximum entropy methods,
> >(just take alpha LARGE and maximum a posteriori becomes maximum entropy
> >the old-fashioned way).
>
> If I understand correctly, letting alpha go to infinity results in the
> prior for the parameter being concentrated at a point. It was of
> course always the case that if you found the maximum entropy
> distribution and then specified your prior to be a point mass on this

> distribution, then the methods were trivially "compatible". Once you


> get into the details of how old "maximum entropy" methods actually
> worked, however - such as how constraints on expectations were
> obtained from sample means - it's clear that the way they produced a
> result from the observed data is not compatible with the way a
> Bayesian would produce a result by starting with a prior and
> conditioning on observations.
>
> Radford Neal

Again an old confusion… there is even a "Theorem" by a student of
Isaac Levi, the philosopher from Columbia University.
Take alpha very large (not just infinite), or very few data, or no data;
then the posterior is still not a point but is completely dominated by
entropy, so maximum a posteriori equals maximum entropy.

Radford Neal

Apr 4, 2002, 2:57:55 PM
Radford Neal:

>> >http://omega.albany.edu:8008/0201016.pdf
>>
>> I've had a glance at this, though I can't say I've absorbed it all.
>> It does seem, however, that the example application to mixtures of
>> Gaussians produces rather strange results. According to equation
>> (78), the prior for the mean of a mixture component is more spread out
>> for rare components than for common components. Why would one want
>> this? Presumably, there's a problem somewhere where it's just the
>> right thing to do, but I don't think it's the right thing for most
>> problems. The argument that one should use this prior despite its
>> peculiar features because it is "maximally non-committal" in some sense
>> does not seem to me to be persuasive.
>>

Carlos C. Rodriguez <car...@math.albany.edu>:

>Radford, that's not a bug. That's a feature!
>Unlike Microsoft, I can prove it (Theorem 1).
>In fact whatever property this prior has, it is a product of your own
>ignorance. I don't mean that pejoratively I mean it logically. It is
>that way because it is most difficult to discriminate from the
>independent model between the data and the parameters. It is as blind
>of the data as it can possibly be. It lets the data values speak for
>themselves as much as it is mathematically possible.

Consider the problem in an example context: You are interested in
how far beetles travel during a day. With really advanced satellite
observation, you can track beetles flying around, but you can't
identify the species of beetle. You know there are five species of
beetle in a certain forest for which you have data. You therefore
model the distribution of distance travelled in a day as a mixture
of five normal distributions.

Suppose we don't know much about how common the different species are,
or how much the beetles travel in a day - the situation to which you
say your method applies.

The result of your method is a prior which says that the less common
beetles are likely to travel very far in a day, or not very far at
all, whereas the more common beetles are likely to travel a more
moderate distance. This seems to drastically depart from a prior that
embodies no precise information. It seems to correspond to a very
specific biological theory claiming that rare species have to either
travel a lot in a day (to avoid being set upon by gangs of competing
beatles?), or alternatively, to stay put. In no way can I accept that
this is a prior that will "let the data values speak for themselves".

>Again an old confusion. There is even a "Theorem" by a student of


>Isaac Levi, the philosopher from Columbia University.
>Take alpha very large (not just infinity) or very few data or no data
>then the posterior is still not a point but completely dominated by
>entropy so maximum a posteriori equals maximum entropy.

Maximum a posteriori estimation is not Bayesian.

Radford Neal

Michael J Hardy

Apr 4, 2002, 5:06:01 PM
Herman Rubin (hru...@odds.stat.purdue.edu) wrote:

> I agree with Mike. I consider the use of maximum entropy
> to be an attempt to remove the prior from consideration,
> and as such, it is only good if the results it gives are
> similar.
>
> Like other such methods as "non-informative" priors, etc.,
> it is really anti-Bayesian.


Actually, I don't think a non-informative prior is inappropriate
to a situation in which the person doing inference actually lacks
information. -- Mike Hardy

Carlos C. Rodriguez

Apr 4, 2002, 11:39:03 PM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr4.1...@jarvis.cs.toronto.edu>...

Nice example. Wrong interpretation.
First of all, you can't quarrel with a theorem. The entropic prior for
the parameters of the mixture i.e. for the means, sds and weights is
proven to be the most difficult to discriminate from an independent
model on the space (data,parameters). Thus, in the absence of all
other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY
THAT IT HAS TO HAVE in order to be the most ignorant about the data.
That's the beauty of mathematics. Once you accept the proof of Theorem
1 you are stuck with it. But that's not bad. That's the power of math.
Now you can go ahead and use the prior in 14 dimensional space without
having to worry about biasing the inferences with unjustified
assumptions. That's essentially the same reason why statistical
mechanics is so successful, as discovered a long time ago by our beloved
guru E.T. (phone home) Jaynes and still, after all these years, unable
to be understood even by so reputable a wiff (well-in-formed-fellow)
as yourself who by the way even presented the problem of estimation of
mixtures with an infinite number of components at one of the MaxEnt
workshops.

OK back to the specifics of your gedankenexperiment. All the prior is
saying is that, in the absence of all other information, the means of
the rare components should be considered more uncertain than the means
of the common components. You may not like that but you have to live
with it. It doesn't matter whether you or I or anyone likes it or not.
If you say, for example: "what the heck I feel intuitively that an
ignorant prior should assign equal uncertainties to all the means
independently of the weights". Then Theorem 1 will tell you that your
intuitive feeling is a superstition. By the way, uncommon components
are observed less often than common ones so more a priori uncertainty
for the mean sounds good to me, again in the absence of all other
information.

Radford Neal

Apr 5, 2002, 9:48:03 AM
>> Radford Neal:

>> Consider the problem in an example context: You are interested in
>> how far beetles travel during a day. With really advanced satellite
>> observation, you can track beetles flying around, but you can't
>> identify the species of beetle. You know there are five species of
>> beetle in a certain forest for which you have data. You therefore
>> model the distribution of distance travelled in a day as a mixture
>> of five normal distributions.
>>
>> Suppose we don't know much about how common the different species are,
>> or how much the beetles travel in a day - the situation to which you
>> say your method applies.
>>
>> The result of your method is a prior which says that the less common
>> beetles are likely to travel very far in a day, or not very far at
>> all, whereas the more common beetles are likely to travel a more
>> moderate distance. This seems to drastically depart from a prior that
>> embodies no precise information. It seems to correspond to a very
>> specific biological theory claiming that rare species have to either
>> travel a lot in a day (to avoid being set upon by gangs of competing
>> beatles?), or alternatively, to stay put. In no way can I accept that
>> this is a prior that will "let the data values speak for themselves".
>>

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>Nice example. Wrong interpretation.
>First of all, you can't quarrel with a theorem. The entropic prior for
>the parameters of the mixture i.e. for the means, sds and weights is
>proven to be the most difficult to discriminate from an independent
>model on the space (data,parameters). Thus, in the absence of all
>other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY
>THAT IT HAS TO HAVE in order to be the most ignorant about the data.
>That's the beauty of mathematics. Once you accept the proof of Theorem
>1 you are stuck with it.

Why should I want a prior that is "most difficult to discriminate from
an independent model"? Or one that is "most ignorant about the data"?
And assuming I did want these things, why should I accept that your
mathematical formulation of what it means to be "ignorant" is the
correct one? These are not mathematical questions which can be
settled by a proof.

>OK back to the specifics of your gedankenexperiment. All the prior is
>saying is that, in the absence of all other information, the means of
>the rare components should be considered more uncertain than the means
>of the common components. You may not like that but you have to live
>with it. It doesn't matter whether you or I or anyone likes it or not.
>If you say, for example: "what the heck I feel intuitively that an
>ignorant prior should assign equal uncertainties to all the means
>independently of the weights". Then Theorem 1 will tell you that your
>intuitive feeling is a superstition.

No, what Theorem 1 tells ME is that your concept of "ignorance" is
flawed. Mathematical formulations of such concepts have to be tested
by checking their consequences in situations where intuitions are
clear. (After all, how else could one verify that the formulation is
correct?) Your formulation fails this test in this example.

>By the way, uncommon components
>are observed less often than common ones so more a priori uncertainty
>for the mean sounds good to me, again in the absence of all other
>information.

That is a good reason why the POSTERIOR uncertainty in the means of
the rare components will be greater. Why you think that this natural
effect of not having much data should be increased by also increasing
the PRIOR uncertainty is a mystery to me.

Radford Neal

Carlos C. Rodriguez

Apr 5, 2002, 10:21:44 AM
Let me summarize the discussion. We have:
1) John Bailey: http://www.frontiernet.net/~jmb184/
2) Mike Hardy: http://www-math.mit.edu/~hardy/
3) Herman Rubin: http://www.stat.purdue.edu/people/hrubin/

Bailey: Entropy is an important concept in Bayesian Inference.
Hardy: Few people working in Bayesian Inference care about Entropy.
Rubin: The people that use entropy or whatever other so called
"neutral" priors are using unjustified computational copouts.

My position:
1) Hurray for Bailey!
2) Sure Mike but they should know better.
3) I disagree with Rubin's position with all the energy in my
reproductive system.

First of all, as far as it is known today, Entropy, Probability, and
(more recently discovered) Codes (as in binary codes) are pretty much
aspects of the same thing. At a fundamental level, Entropy is just the
log of the number of available distinguishable possibilities, so that
exp(-Entropy) = 1/N = the uniform probability over the space of
distinguishable states. Moreover, there is a one-to-one correspondence
between probability distributions and codes (or rather the code lengths of
prefix-free codes)
(e.g. see Grunwald's tutorial
http://quantrm2.psy.ohio-state.edu/injae/workshop.htm ). Thus, anyone
caring about the meaning and use of Probability theory (Bayesians
or members of the National Rifle Association alike) ought to care
about Entropy and Codes. (A small sketch of the code-length
correspondence appears below.)

Second. More than seventy (70) years of DeFinetti/Savage subjectivism
have produced ZIP beyond beautiful suntans from the coasts of Spain!

Third. Current action in fundamental statistical inference (aside from
computational issues) is about objective (or as objective as possible)
quantifications of prior information. Information geometry, MDL
principle, Entropic Priors, Bayesian Networks and Statistical Learning
Theory are pushing the envelope.
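
Here is that small sketch of the probability/code-length correspondence (a
made-up distribution, added purely for illustration): Shannon code lengths
ceil(-log2 p) satisfy Kraft's inequality, so a prefix code with those lengths
exists, and its average length sits within one bit of the entropy.

import math

p = [0.5, 0.25, 0.125, 0.125]                      # made-up source distribution
lengths = [math.ceil(-math.log2(q)) for q in p]    # Shannon code lengths
entropy = -sum(q * math.log2(q) for q in p)        # bits per symbol
kraft = sum(2.0 ** -l for l in lengths)            # <= 1 guarantees a prefix code exists
avg_len = sum(q * l for q, l in zip(p, lengths))

print(lengths, round(entropy, 3), kraft, avg_len)  # entropy <= avg_len < entropy + 1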

hru...@odds.stat.purdue.edu (Herman Rubin) wrote in message news:<a8fhr9$1q...@odds.stat.purdue.edu>...

Michael J Hardy

Apr 5, 2002, 12:20:38 PM
Radford Neal (rad...@cs.toronto.edu) wrote:

> Why should I want a prior that is "most difficult to discriminate from
> an independent model"? Or one that is "most ignorant about the data"?


Your prior needs to incorporate your ignorance if you are ignorant.
Tomorrow's weather and the outcome of a coin toss are _conditionally_
_independent_given_my_knowledge_ if I have no knowledge of any connection
between them.

Mike Hardy

Radford Neal

Apr 5, 2002, 2:14:50 PM
> Radford Neal (rad...@cs.toronto.edu) wrote:
>
>> Why should I want a prior that is "most difficult to discriminate from
>> an independent model"? Or one that is "most ignorant about the data"?

Michael J Hardy <mjh...@mit.edu> wrote:>
>
> Your prior needs to incorporate your ignorance if you are ignorant.

There's a logical gap between saying "this prior expresses ignorance
about the data" and "I'm ignorant, therefore I should use this prior".

The first statement implicitly assumes that there's only one possible
"state of ignorance". But it's not clear that real people can be
ignorant in only one way.

As evidence for this logical gap, one need only see that "objective"
Bayesians have come up with numerous priors that all supposedly
express ignorance. It's like the joke about standards for programming
languages - "If one standard is good, then three standards must be
even better!".

>Tomorrow's weather and the outcome of a coin toss are _conditionally_
>_independent_given_my_knowledge_ if I have no knowledge of any connection
>between them.

If you're SURE that there's no connection, then you're not ignorant at
all about the relationship (however ignorant you may be about
individual coin tosses and thunderstorms). If you're NOT sure that
there's no relationship, then the independence applies only to the
FIRST coin toss and thunderstorm. Once you are dealing with more than
one toss, you need to use a prior that expresses how likely the
various possible relationships are. This is related to the fallacy
behind Jaynes' contention that the laws of statistical mechanics can be
derived from the maximum entropy principle, without the need for any
input of physical information.

Radford Neal

Carlos C. Rodriguez

Apr 5, 2002, 3:00:55 PM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr5.0...@jarvis.cs.toronto.edu>...
> >> Radford Neal:

>
> Why should I want a prior that is "most difficult to discriminate from
> an independent model"? Or one that is "most ignorant about the data"?
> And assuming I did want these things, why should I accept that your
> mathematical formulation of what it means to be "ignorant" is the
> correct one? These are not mathematical questions which can be
> settled by a proof.

Recall: X and Y independent iff:
1) P(X|Y) = P(X)
and
2) P(Y|X) = P(Y)
provided both conditionals exist, or, more conveniently but less
enlighteningly,
X and Y independent iff

P(X and Y) = P(X) P(Y)

By "X is ignorant about Y" I mean X is independent of Y. PERIOD.
How much more ignorant of each other can X and Y be?
Are you suggesting changing the meaning of independence?

>
> >OK back to the specifics of your gedankenexperiment. All the prior is
> >saying is that, in the absence of all other information, the means of
> >the rare components should be considered more uncertain than the means
> >of the common components. You may not like that but you have to live
> >with it. It doesn't matter whether you or I or anyone likes it or not.
> >If you say, for example: "what the heck I feel intuitively that an
> >ignorant prior should assign equal uncertainties to all the means
> >independently of the weights". Then Theorem 1 will tell you that your
> >intuitive feeling is a superstition.
>
> No, what Theorem 1 tells ME is that your concept of "ignorance" is
> flawed. Mathematical formulations of such concepts have to be tested

There is no mysterious concept of "ignorance" anymore. It is JUST
INDEPENDENCE!
(see the above)

> >By the way, uncommon components
> >are observed less often than common ones so more a priori uncertainty
> >for the mean sounds good to me, again in the absence of all other
> >information.
>
> That is a good reason why the POSTERIOR uncertainty in the means of
> the rare components will be greater. Why you think that this natural
> effect of not having much data should be increased by also increasing
> the PRIOR uncertainty is a mystery to me.
>

BECAUSE: by assumption the only information assumed is the likelihood.
The ignorant prior is only consistent with the info explicitly
provided, in this case by the likelihood. The parameters for the
uncommon components need to be obviously more uncertain otherwise you
would be claiming a source of information other than the likelihood.
Think about it this way. If you assume that you can ONLY learn about
the beetles by observing them, then you can only know more about the
ones that you can observe more. Whatever prior information you are
going to provide about the rare species of beetles would have to have
come from past observations and by assumption these are more scarce
ergo more prior uncertainty is just compatible with that.

Radford Neal

Apr 5, 2002, 4:00:09 PM
In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>Recall: X and Y independent iff:
>1) P(X|Y) = P(X)
>and
>2) P(Y|X) = P(Y)
>provided both conditionals exist or more conveniently, but less
>enlightening,
>X and Y independent iff
>
> P(X and Y) = P(X) P(Y)
>
>By "X is ignorant about Y" I mean X is independent of Y. PERIOD.
>How much more ignorant of each other can X and Y be?
>Are you suggesting changing the meaning of independence?

No. I'm suggesting that "independence" and "ignorance" may not be the
same thing. For one thing, independence is a relationship between
random variables, whereas ignorance is a relationship between a person
and a situation (perhaps described by a set of random variables). So
your phrase "X is ignorant about Y", in which X is a random variable
really makes no sense.

>> >By the way, uncommon components
>> >are observed less often than common ones so more a priori uncertainty
>> >for the mean sounds good to me, again in the absence of all other
>> >information.
>>
>> That is a good reason why the POSTERIOR uncertainty in the means of
>> the rare components will be greater. Why you think that this natural
>> effect of not having much data should be increased by also increasing
>> the PRIOR uncertainty is a mystery to me.
>
>BECAUSE: by assumption the only information assumed is the likelihood.
>The ignorant prior is only consistent with the info explicitly
>provided, in this case by the likelihood. The parameters for the
>uncommon components need to be obviously more uncertain otherwise you
>would be claiming a source of information other than the likelihood.
>Think about it this way. If you assume that you can ONLY learn about
>the beetles by observing them, then you can only know more about the
>ones that you can observe more.

But you're claiming to know more about the more common beetles even
BEFORE you observe them, just because you're ANTICIPATING observing
them later on. This is irrational.

>Whatever prior information you are
>going to provide about the rare species of beetles would have to have
>come from past observations and by assumption these are more scarce
>ergo more prior uncertainty is just compatible with that.

Why can't I have prior information about beetles based on my general
knowledge of biology, rather than based on having run the EXACT same
experiment previously, as you seem to be assuming? Note that ALL
humans have quite a bit of general knowledge about biology (being
biological entities themselves).

Radford Neal

Herman Rubin

Apr 5, 2002, 4:14:39 PM
In article <3cacce49$0$3930$b45e...@senator-bedfellow.mit.edu>,

Michael J Hardy <mjh...@mit.edu> wrote:
> Herman Rubin (hru...@odds.stat.purdue.edu) wrote:

For one thing, does such a situation exist?

For another, what is non-informative?

In some cases, one can show that certain types of priors
give good approximations, and even that procedures which
do not compute posteriors can be good.

For example, in testing a point null against a finite
dimensional composite alternative, placing a point mass
at the null and a constant density on the alternative
yields robust results for moderately large samples, and
this works even if one has to use some asymptotic theory
for the test statistics allowed, such as requiring that
the Kolmogorov-Smirnov test be used at the Bayesian
level for it.
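
That recipe can be put in numbers with a toy sketch (my own, with made-up data
and an arbitrary bound on the alternative; it is only meant to show the
mechanics, not to reproduce the asymptotic claims above): a point mass at the
null value of a normal mean against a constant density on a bounded
alternative, compared through their marginal likelihoods.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=50)          # made-up data with a small true effect

def loglik(theta):
    return norm.logpdf(x, loc=theta, scale=1.0).sum()

m0 = np.exp(loglik(0.0))                   # marginal likelihood: point mass at the null
A = 5.0                                    # alternative: theta uniform on [-A, A]
m1, _ = quad(lambda t: np.exp(loglik(t)) / (2 * A), -A, A, points=[x.mean()])

bayes_factor = m0 / m1                     # evidence for the point null vs the alternative
post_null = m0 / (m0 + m1)                 # posterior with equal prior mass on each
print(round(float(bayes_factor), 4), round(float(post_null), 4))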

Herman Rubin

Apr 5, 2002, 4:25:04 PM
In article <2002Apr5.0...@jarvis.cs.toronto.edu>,
Radford Neal <rad...@cs.toronto.edu> wrote:
>>> Radford Neal:

..................

>Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>>Nice example. Wrong interpretation.
>>First of all, you can't quarrel with a theorem. The entropic prior for
>>the parameters of the mixture i.e. for the means, sds and weights is
>>proven to be the most difficult to discriminate from an independent
>>model on the space (data,parameters). Thus, in the absence of all
>>other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY
>>THAT IT HAS TO HAVE in order to be the most ignorant about the data.
>>That's the beauty of mathematics. Once you accept the proof of Theorem
>>1 you are stuck with it.

Why should anyone want to consider this as a criterion? It
is a pure mathematics criterion, and what does it have to
do with the problem of statistical inference?

>Why should I want a prior that is "most difficult to discriminate from
>an independent model"? Or one that is "most ignorant about the data"?
>And assuming I did want these things, why should I accept that your
>mathematical formulation of what it means to be "ignorant" is the
>correct one? These are not mathematical questions which can be
>settled by a proof.

The attempt to avoid input from the one with the statistical
problem violates several of my "Commandments". Here they
are, and this does show what one needs to consider. In
particular, Mr. Rodriguez is violating either #3 or #5.

It is religious ritual, rather than good statistics, to
let the prior or loss (and it is only their product which
matters in any case) come from anything other than consideration
of the problem. Now one might be able to show that using
maximum entropy does a good job of approximating the
results wanted, in which case #4 can be used to justify it.
But, without doing this, there is no justification for the
use of maximum entropy.


I am often requested to repost my five commandments. These are
posted here without exegesis.

For the client:

1. Thou shalt know that thou must make assumptions.

2. Thou shalt not believe thy assumptions.

For the consultant:

3. Thou shalt not make thy client's assumptions for him.

4. Thou shalt inform thy client of the consequences
of his assumptions.

For the person who is both (e. g., a biostatistician or psychometrician):

5. Thou shalt keep thy roles distinct, lest thou violate
some of the other commandments.

The consultant is obligated to point out how their assumptions affect
their views of their domain; this is in the 4-th commandment. But the
consultant should be very careful in the assumption-making process not
to intrude beyond possibly pointing out that certain assumptions make
large differences, while others do not. A good example here is regression
analysis, where often normality has little effect, but the linearity of
the model is of great importance. Thus, it is very important for the
client to have to justify transformations.

There are, unfortunately, many fields in which much of the activity
consists of using statistical procedures without regard for any assumptions.

Carlos C. Rodriguez

Apr 6, 2002, 2:05:39 PM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr5.1...@jarvis.cs.toronto.edu>...

> In article <c54f89f.02040...@posting.google.com>,
> Carlos C. Rodriguez <car...@math.albany.edu> wrote:
>
> >Recall: X and Y independent iff:
> >1) P(X|Y) = P(X)
> >and
> >2) P(Y|X) = P(Y)
> >provided both conditionals exist or more conveniently, but less
> >enlightening,
> >X and Y independent iff
> >
> > P(X and Y) = P(X) P(Y)
> >
> >By "X is ignorant about Y" I mean X is independent of Y. PERIOD.
> >How much more ignorant of each other can X and Y be?
> >Are you suggesting changing the meaning of independence?
>
> No. I'm suggesting that "independence" and "ignorance" may not be the
> same thing. For one thing, independence is a relationship between
> random variables, whereas ignorance is a relationship between a person
> and a situation (perhaps described by a set of random variables). So
> your phrase "X is ignorant about Y", in which X is a random variable
> really makes no sense.
>
This sounds like a desperate kick to me.
Just model "person" by another set of rvs that specify its state i.e.
the parameters of the likelihood.


> >> >By the way, uncommon components
> >> >are observed less often than common ones so more a priori uncertainty
> >> >for the mean sounds good to me, again in the absence of all other
> >> >information.
> >>
> >> That is a good reason why the POSTERIOR uncertainty in the means of
> >> the rare components will be greater. Why you think that this natural
> >> effect of not having much data should be increased by also increasing
> >> the PRIOR uncertainty is a mystery to me.
> >
> >BECAUSE: by assumption the only information assumed is the likelihood.
> >The ignorant prior is only consistent with the info explicitly
> >provided, in this case by the likelihood. The parameters for the
> >uncommon components need to be obviously more uncertain otherwise you
> >would be claiming a source of information other than the likelihood.
> >Think about it this way. If you assume that you can ONLY learn about
> >the beetles by observing them, then you can only know more about the
> >ones that you can observe more.
>
> But you're claiming to know more about the more common beetles even
> BEFORE you observe them, just because you're ANTICIPATING observing
> them later on. This is irrational.
>

Behind your bravado I sense that you are about to get the point.
Think about it this way: Suppose that you assume equal uncertainty
about the parameters of all the components. If all the components are
assumed identical then it makes sense, BUT if you assume that one
component is rarer than the others THEN the assumption of equal
uncertainty cannot be made WITHOUT claiming extra knowledge beyond
the likelihood. (See below for more…)

> >Whatever prior information you are
> >going to provide about the rare species of beetles would have to have
> >come from past observations and by assumption these are more scarce,
> >ergo more prior uncertainty is just compatible with that.
>
> Why can't I have prior information about beetles based on my general
> knowledge of biology, rather than based on having run the EXACT same
> experiment previously, as you seem to be assuming? Note that ALL
> humans have quite a bit of general knowledge about biology (being
> biological entities themselves).
>
> Radford Neal

This last paragraph of yours (above) clearly shows why we keep barking
up two quite different trees.
Why can't I have prior info…?
Sure. You can have all kinds of prior info about beetles. The more the
better.
But if you do YOU HAVE TO ADD THAT PRIOR INFO EXPLICITLY. Either
directly to the model, to the initial guess h(), or as a constraint to
the variational problem. Once you have explicitly accounted for all
the prior info that you claim you have, THEN you want to find the prior
distribution that uses that prior info AND NOTHING ELSE. That's as
honest and as objective as anyone can be.

Encore:
There is nothing wrong with using convenience priors, especially if you
are already getting useful answers with them.
(If it works… it's true! Isn't that the American way? Hmm, there are
"issues"…)
With convenience priors either the data swamps the prior assumptions
OR you hit gold by a bit of luck and clever design (always
useful).
But today we offer an alternative new way to our customers…. Encode
what you claim you know EXPLICITLY and then maximize honesty to get
THE ENTROPIC PRIOR, then sit back, relax and enjoy the show!
Only one problem: It may not be cheap! You may need to build a new
computer or just settle for a cheap approximation in some cases.
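
For anyone who wants to see the "encode what you know explicitly, then be
maximally noncommittal about everything else" recipe in the smallest possible
setting, here is a toy sketch (constructed for this thread, not anything from
the paper): the maximum-entropy distribution on the faces of a die when the
only stated information is its mean. The Gibbs/exponential form
p_i proportional to exp(lambda*i) is the textbook solution; the code just
solves for lambda numerically.

import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5            # the only piece of "prior information" we commit to

def mean_given_lambda(lam):
    # maxent solution under a mean constraint has the form p_i ~ exp(lam * i)
    w = np.exp(lam * faces)
    p = w / w.sum()
    return p @ faces

lam = brentq(lambda l: mean_given_lambda(l) - target_mean, -10.0, 10.0)
p = np.exp(lam * faces)
p /= p.sum()

print("lambda              :", round(lam, 4))
print("maxent probabilities:", np.round(p, 4))
print("check mean          :", round(float(p @ faces), 4))   # 4.5 by construction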

Radford Neal

unread,
Apr 6, 2002, 2:54:57 PM4/6/02
to
rad...@cs.toronto.edu (Radford Neal) wrote:

>> No. I'm suggesting that "independence" and "ignorance" may not be the
>> same thing. For one thing, independence is a relationship between
>> random variables, whereas ignorance is a relationship between a person
>> and a situation (perhaps described by a set of random variables). So
>> your phrase "X is ignorant about Y", in which X is a random variable
>> really makes no sense.
>>

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>This sounds like a desperate kick to me.
>Just model "person" by another set of rvs that specify its state i.e.
>the parameters of the likelihood.

This makes even less sense. I had thought that like almost all
Bayesians, you viewed probability as a representation of beliefs.
(Our disagreement, I thought, was over whether there is any such thing
as completely "objective" beliefs). So whose beliefs are being
modeled by this joint distribution over the random variables
describing the world and the random variables describing the person's
beliefs?

>Behind your bravado I sense that you are about to get the point.
>Think about it this way: Suppose that you assume equal uncertainty
>about the parameters of all the components. If all the components are
>assumed identical then it makes sense, BUT if you assume that one
>component is rarer than the others THEN the assumption of equal
>uncertainty cannot be made WITHOUT claiming extra knowledge beyond
>the likelihood. (See below for more…)

Once you realize that the species may differ in abundance, then you
might indeed wonder whether your prior beliefs about other
characteristics should depend on the abundance. You have to think
about it.  But it seems pretty bizarre to me to take the position
that believing these other characteristics vary in a rather peculiar
way with abundance is the DEFAULT, which you should adopt as your
belief if you haven't any reason not to.

>Sure. You can have all kinds of prior info about beetles. The more the
>better.
>But if you do YOU HAVE TO ADD THAT PRIOR INFO EXPLICITLY. Either
>directly to the model, to the initial guess h() or as a constraint to
>the variational problem. Once you have explicitly accounted for all
>the prior info that you claim you have THEN you want to find the prior
>distribution that uses that prior info AND NOTHING ELSE. That's as
>honest and as objective as anyone can be.

This sounds attractive. The problem is that it just doesn't work.
The attempts to formalize the idea of using the explicit information
"and nothing else" produce results that are neither unique nor
(in some cases) sensible.

Radford Neal

Web 2k

unread,
Apr 6, 2002, 8:47:04 PM4/6/02
to
On Fri, 29 Mar 2002 12:51:29 GMT, jmb...@frontiernet.net (John Bailey)
wrote:

>On Thu, 28 Mar 2002 19:33:24 -0800, "James A. Bowery"
><jim_b...@hotmail.com> wrote:
>
>>I'm interested in locating fundamental work in maximum entropy imputation
>>for simple data tables.

Given the original poster's question, does anyone have better
suggestions for references to fundamental work on maximum entropy
imputation than these?

>
>Missing Data, Censored Data, and Multiple Imputation
>http://cm.bell-labs.com/cm/ms/departments/sia/project/mi/index.html
>
>Bayesian Statistics
>http://cm.bell-labs.com/cm/ms/departments/sia/project/bayes/index.html
>
>Multiple Imputation
>http://www.stat.ucla.edu/~mhu/impute.html
>
>"Multiple Imputation for Missing Data: Concepts and New Development"
>http://www.sas.com/rnd/app/papers/multipleimputation.pdf
>
>Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys,
>New York: John Wiley & Sons, Inc.
>
>Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, New
>York: Chapman and Hall
>
>http://www.sas.com/rnd/app/da/new/pdf/dami.pdf
>
>Multiple Imputation References
>http://www.statsol.ie/solas/sorefer.htm
>
>John
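
Not a reference, but possibly useful to the original poster: here is a toy
sketch of what maximum-entropy imputation of a simple two-way table can look
like when all that is known are the row and column totals. If the margins are
the only constraints, the maximum-entropy fill (equivalently, the minimum
discrimination information fit to a uniform starting table) can be computed by
iterative proportional fitting; the margin numbers below are made up.

import numpy as np

# Known margins of an otherwise missing 3x2 count table (made-up numbers).
row_totals = np.array([30.0, 50.0, 20.0])
col_totals = np.array([60.0, 40.0])

# Start from a uniform table and rescale rows and columns in turn (IPF).
table = np.ones((3, 2))
for _ in range(100):
    table *= (row_totals / table.sum(axis=1))[:, None]
    table *= (col_totals / table.sum(axis=0))[None, :]

print(np.round(table, 3))
print("row sums:", table.sum(axis=1))   # matches row_totals
print("col sums:", table.sum(axis=0))   # matches col_totals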


Carlos C. Rodriguez

unread,
Apr 7, 2002, 10:28:33 PM4/7/02
to
I think we have reached the point of diminishing returns.
We have barked at each other our points several times and we may
finally agree to disagree.

To make a last attempt at convincing you (and other present and future
watchers out there) of the importance of Theorem 1 in
http://omega.albany.edu:8008/0201016.pdf,
let's clean the blackboard and summarize.
============================================
General Fact1:
Among all possible distributions on the parameters of a regular
parametric model, the one that is most difficult to discriminate from
an independent model on (parameters,data) is the Entropic Prior.

Specific Fact2:
The Entropic Prior for the parameters of a Gaussian mixture turns out
to be very similar to the popular conjugate prior except that the
uncertainty on the parameters of each component depends on the weight
assigned to that component. The smaller the weight the larger the
uncertainty.

==============================================

Your (RN) position:
Forget Fact1. I find your Specific Fact2 counter-intuitive and even
irrational. Ergo, we can happily forget about the Entropic Prior
business.

My (CR) position:
Whatever intuition you may have about what a most ignorant prior about
the data should look like, if your intuition doesn't agree with the
Facts above then, provided the Facts above are correct, we can happily
forget about your intuition.

==============================================

Your Argument:
1) Why should anyone care about the prior that YOU say is the most
difficult to discriminate from an independent model, meaning something
that looks to me like a cooked-up manipulation of symbols to get what
you want?
2) And look, your Specific Fact2 is clearly crazy, for I can find lots
of real-life examples where it appears to encode prior information
that doesn't exist or, even worse, that runs contrary to what we know
for that problem.
Here is an example: Suppose we want to study how far the
population of beetles from the north of Sri Lanka is able to travel.
Suppose that we know that the beetles from Sri Lanka are of one of two
kinds: one populous species and one rare species. We naturally model
the observed data of traveled distances as a two-component mixture of
Gaussians. Your entropic prior will assign A PRIORI more uncertainty to
the average distance traveled by the rare species. That's SILLY; I
could have all kinds of bio info against that!


My Argument:
1) The math behind Fact1 is standard and (subtle but once understood)
trivial.
The Kullback number between two probability measures P and Q, denoted
I(P:Q)
(*** where for us:
P = f(data|params)*p(params) (i.e. the likelihood times the prior) and
Q = h(data)*g(params) (i.e. an independent model, with some arbitrary but
fixed density h() for the data and the local uniform g() on the (manifold of)
parameters),
both defined on the same measurable space (data, parameters)
***)
is the universally accepted information-theoretic-probabilistic
measure of how easy it is to discriminate Q from P. It is nothing but
the mean information for discrimination in favor of P and against Q
when sampling from P. Look at the first chapter of Kullback's book or
ask your gurus or search the net or whatever. Just in case you still
have issues with I(P:Q) let me remind anyone watching that a simple
monotone increasing function of I(P:Q) is an upper bound to the total
variation distance between P and Q (Bretagnolle-Huber inequality).
TRANSLATION: If I(P:Q) is small (close to 0) then P and Q are close in
total variation i.e. close in the most natural way for probability
measures.
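
For anyone who wants to poke at that TRANSLATION numerically, here is a small
sketch with two made-up discrete distributions; the form of the
Bretagnolle-Huber bound used below, TV(P,Q) <= sqrt(1 - exp(-I(P:Q))), is one
standard way of writing it.

import numpy as np

P = np.array([0.45, 0.35, 0.20])
Q = np.array([0.40, 0.40, 0.20])

kl = np.sum(P * np.log(P / Q))       # the Kullback number I(P:Q)
tv = 0.5 * np.sum(np.abs(P - Q))     # total variation distance
bound = np.sqrt(1.0 - np.exp(-kl))   # Bretagnolle-Huber upper bound on TV

print(f"I(P:Q)  = {kl:.5f}")
print(f"TV(P,Q) = {tv:.5f}")
print(f"bound   = {bound:.5f}   (TV <= bound, and both -> 0 as I(P:Q) -> 0)")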

FACT1 (again):
The proper prior p(params) that minimizes I(P:Q), when data consists
of alpha independent observations of the model, is the Entropic Prior
with parameters h and alpha. It is only natural to call this prior
most ignorant about the data since Q is an independent (product) model
"h(data)*g(params)" where params are statistically independent of the
data.
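
If the variational argument behind FACT1 has been followed correctly (this is
one reading of it, not a quote from the paper), the minimizer takes the Gibbs
form p(params) proportional to g(params)*exp(-alpha*I(params:h)), where
I(params:h) is the Kullback number between the sampling distribution at those
parameter values and the reference h. Here is a sketch for a Bernoulli
likelihood with h a fair coin, on a discretized parameter grid (the grid, the
choice of h and the alphas are all invented for the illustration); it just
shows the entropic prior tightening around the h-typical parameter values as
alpha grows.

import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the Bernoulli parameter
h = 0.5                                  # reference density on the data: a fair coin

# Kullback number between Bernoulli(theta) and the reference Bernoulli(h)
I = theta * np.log(theta / h) + (1 - theta) * np.log((1 - theta) / (1 - h))

for alpha in (1, 10, 100):
    w = np.exp(-alpha * I)               # entropic prior; uniform g() absorbed in the constant
    p = w / w.sum()                      # normalized weights on the grid
    mean = (p * theta).sum()
    sd = np.sqrt((p * (theta - mean) ** 2).sum())
    print(f"alpha={alpha:4d}: prior mean = {mean:.3f}, prior sd = {sd:.3f}")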

There is nothing fishy or unnatural about Fact1. The true power of
Fact1 comes from its generality. It holds for ANY regular hypothesis
space in any number of dimensions. Even in infinite dimensional (i.e.
for stochastic processes) hypothesis spaces… but there is no room in
the electronic margins of this note to show you the proof… (ok I am
pushing it a little…).

*** I remind whoever is listening that once you allow Fact1 to get
"jeegee" (As in Austin Powers "Get jeegee with it") with your mind,
you become pregnant and there is no need to bother with answering (2).
Your baby-to-be will give you the answer! For those virtuous minds
still out there, here is a way:

2) The only prior information that we assume we have about the
beetles is the one in the likelihood and the parameters of the
entropic prior (h and alpha). NOTHING ELSE. If there is extra prior
info, biological or whatever, that info must be EXPLICITLY included in
the problem: either in the likelihood, in h and alpha, or as a constraint for
the minimization of I(P:Q). Only after including ALL the info that we
want to consider, only after that, do we maximize honesty and take the
most ignorant prior consistent with what we know. Fact2, as it is
presented here, applies only to that state of ignorance.

When all we assume we know is the likelihood, Fact2 is not only sane
but obvious. Of course the parameters of the rare components of the
mixture are A PRIORI more uncertain. There is always less info coming
from there and we know that A PRIORI even BEFORE we collect any data.
Another way to state this is:
THE ONLY way to be able to assume equal uncertainty for all the
components regardless of their mixture weights is to ASSUME a source
of information OTHER than the likelihood. Q.E.D.
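
A tiny simulation of that "less info coming from there" point (this only
illustrates the sample-size argument, not the entropic prior itself; the
weights, component means and sample size are invented):

import numpy as np

rng = np.random.default_rng(1)
n, w_rare = 10_000, 0.05

# Two-component Gaussian mixture: common N(0,1) with weight 0.95, rare N(3,1) with weight 0.05.
is_rare = rng.random(n) < w_rare
x = np.where(is_rare, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

for name, mask in (("common", ~is_rare), ("rare", is_rare)):
    m = int(mask.sum())
    se = x[mask].std(ddof=1) / np.sqrt(m)    # standard error ~ sigma / sqrt(n * weight)
    print(f"{name:6s}: n = {m:5d}, mean = {x[mask].mean():6.3f}, std. error = {se:.3f}")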

Extra bonus: The above argument opens the gates of uncertainty to all
the MCMC simulations based on the standard conjugate prior for
mixtures of Gaussians.

P.S.
I am willing to spend another google amount of time because I do find
you one of the coolest MCMC guys around. Your exposition of the hybrid
Monte Carlo method in http://omega.albany.edu:8008/neal.pdf was an
eye-opener for me, and I believe there is still a diamond mine to be
discovered along those directions. Now that you know that I know that
I think you are so cool, let me tell you that you are nevertheless
human. But that's OK. Isn't it?
(It would be great if you go to Moscow, Id. this summer for
MaxEnt2002)

Radford Neal

unread,
Apr 8, 2002, 12:16:53 AM4/8/02
to
In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>I think we have reached the point of diminishing returns.
>We have barked at each other our points several times and we may
>finally agree to disagree.

I think you're right. Your summary of the two positions is reasonably
accurate.  When you get to arguing that yours is the correct position,
I of course disagree, and I could explain why I think so - but I've
already explained in previous posts, so we should probably let readers
of this thread (assuming there still are any) ponder the matter on
their own.

Radford Neal

Herman Rubin

unread,
Apr 12, 2002, 9:17:27 AM4/12/02
to
In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:
>I think we have reached the point of diminishing returns.
>We have barked at each other our points several times and we may
>finally agree to disagree.

>To make a last attempt at convincing you (and other present and future
>watchers out there) of the importance of Theorem 1 in
>http://omega.albany.edu:8008/0201016.pdf,
>let's clean the blackboard and summarize.

>General Fact1:

>Among all possible distributions on the parameters of a regular
>parametric model, the one that is most difficult to discriminate from
>an independent model on (parameters,data) is the Entropic Prior.

This is from the standpoint of Wiener-Shannon information,
not that of statistical inference.

>Specific Fact2:
>The Entropic Prior for the parameters of a Gaussian mixture turns out
>to be very similar to the popular conjugate prior except that the
>uncertainty on the parameters of each component depends on the weight
>assigned to that component. The smaller the weight the larger the
>uncertainty.

I am strongly opposed to the anti-Bayesian use of the
conjugate prior, preferring instead to look at robustness
of the procedure. If estimating a normal mean with the
prior being not too concentrated, a normal prior is the
one I would be least likely to use, as it is far too
sensitive.  If the prior is concentrated, it does not
make much difference if it is normal or not, as long as
it has a small variance; if it does not have a small
variance, robustness is very difficult.

In fact, if one assumes a normal prior, the Bayes risk
is at most doubled if one replaces a prior whose variance
is at least the data variance by one with infinite variance,
and a prior whose variance is at most the data variance by
a one-point distribution.
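
A quick closed-form check of that factor-of-two claim in the simplest case
(one observation x ~ N(theta, sigma^2) with prior theta ~ N(0, tau^2); the
"substitute risk" below is the risk, still evaluated under the normal prior,
of the estimator one would use under the replacement prior: x itself for the
infinite-variance prior, the prior mean 0 for the one-point prior; the
variances tried are arbitrary):

sigma2 = 1.0                       # variance of the single observation x ~ N(theta, sigma2)

for tau2 in (0.25, 1.0, 4.0):      # prior variance of theta ~ N(0, tau2)
    bayes = sigma2 * tau2 / (sigma2 + tau2)   # Bayes risk of the posterior mean
    flat = sigma2                  # risk of using x itself ("infinite variance" prior)
    point = tau2                   # risk of using 0, the prior mean (one-point prior)
    substitute = flat if tau2 >= sigma2 else point
    print(f"tau2 = {tau2:4.2f}: Bayes risk = {bayes:.3f}, "
          f"substitute risk = {substitute:.3f}, ratio = {substitute / bayes:.3f}")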

The prior should come from the user's assumptions, not
from mathematical convenience. One can use robustness
theorems to approximate procedures, but it is the effect
on the risk, not the closeness of the prior, which is
the relevant consideration, and these are quite different.
In testing a point or local hypothesis, the prior
probability of the hypothesis is often totally irrelevant
if there is any data.
