Maximum Entropy Imputation


James A. Bowery

Mar 28, 2002, 10:33:24 PM
I'm interested in locating fundamental work in maximum entropy imputation
for simple data tables.

I've done a Google search for papers on imputation via maximum entropy but
found virtually nothing except a paper from Russia which seems to be
concerned primarily with longitudinal statistics or process data.
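
As a concrete illustration of one reading of the question (my own sketch, not a pointer to the literature being asked for): reconstruct the cells of a simple two-way table from known margins by maximizing entropy subject to those margins. With only row and column totals constrained, the maximum entropy completion is just the product of the normalized margins.

import numpy as np

row_totals = np.array([30.0, 70.0])           # known row margins
col_totals = np.array([20.0, 50.0, 30.0])     # known column margins
n = row_totals.sum()                          # equals col_totals.sum() here
table = np.outer(row_totals, col_totals) / n  # maxent completion of the cells
print(table)                                  # reproduces both sets of margins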


John Bailey

Mar 29, 2002, 7:51:29 AM

Maximum entropy is implied for any technique using Bayesian Inference,
nicht wahr?

I somewhat casually selected these by screening a search using the
keyword *imputation*

Missing Data, Censored Data, and Multiple Imputation
http://cm.bell-labs.com/cm/ms/departments/sia/project/mi/index.html

Bayesian Statistics
http://cm.bell-labs.com/cm/ms/departments/sia/project/bayes/index.html

Multiple Imputation
http://www.stat.ucla.edu/~mhu/impute.html

"Multiple Imputation for Missing Data: Concepts and New Development"
http://www.sas.com/rnd/app/papers/multipleimputation.pdf

Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys,
New York: John Wiley & Sons, Inc.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, New
York: Chapman and Hall

http://www.sas.com/rnd/app/da/new/pdf/dami.pdf

Multiple Imputation References
http://www.statsol.ie/solas/sorefer.htm
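
As a minimal sketch of the combining step those multiple-imputation references formalize (Rubin 1987; Schafer 1997) -- the imputation step itself, i.e. actually drawing the missing values, is not shown here:

import numpy as np

def rubin_combine(estimates, variances):
    # Rubin's rules: pool m completed-data estimates and their variances.
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                 # pooled point estimate
    u_bar = variances.mean()                 # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    return q_bar, u_bar + (1.0 + 1.0 / m) * b

# e.g. five imputed-data estimates of a mean and their squared std. errors
print(rubin_combine([10.1, 9.8, 10.4, 10.0, 9.9],
                    [0.25, 0.27, 0.24, 0.26, 0.25]))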

John

Robert Ehrlich

Mar 29, 2002, 10:14:36 AM
Perhaps something called the "Berg" method of maximum entropy might fill the
bill. It is used in signal processing.

Michael J Hardy

Mar 29, 2002, 3:27:04 PM
John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,
> nicht wahr?


No, I don't think so. Why would you say that? -- Mike Hardy

John Bailey

Mar 29, 2002, 6:18:08 PM
On 29 Mar 2002 20:27:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>
>> Maximum entropy is implied for any technique using Bayesian Inference,


>> nicht wahr?
>
>
> No, I don't think so. Why would you say that? -- Mike Hardy

Except for the "No, I don't think so," I would have thought you were
being sarcastic -- that I had stated the obvious.

Let me say it a different way. Are there statistical techniques for
obtaining maximum entropy estimates which are not Bayesian?

Are they sufficiently well known as to be suitable in a general search
of the web for references?

John

Henry

Mar 29, 2002, 6:46:33 PM
On Fri, 29 Mar 2002 23:18:08 GMT, jmb...@frontiernet.net (John Bailey)
wrote:

>Let me say it a different way. Are there statistical techniques for
>obtaining maximum entropy estimates which are not Bayesian?
>
>Are they sufficiently well known as to be suitable in a general search
>of the web for references?

http://www-2.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
barely mentions Bayesian techniques, except for a throwaway line that
Bayesians use "fuzzy maximum entropy".

I don't fully follow how maximum entropy works, but might it be
possible to use maximum likelihood techniques in some cases?
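
As a minimal sketch of the non-Bayesian side of this (my own code, not from the tutorial above): classical maximum entropy just maximizes entropy subject to moment constraints, with no prior and no degree-of-belief interpretation anywhere. The toy example below solves Jaynes' dice problem; when such constraints come from empirical feature means, the answer coincides with the maximum likelihood fit in the corresponding exponential family, which is the duality that maxent tutorials develop.

import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5                      # the single moment constraint

def excess_mean(lam):
    w = np.exp(lam * faces)            # maxent solution has exponential form
    p = w / w.sum()
    return p @ faces - target_mean

lam = brentq(excess_mean, -10.0, 10.0) # solve the constraint for lambda
p = np.exp(lam * faces); p /= p.sum()
print(np.round(p, 4))                  # tilted toward the high faces
print(-(p * np.log(p)).sum())          # its entropy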

John Bailey

Mar 29, 2002, 9:33:09 PM
On Fri, 29 Mar 2002 23:46:33 +0000 (UTC), se...@btinternet.com (Henry)
wrote:

from http://www.aas.org/publications/baas/v32n4/aas197/352.htm
A Maximum-Entropy Approach to Hypothesis Testing: An
Alternative to the p-Value Approach

P.A. Sturrock (Stanford University)
(quoting)
In problems of the Bernoulli type, an experiment or observation yields
a count of the number of occurrences of an event, and this count is
compared with what is to be expected on the basis of a specified and
unremarkable hypothesis. The goal is to determine whether the results
support the specified hypothesis, or whether they indicate that some
extraordinary process is at work. This evaluation is often based on
the ``p-value" test according to which one calculates, on the basis of
the specific hypothesis, the probability of obtaining the actual
result or a ``more extreme" result. Textbooks caution that the p-value
does not give the probability that the specific hypothesis is true,
and one recent textbook asserts ``Although that might be a more
interesting question to answer, there is no way to answer it."

The Bayesian approach does make it possible to answer this question.
As in any Bayesian analysis, it requires that we consider not just one
hypothesis but a complete set of hypotheses. This may be achieved very
simply by supplementing the specific hypothesis with the
maximum-entropy hypothesis that covers all other possibilities in
a way that is maximally non-committal. This procedure yields an
estimate of the probability that the specific hypothesis is true. This
estimate is found to be more conservative than that which one might
infer from the p-value test.
(end quote)
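
A hedged sketch of the kind of calculation Sturrock describes (my reading; his construction may differ in detail): compare the specific Bernoulli hypothesis H0: p = p0 with a maximally non-committal alternative, here taken to be a uniform prior on p, and report the posterior probability of H0 under equal prior odds.

from scipy.stats import binom

n, k, p0 = 100, 60, 0.5               # made-up Bernoulli counts
m0 = binom.pmf(k, n, p0)              # probability of the data under H0
m1 = 1.0 / (n + 1)                    # marginal likelihood under the uniform
                                      # (max-entropy on [0,1]) alternative
print(m0 / (m0 + m1))                 # P(H0 | data), equal prior odds: ~0.52
print(binom.sf(k - 1, n, p0))         # one-sided p-value: ~0.03 -- so the
                                      # Bayesian answer is the conservative one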

Michael J Hardy

Mar 30, 2002, 5:19:02 PM
John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,
> nicht wahr?

I answered:

> No, I don't think so. Why would you say that? -- Mike Hardy

He replied:

> Except for the No, I don't think so, I would have thought you were
> being sarcastic, that I had stated the obvious.

I think you're very confused.

> Let me say it a different way. Are there statistical techniques
> for obtaining maximum entropy estimates which are not Bayesian?

Oh God. *First* you said Maximum entropy is "implied for" any
technique using Bayesian inference; now you seem to be saying the
exact converse of that --- the other way around. I seldom see anyone
write less clearly.

Bayesianism is the belief in, or use of, a degree-of-belief
interpretation of probability. Your *first* posting seemed to say
for any technique using Bayesian inference, i.e., using a degree-of-
belief interpretation of probability, "maximum entropy is implied."
In fact it's perfectly routine to do Bayesian inference without ever
thinking about entropy at all. Your *next* posting seems to say,
*not* that Bayesian inference implies maximum entropy, but the
reverse: that maximum entropy implies Bayesian inference. Which
did you mean? Or did you mean both? Why can't you be clear about
that?

To answer your question: algorithms for obtaining estimates do
not require degree-of-belief interpretations of probability; they
don't require frequency interpretations; they don't require any
interpretations at all. So they don't have to be Bayesian. And I
doubt that any of them are in any way Bayesian, even if they are
relied on in doing Bayesian inference.

Mike Hardy

Radford Neal

Mar 30, 2002, 5:24:15 PM
> John Bailey (jmb...@frontiernet.net) wrote:
>
>> Maximum entropy is implied for any technique using Bayesian Inference,
>> nicht wahr?

No. Bayesian inference and maximum entropy methods as originally
defined are in fact incompatible. This is hardly surprising - if
you're setting probabilities by maximum entropy, it would be a big
coincidence if they turned out to be the same as what one would get by
a completely different method.

You're probably confused by the tendency of many of the old maximum
entropy advocates to claim that they were Bayesians. This just shows
that words and reality don't necessarily match.

More recently, the maximum entropy folks have pretty much abandoned
the old version of maximum entropy in favour of Bayesian methods using
priors that are defined in terms of entropy functions. This is
incompatible with the old maximum entropy methods. These priors may
be useful now and then, but there's no reason to limit yourself to them.

Radford Neal

John Bailey

Mar 30, 2002, 8:30:15 PM
On 30 Mar 2002 22:19:02 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>> Let me say it a different way. Are there statistical techniques
>> for obtaining maximum entropy estimates which are not Bayesian?
> Oh God. *First* you said Maximum entropy is "implied for" any
>technique using Bayesian inference; now you seem to be saying the
>exact converse of that --- the other way around. I seldom see anyone
>write less clearly.

(that coming from someone whose first post was so obscure everyone
missed his point.)
(snipped)


> To answer your question: algorithms for obtaining estimates do
>not require degree-of-belief interpretations of probability; they
>don't require frequency interpretations; they don't require any
>interpretations at all. So they don't have to be Bayesian. And I
>doubt that any of them are in any way Bayesian, even if they are
>relied on in doing Bayesian inference.

I think the cause of the confusion is that my focus was the pragmatic
challenge of finding appropriate key words or phrases for an effective
web search and you are hung up on the religious aspects of Bayesian
vs. frequentist theology.

Just in case anyone listening in has doubts about this aspect of a
perfectly innocent estimation technique I commend the following
presentation:
http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933
Too bad it's in PowerPoint, but I recommend it anyway.
Here is an excerpt:
Bayesian/Frequentist results approach mathematical identity only if:
- BPT uses priors with high degree of ignorance,
- there are sufficient statistics, and
- FPT distribution depends only on the sufficient statistic, and that it
  is randomly distributed about the true value.
This convergence is seen as coincidental. (end quote)
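
A quick numerical check of the kind of convergence the excerpt describes, with made-up numbers: for a binomial count (a sufficient statistic) and a flat ("ignorant") prior, the Bayesian posterior mean and the frequentist estimate approach each other as n grows.

for n, k in [(10, 7), (100, 70), (10000, 7000)]:
    mle = k / n                        # frequentist point estimate
    post_mean = (k + 1) / (n + 2)      # posterior mean under a uniform prior
    print(n, mle, round(post_mean, 5))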

John

Michael J Hardy

Mar 31, 2002, 3:18:04 PM
John Bailey (jmb...@frontiernet.net) wrote:

> I think the cause of the confusion is that my focus was the pragmatic
> challenge of finding appropriate key words or phrases for an effective
> web search and you are hung up on the religious aspects of Bayesian
> vs frequentists theology.
>
> Just in case anyone listening in has doubts about this aspect of a
> perfectly innocent estimation technique I commend the following


I don't think anyone in this thread questioned any aspects of
any estimation technique.

Look: Most practitioners of Bayesian inference probably do not
know what entropy is. That appears to contradict what you said in
your posting that I first answered. Can you dispute that?


> presentation:
> http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933
> Too bad its in Powerpoint but I recommend it anyway.
> Here is a excerpt:
> Bayesian/Frequentist results approach mathematical identity only if:
> BPT uses priors with high degree of ignorance,
> there are sufficient statistics, and
> FPT distribution depends only on the sufficient statistic, and that it
> is randomly distributed about the true value.
> This convergence is seen as coincidental. (end quote)


I have no idea what kind of software would be needed to read this
document, so at this point it's entirely illegible to me. What do you
mean by "BPT" and "FTP"?

Mike Hardy

John Bailey

Mar 31, 2002, 5:32:51 PM
On 31 Mar 2002 20:18:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:
>
>> I think the cause of the confusion is that my focus was the pragmatic
>> challenge of finding appropriate key words or phrases for an effective
>> web search and you are hung up on the religious aspects of Bayesian
>> vs frequentists theology.

> Look: Most practitioners of Bayesian inference probably do not
>know what entropy is. That appears to contradict what you said in
>your posting that I first answered. Can you dispute that?
>

I will definitely dispute the first part. My first professional use
of Bayesian methodology was in 1960 using seminal work of C. K. Chow,
where it was indispensable for the final design of an Optical
Character Reader for RCA. My understanding of theory was updated in
the 80s by working with Myron Tribus, of Dartmouth fame and needing to
assimilate his use of maximum entropy methods as defined in his book
Rational Descriptions, Decisions and Designs. In that period we made
extensive use of Bayesian statistics in test design and interpretation
for high end Xerox reprographic machines. Ron Howard and Howard
Raiffa of Stanford were big guns who kept us on track in our
application of theory. I suppose there may be *practitioners of
Bayesian inference who are weak on the concept of entropy* but it is
clearly and unambiguously a part of the theory of its use.

Another worthwhile web reference I uncovered recently is:
http://xyz.lanl.gov/abs/hep-ph/9512295
Probability and Measurement Uncertainty in Physics - a Bayesian
Primer by G. D'Agostini (quoting from the abstract:)
The approach, although little known and usually misunderstood among
the High Energy Physics community, has become the standard way of
reasoning in several fields of research and has recently been adopted
by the international metrology organizations in their recommendations
for assessing measurement uncertainty. (end quote)

>>
http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933


> I have no idea what kind of software would be needed to read this
>document, so at this point it's entirely illegible to me. What do you

>mean by "BPT" and "FPT"?

The document was posted from Microsoft Office presentation software
called PowerPoint. It's unfortunate that the document is not available
in a more neutral format, but I am sending you a print version
rendered by processing his presentation through Adobe Acrobat into
PDF format.

BPT is the author's shorthand for Bayesian Probability Theory and FPT
is shorthand for Frequentist Probability Theory.

John

Michael J Hardy

Apr 1, 2002, 1:21:18 PM
> > Look: Most practitioners of Bayesian inference probably do not
> >know what entropy is. That appears to contradict what you said in
> >your posting that I first answered. Can you dispute that?
>
>
> I will definitely dispute the first part. My first professional use
> of Bayesian methodology was in 1960 using seminal work of C. K. Chow,
> where it was indispensible for the final design of an Optical Character
> Reader for RCA. My understanding of theory was updated in the 80s by
> working with Myron Tribus, of Dartmouth fame and needing to assimilate
> his use of maximum entropy methods as defined in his book Rational
> Descriptions, Decisions and Designs. In that period we made extensive
> use of Bayesian statistics in test design and interpretation for high
> end Xerox reprographic machines. Ron Howard and Howard Raiffa of
> Stanford were big guns who kept us on track in our application of
> theory. I suppose there may be *practitioners of Bayesian inference
> who are weak on the concept of entropy* but it is clearly and
> unambiguously a part of the theory of its use.


I don't doubt that people you worked with are familiar with
entropy, nor that some people who do Bayesian inference use entropy,
but it is perfectly obvious that such familiarity is not needed in
order to do Bayesian inference. Why do you call it "clearly and
unambiguously a part of the theory of its use"?

Mike Hardy

Robert Ehrlich

Apr 1, 2002, 7:04:58 PM
Sorry. In a recent post on this subject I mentioned "Berg's maximum entropy
method". I was incorrect; it is "Burg's maximum entropy method". This makes a
difference in that Berg is involved in the entropy / Bayes arguments but Burg
is not. Burg's insight concerns estimation of the amplitude of poorly
sampled low frequency phenomena and is used a lot in signal processing. It
has turned out in practice to be reasonably useful and robust even though the
assumptions are merely "plausible" rather than proven to be necessary and
sufficient. I have not kept up with the evolution of Burg's insights over the
past decade and would appreciate some comments on where it has all led.
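
For anyone who wants to see what Burg's method actually does, here is a textbook-style sketch (my own rendition, not production code): it fits an autoregressive model by choosing each reflection coefficient to minimize the summed forward and backward prediction-error power, and the fitted AR model gives the maximum entropy spectral estimate.

import numpy as np

def burg(x, order):
    # Burg's (maximum entropy) AR estimator.  Model convention:
    #   x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order] = e[n]
    # Returns (a, E) with E the final prediction-error power; the AR
    # spectrum E / |1 + sum_k a[k] exp(-2*pi*1j*f*k)|^2 is the maximum
    # entropy spectral estimate.
    x = np.asarray(x, dtype=float)
    f, b = x[1:].copy(), x[:-1].copy()           # forward / backward errors
    a = np.zeros(0)
    E = np.dot(x, x) / len(x)
    for _ in range(order):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a = np.concatenate((a + k * a[::-1], [k]))   # Levinson update
        E *= 1.0 - k * k
        f, b = f[1:] + k * b[1:], b[:-1] + k * f[:-1]
    return a, E

# quick self-check on a synthetic AR(2) process
rng = np.random.default_rng(0)
x = np.zeros(2000)
for n in range(2, len(x)):
    x[n] = 1.5 * x[n - 1] - 0.8 * x[n - 2] + rng.standard_normal()
print(burg(x, 2)[0])          # should be close to [-1.5, 0.8]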

John Bailey

Apr 1, 2002, 7:14:35 PM
On 01 Apr 2002 18:21:18 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

>> > Look: Most practitioners of Bayesian inference probably do not
>> >know what entropy is. That appears to contradict what you said in
>> >your posting that I first answered. Can you dispute that?

In an earlier post, John Bailey's response to Hardy's statement was:


>> I will definitely dispute the first part.

>> I suppose there may be *practitioners of Bayesian inference
>> who are weak on the concept of entropy* but it is clearly and
>> unambiguously a part of the theory of its use.
>

Mike Hardy then replied:


> I don't doubt that people you worked with are familiar with
>entropy, nor that some people who do Bayesian inference use entropy,
>but it is perfectly obvious that such familiarity is not needed in
>order to do Bayesian inference. Why do you call it "clearly and
>unambiguously a part of the theory of its use"?

In my exposures to Bayesian methodology all have included a discussion
of how to determine a neutral Bayesian prior and the use of maximum
entropy as a means to that end.
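
For concreteness, a standard example of that construction (my own numerical check, not taken from John's sources): among densities on [0, infinity) constrained to have a given mean, the maximum entropy choice -- hence the "neutral" prior under that single constraint -- is the exponential with that mean.

from scipy.stats import expon, gamma

mu = 2.0                                 # the one assumed constraint: the mean
print("exponential:", expon(scale=mu).entropy())
for shape in (0.5, 2.0, 5.0):            # other candidates with the same mean
    print("gamma(%.1f):" % shape, gamma(shape, scale=mu / shape).entropy())
# the exponential has the largest differential entropy of these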

John


James Beck

Apr 1, 2002, 10:53:55 PM
mike:

since this thread seemed unusually aggressive and defensive, and since i am
a practioner of bayesian inference who had never heard "entropy" associated
with that practice, i found it sufficiently interesting to do a little
checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.
it was, at least, a relief to know that i had not simply slept through a key
topic.

however, since that absence seemed strange--for a line of inquiry that could
be described by someone else as "clearly and unambiguously . . ."--i checked
a little more and found dozens of references to entropy in some of
my--regrettably ill-used--books on digital signal processing. entropy seems
particularly well-associated with optical signals compression,
decompression, reading, and reproduction specifically because there is a
high value assigned to maximum loss. for example, if i didn't have to
compress everything, i could potentially save a lot. there would be an
associated cost at decompression. that sounds like a field where one might
find some bayesians.

then i stopped to think, "none of my textbooks are called anything like
Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i
was just thinking in the wrong part of the box. unfortunately, the book is
out of print, and sells used at amazon for $176. (makes me wonder what the
original price was. maybe i'll buy it anyway. i don't know of many used
textbooks that appreciate in price.)

it's hard to be sure, but i suspect that if you think in terms of rational
decision making, you will realize that there was a lot of merit, albeit
sensitive to context, in the other position. you may also find that you are
the perfect person to write the next "bridge" text on the use of bayesian
inference in decision making.


Michael J Hardy <mjh...@mit.edu> wrote in message
news:3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu...

John Bailey

Apr 2, 2002, 8:58:43 AM
On Tue, 02 Apr 2002 03:53:55 GMT, "James Beck"
<james....@verizon.net> wrote:

>mike:
>
>since this thread seemed unusually aggressive and defensive, and since i am
>a practioner of bayesian inference who had never heard "entropy" associated
>with that practice, i found it sufficiently interesting to do a little
>checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.
>it was, at least, a relief to know that i had not simply slept through a key
>topic.

It's chapter 11 of Jaynes' book.
http://omega.albany.edu:8008/ETJ-PS/cc11g.ps

>then i stopped to think, "none of my textbooks are called anything like
>Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i
>was just thinking in the wrong part of the box. unfortunately, the book is
>out of print, and sells used at amazon for $176. (makes me wonder what the
>original price was. maybe i'll buy it anyway. i don't know of many used
>textbooks that appreciate in price.)
>

The price of Tribus' text, published in 1969 (!) may be an indication
of how far ahead his thinking was or how little work went on in the
field until recently.

>it's hard to be sure, but i suspect that if you think in terms of rational
>decision making, you will realize that there was a lot of merit, albeit
>sensitive to context, in the other position. you may also find that you are
>the perfect person to write the next "bridge" text on the use of bayesian
>inference in decision making.

It does appear there is an information gap here. Information
arbitrage required?

Between Tribus' book (my copy of which I went to some lengths to
acquire after my first copy was borrowed and never returned), Ron
Howard's book (Dynamic Programming and Markov Processes) and Howard
Raiffa's book (Decision Analysis) it would be a lot of work to push
ahead into anything new. A quick review of
http://www-zeus.roma1.infn.it/~agostini/prob+stat.html including some
of his reprints at:
http://lanl.arXiv.org/find/physics/1/au:+DAgostini_G/0/1/0/all/0/1
suggests that D'Agostini might be a good author for such a book.

Finally, I need to credit Carlos Rodriguez <car...@math.albany.edu>
for his
Maximum Entropy Online Resources
http://omega.albany.edu:8008/maxent.html

John
http://www.frontiernet.net/~jmb184

Carlos C. Rodriguez

Apr 3, 2002, 10:17:53 AM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Mar30.1...@jarvis.cs.toronto.edu>...

Let me add some more heat, uncertainty, entropy and time to this
discussion...

I can easily envision myself wasting a google amount of time fighting
windmills over the meaning of probability and entropy... so I'll be
brief.
Please go ahead, make my day and click me!....
http://omega.albany.edu:8008/

I know that Radford is a wff (well-(in)formed-fellow): Just look at
his 93 review of MCMC (e.g. http://omega.albany.edu:8008/neal.pdf).
BUT I TOTALLY disagree with his last paragraph:

> More recently, the maximum entropy folks have pretty much abandoned
> the old version of maximum entropy in favour of Bayesian methods using
> priors that are defined in terms of entropy functions. This is
> incompatible with the old maximum entropy methods. These priors may
> be useful now and then, but there's no reason to limit yourself to them.
>
> Radford Neal

By ME folks, he means it literally. By "pretty much
abandoned...functions.", he means
http://omega.albany.edu:8008/0201016.pdf

This is NOT incompatible with the old maximum entropy methods,
(just take alpha LARGE and maximum a posteriori becomes maximum entropy
the old-fashioned way).
Entropic priors are not only Re-volutionary, they are E-volutionary!

By "These priors may... to them". He means,

I want to be free to continue using my convenience priors, so I will
continue ignoring the fact that entropic priors are maximally
non-committal with respect to missing information (thanks, Ed!); but
just in case I'm missing something, and entropic priors are really as
cool as you claim they are, I'll keep them around.

As Jaynes discovered:
"First they'll say that it is wrong. Then they'll say that it is not
wrong but irrelevant. And finally they'll say that it is right and
useful, but that they knew it a long time ago."

Hiu Chung Law

Apr 3, 2002, 11:54:29 AM
There are several ways to design uninformative priors, and the maximum
entropy prior is one of them. So is the maximum entropy prior superior to
all other kinds of uninformative priors in all applications?

Actually I know very little about maximum entropy. I have only glanced
through one book on the maximum entropy method, and my former boss, whom I
regard as a Bayesian, never talked to me about maximum entropy.... and
I learned the Bayesian paradigm from him.

It would be nice if you could post some pointers (of tutorial type)
on the ME method. Thank you.

Myron Tribus

Apr 3, 2002, 11:57:53 AM
"James A. Bowery" <jim_b...@hotmail.com> wrote in message news:<ua7oas4...@corp.supernews.com>...

The book, "Rational Descriptions, Decisions and Designs" describes how
the principle of maximum entropy should be used in connection with
Bayes' Equation for a variety of problems in several fields. This
book was originally published in 1969 by Pergamon Press. It went out
of print some time after. A couple of years ago Expira of Sweden
issued a reprint. Amazon indicates that a used version may be
purchased for $175. Expira sells the reprinted version for less than
$50. Write to Hakan Sodersved <in...@expira.se> for detailed
information.
Myron Tribus mtr...@earthlink.net
350 Britto Terrace, Fremont, CA 94539
Ph: (510) 651 3641 Fax: (510) 656 9875
The establishment always rejects new ideas for it is
composed of people who, having found some of the truth yesterday
believe they possess all of it today. (E. T. Jaynes)

Herman Rubin

Apr 3, 2002, 12:58:23 PM
In article <3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu>,

> Mike Hardy

I agree with Mike. I consider the use of maximum entropy
to be an attempt to remove the prior from consideration,
and as such, it is only good if the results it gives are
similar to those the user's actual prior would give.

Like other such methods as "non-informative" priors, etc.,
it is really anti-Bayesian. That something uses a formal
measure as a prior probability for reasons other than that
that measure is the user's prior, or gives good results for
the user's prior and loss function, does not justify it as
being a reasonable procedure.

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Herman Rubin

Apr 3, 2002, 1:32:09 PM
In article <3ca8f7b5...@news.frontiernet.net>,

>John

Statistics is not methodology. Treating it as such causes
people to use totally inappropriate procedures.

The first thing is to state the problem, and stating a
mathematically convenient formulation can be worse than
useless. Bayesian reasoning requires that the USER be
the provider of the loss-prior combination. Now one
might want to use something simpler if it can be proved
to be reasonably good.

So we can use least squares without normality, as the
Gauss-Markov Theorem tells us that the results are just
about as good without normality as with. This is not
true for using mathematically convenient but inappropriate
priors. Also, it is not how well the prior is approximated,
but how well the solution is.

Bayesian priors should not be "neutral", unless it can
be shown that not much is lost by using such a prior.
Conjugate priors, "uninformative" priors, maximum entropy
priors, as such are unjustified computational copouts.

Radford Neal

Apr 3, 2002, 5:49:58 PM
In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>> More recently, the maximum entropy folks have pretty much abandoned
>> the old version of maximum entropy in favour of Bayesian methods using
>> priors that are defined in terms of entropy functions. This is
>> incompatible with the old maximum entropy methods. These priors may
>> be useful now and then, but there's no reason to limit yourself to them.
>>
>> Radford Neal

>By ME folks, he means it literally. By "pretty much
>abandoned...functions.", he means
>http://omega.albany.edu:8008/0201016.pdf

I've had a glance at this, though I can't say I've absorbed it all.
It does seem, however, that the example application to mixtures of
Gaussians produces rather strange results. According to equation
(78), the prior for the mean of a mixture component is more spread out
for rare components than for common components. Why would one want
this? Presumably, there's a problem somewhere where it's just the
right thing to do, but I don't think it's the right thing for most
problems. The argument that one should use this prior despite its
peculiar features because it is "maximally non-committal" in some sense
does not seem to me to be persuasive.

>This is NOT incompatible with the old maximum entropy methods,
>(just take alpha LARGE and maximum aposteriori becomes maximum entropy
>the old fashion way).

If I understand correctly, letting alpha go to infinity results in the
prior for the parameter being concentrated at a point. It was of
course always the case that if you found the maximum entropy
distribution and then specified your prior to be a point mass on this
distribution, then the methods were trivially "compatible". Once you
get into the details of how old "maximum entropy" methods actually
worked, however - such as how constraints on expectations were
obtained from sample means - it's clear that the way they produced a
result from the observed data is not compatible with the way a
Bayesian would produce a result by starting with a prior and
conditioning on observations.

Radford Neal

Carlos C. Rodriguez

Apr 4, 2002, 8:50:45 AM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr3.1...@jarvis.cs.toronto.edu>...

> In article <c54f89f.02040...@posting.google.com>,
> Carlos C. Rodriguez <car...@math.albany.edu> wrote:
>
> >> More recently, the maximum entropy folks have pretty much abandoned
> >> the old version of maximum entropy in favour of Bayesian methods using
> >> priors that are defined in terms of entropy functions. This is
> >> incompatible with the old maximum entropy methods. These priors may
> >> be useful now and then, but there's no reason to limit yourself to them.
> >>
> >> Radford Neal
>
> >By ME folks, he means it literally. By "pretty much
> >abandoned...functions.", he means
> >http://omega.albany.edu:8008/0201016.pdf
>
> I've had a glance at this, though I can't say I've absorbed it all.
> It does seem, however, that the example application to mixtures of
> Gaussians produces rather strange results. According to equation
> (78), the prior for the mean of a mixture component is more spread out
> for rare components than for common components. Why would one want
> this? Presumably, there's a problem somewhere where it's just the
> right thing to do, but I don't think it's the right thing for most
> problems. The argument that one should use this prior despite its
> peculiar features because it is "maximally non-committal" in some sense

> does not seem to me to be persuasive.
>

Radford, that's not a bug. That's a feature!
Unlike Microsoft, I can prove it (Theorem 1).
In fact whatever property this prior has, it is a product of your own
ignorance. I don't mean that pejoratively I mean it logically. It is
that way because it is most difficult to discriminate from the
independent model between the data and the parameters. It is as blind
of the data as it can possibly be. It lets the data values speak for
themselves as much as it is mathematically possible.

When you say: "Presumably… for most problems" you are changing the
state of ignorance. If you realize that you do have more precise
information for the mean of the rare components in your particular
problem THEN you also realize that you FORGOT to include that
information either into h() or as a side condition. NOW, in the
absence of that information your best guess is to do as the entropic
prior says. That is a tautology yes. Maximum Entropy is tautological,
yes. But that again is not a bug. That's a feature not only of MaxEnt
but of mathematics in general.
By the way what I am saying is not new. Ed Jaynes lost his voice
screaming at the windmills about it. Don't you agree Myron?
I know it sounds like religion and snake oil, like getting something
(The prior) from nothing (Ignorance) and for that reason many,
otherwise fine minds, have rejected the whole thing as they reject
biblical fundamentalists. As a friend of mine says: If you don't see
it I can not explain it to you!

> >This is NOT incompatible with the old maximum entropy methods,
> >(just take alpha LARGE and maximum aposteriori becomes maximum entropy
> >the old fashion way).
>
> If I understand correctly, letting alpha go to infinity results in the
> prior for the parameter being concentrated at a point. It was of
> course always the case that if you found the maximum entropy
> distribution and then specified your prior to be a point mass on this

> distribution, then the methods were trivially "compatible". Once you


> get into the details of how old "maximum entropy" methods actually
> worked, however - such as how constraints on expectations were
> obtained from sample means - it's clear that the way they produced a
> result from the observed data is not compatible with the way a
> Bayesian would produce a result by starting with a prior and
> conditioning on observations.
>
> Radford Neal

Again an old confusion… there is even a "Theorem" by a student of
Isaac Levi, the philosopher from Columbia University.
Take alpha very large (not just infinite), or very few data, or no
data: the posterior is then still not a point but is completely
dominated by entropy, so maximum a posteriori equals maximum entropy.

Radford Neal

Apr 4, 2002, 2:57:55 PM
Radford Neal:

>> >http://omega.albany.edu:8008/0201016.pdf
>>
>> I've had a glance at this, though I can't say I've absorbed it all.
>> It does seem, however, that the example application to mixtures of
>> Gaussians produces rather strange results. According to equation
>> (78), the prior for the mean of a mixture component is more spread out
>> for rare components than for common components. Why would one want
>> this? Presumably, there's a problem somewhere where it's just the
>> right thing to do, but I don't think it's the right thing for most
>> problems. The argument that one should use this prior despite its
>> peculiar features because it is "maximally non-committal" in some sense
>> does not seem to me to be persuasive.
>>

Carlos C. Rodriguez <car...@math.albany.edu>:

>Radford, that's not a bug. That's a feature!
>Unlike Microsoft, I can prove it (Theorem 1).
>In fact whatever property this prior has, it is a product of your own
>ignorance. I don't mean that pejoratively I mean it logically. It is
>that way because it is most difficult to discriminate from the
>independent model between the data and the parameters. It is as blind
>of the data as it can possibly be. It lets the data values speak for
>themselves as much as it is mathematically possible.

Consider the problem in an example context: You are interested in
how far beetles travel during a day. With really advanced satellite
observation, you can track beetles flying around, but you can't
identify the species of beetle. You know there are five species of
beetle in a certain forest for which you have data. You therefore
model the distribution of distance travelled in a day as a mixture
of five normal distributions.

Suppose we don't know much about how common the different species are,
or how much the beetles travel in a day - the situation to which you
say your method applies.

The result of your method is a prior which says that the less common
beetles are likely to travel very far in a day, or not very far at
all, whereas the more common beetles are likely to travel a more
moderate distance. This seems to drastically depart from a prior that
embodies no precise information. It seems to correspond to a very
specific biological theory claiming that rare species have to either
travel a lot in a day (to avoid being set upon by gangs of competing
beatles?), or alternatively, to stay put. In no way can I accept that
this is a prior that will "let the data values speak for themselves".

>Again an old confusion. There is even a "Theorem" by a student of


>Isaac Levi, the philosopher from Columbia University.
>Take alpha very large (not just infinity) or very few data or no data
>then the posterior is still not a point but completely dominated by
>entropy so maximum a posteriori equals maximum entropy.

Maximum a posteriori estimation is not Bayesian.

Radford Neal

Michael J Hardy

Apr 4, 2002, 5:06:01 PM
Herman Rubin (hru...@odds.stat.purdue.edu) wrote:

> I agree with Mike. I consider the use of maximum entropy
> to be an attempt to remove the prior from consideration,
> and as such, it is only good if the results it gives are
> similar.
>
> Like other such methods as "non-informative" priors, etc.,
> it is really anti-Bayesian.


Actually, I don't think a non-informative prior is inappropriate
to a situation in which the person doing inference actually lacks
information. -- Mike Hardy

Carlos C. Rodriguez

Apr 4, 2002, 11:39:03 PM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr4.1...@jarvis.cs.toronto.edu>...

Nice example. Wrong interpretation.
First of all, you can't quarrel with a theorem. The entropic prior for
the parameters of the mixture i.e. for the means, sds and weights is
proven to be the most difficult to discriminate from an independent
model on the space (data,parameters). Thus, in the absence of all
other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY
THAT IT HAS TO HAVE in order to be the most ignorant about the data.
That's the beauty of mathematics. Once you accept the proof of Theorem
1 you are stuck with it. But that's not bad. That's the power of math.
Now you can go ahead and use the prior in 14 dimensional space without
having to worry about biasing the inferences with unjustified
assumptions. That's essentially the same reason why statistical
mechanics is so successful, as discovered a long time ago by our beloved
guru E.T. (phone home) Jaynes and still, after all these years, unable
to be understood even by so reputable a wiff (well-in-formed-fellow)
as yourself who by the way even presented the problem of estimation of
mixtures with an infinite number of components at one of the MaxEnt
workshops.

OK back to the specifics of your gedankenexperiment. All the prior is
saying is that, in the absence of all other information, the means of
the rare components should be considered more uncertain than the means
of the common components. You may not like that but you have to live
with it. It doesn't matter whether you or I or anyone likes it or not.
If you say, for example: "what the heck I feel intuitively that an
ignorant prior should assign equal uncertainties to all the means
independently of the weights". Then Theorem 1 will tell you that your
intuitive feeling is a superstition. By the way, uncommon components
are observed less often than common ones so more a priori uncertainty
for the mean sounds good to me, again in the absence of all other
information.

Radford Neal

Apr 5, 2002, 9:48:03 AM
>> Radford Neal:

>> Consider the problem in an example context: You are interested in
>> how far beetles travel during a day. With really advanced satellite
>> observation, you can track beetles flying around, but you can't
>> identify the species of beetle. You know there are five species of
>> beetle in a certain forest for which you have data. You therefore
>> model the distribution of distance travelled in a day as a mixture
>> of five normal distributions.
>>
>> Suppose we don't know much about how common the different species are,
>> or how much the beetles travel in a day - the situation to which you
>> say your method applies.
>>
>> The result of your method is a prior which says that the less common
>> beetles are likely to travel very far in a day, or not very far at
>> all, whereas the more common beetles are likely to travel a more
>> moderate distance. This seems to drastically depart from a prior that
>> embodies no precise information. It seems to correspond to a very
>> specific biological theory claiming that rare species have to either
>> travel a lot in a day (to avoid being set upon by gangs of competing
>> beatles?), or alternatively, to stay put. In no way can I accept that
>> this is a prior that will "let the data values speak for themselves".
>>

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>Nice example. Wrong interpretation.
>First of all, you can't quarrel with a theorem. The entropic prior for
>the parameters of the mixture i.e. for the means, sds and weights is
>proven to be the most difficult to discriminate from an independent
>model on the space (data,parameters). Thus, in the absence of all
>other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY
>THAT IT HAS TO HAVE in order to be the most ignorant about the data.
>That's the beauty of mathematics. Once you accept the proof of Theorem
>1 you are stuck with it.

Why should I want a prior that is "most difficult to discriminate from
an independent model"? Or one that is "most ignorant about the data"?
And assuming I did want these things, why should I accept that your
mathematical formulation of what it means to be "ignorant" is the
correct one? These are not mathematical questions which can be
settled by a proof.

>OK back to the specifics of your gedankenexperiment. All the prior is
>saying is that, in the absence of all other information, the means of
>the rare components should be considered more uncertain than the means
>of the common components. You may not like that but you have to live
>with it. It doesn't matter whether you or I or anyone likes it or not.
>If you say, for example: "what the heck I feel intuitively that an
>ignorant prior should assign equal uncertainties to all the means
>independently of the weights". Then Theorem 1 will tell you that your
>intuitive feeling is a superstition.

No, what Theorem 1 tells ME is that your concept of "ignorance" is
flawed. Mathematical formulations of such concepts have to be tested
by checking their consequences in situations where intuitions are
clear. (After all, how else could one verify that the formulation is
correct?) Your formulation fails this test in this example.

>By the way, uncommon components
>are observed less often than common ones so more a priori uncertainty
>for the mean sounds good to me, again in the absence of all other
>information.

That is a good reason why the POSTERIOR uncertainty in the means of
the rare components will be greater. Why you think that this natural
effect of not having much data should be increased by also increasing
the PRIOR uncertainty is a mystery to me.

Radford Neal

Carlos C. Rodriguez

Apr 5, 2002, 10:21:44 AM
Let me summarize the discussion. We have:
1) John Bailey: http://www.frontiernet.net/~jmb184/
2) Mike Hardy: http://www-math.mit.edu/~hardy/
3) Herman Rubin: http://www.stat.purdue.edu/people/hrubin/

Bailey: Entropy is an important concept in Bayesian Inference.
Hardy: Few people working in Bayesian Inference care about Entropy.
Rubin: The people that use entropy or whatever other so called
"neutral" priors are using unjustified computational copouts.

My position:
1) Hurray for Bailey!
2) Sure Mike but they should know better.
3) I disagree with Rubin's position with all the energy in my
reproductive system.

First of all, as far as it is known today, Entropy, Probability, and
(more recently discovered) Codes (as in binary codes) are pretty much
aspects of the same thing. At a fundamental level, Entropy is just the
number of available distinguishable possibilities on a log scale, so
that exp(Entropy) = N, i.e. exp(-Entropy) = 1/N = the uniform
probability over the space of distinguishable states. Moreover, there
is a one-to-one correspondence between probability distributions and
codes (or rather the code lengths of prefix-free codes)
(e.g. see Grunwald's tutorial
http://quantrm2.psy.ohio-state.edu/injae/workshop.htm ). Thus, anyone
caring about the meaning and use of Probability theory (Bayesians
or members of the National Rifle Association alike) ought to care
about Entropy and Codes.
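
A toy illustration of that probability/code-length correspondence (my own, with made-up probabilities): ideal Shannon code lengths are -log2 p(x), they satisfy the Kraft inequality that characterizes prefix-free codes, and their expected value is the entropy.

import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}     # made-up distribution
lengths = {s: -math.log2(q) for s, q in p.items()}    # ideal code lengths
print(lengths)                                        # 1, 2, 3, 3 bits
print(sum(2.0 ** -L for L in lengths.values()))       # Kraft sum = 1.0
print(sum(q * lengths[s] for s, q in p.items()))      # expected length = 1.75
                                                      # bits = the entropy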

Second. More than seventy (70) years of DeFinetti/Savage subjectivism
have produced zip beyond beautiful suntans from the coasts of Spain!

Third. Current action in fundamental statistical inference (aside from
computational issues) is about objective (or as objective as possible)
quantifications of prior information. Information geometry, MDL
principle, Entropic Priors, Bayesian Networks and Statistical Learning
Theory are pushing the envelope.

hru...@odds.stat.purdue.edu (Herman Rubin) wrote in message news:<a8fhr9$1q...@odds.stat.purdue.edu>...

Michael J Hardy

Apr 5, 2002, 12:20:38 PM
Radford Neal (rad...@cs.toronto.edu) wrote:

> Why should I want a prior that is "most difficult to discriminate from
> an independent model"? Or one that is "most ignorant about the data"?


Your prior needs to incorporate your ignorance if you are ignorant.
Tomorrow's weather and the outcome of a coin toss are _conditionally_
_independent_given_my_knowledge_ if I have no knowledge of any connection
between them.

Mike Hardy

Radford Neal

Apr 5, 2002, 2:14:50 PM
> Radford Neal (rad...@cs.toronto.edu) wrote:
>
>> Why should I want a prior that is "most difficult to discriminate from
>> an independent model"? Or one that is "most ignorant about the data"?

Michael J Hardy <mjh...@mit.edu> wrote:>
>
> Your prior needs to incorporate your ignorance if you are ignorant.

There's a logical gap between saying "this prior expresses ignorance
about the data" and "I'm ignorant, therefore I should use this prior".

The first statement implicitly assumes that there's only one possible
"state of ignorance". But it's not clear that real people can be
ignorant in only one way.

As evidence for this logical gap, one need only see that "objective"
Bayesians have come up with numerous priors that all supposedly
express ignorance. It's like the joke about standards for programming
languages - "If one standard is good, then three standards must be
even better!".

>Tomorrow's weather and the outcome of a coin toss are _conditionally_
>_independent_given_my_knowledge_ if I have no knowledge of any connection
>between them.

If you're SURE that there's no connection, then you're not ignorant at
all about the relationship (however ignorant you may be about
individual coin tosses and thunderstorms). If you're NOT sure that
there's no relationship, then the independence applies only to the
FIRST coin toss and thunderstorm. Once you are dealing with more than
one toss, you need to use a prior that expresses how likely the
various possible relationships are. This is related to the fallacy
behind Jaynes' contention that the laws of statistical mechanics can be
derived from the maximum entropy principle, without the need for any
input of physical information.
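
A toy calculation of that point in the simplest setting (mine, not Radford's): put a uniform prior on an unknown coin bias theta. Any single toss is marginally fair, but two tosses are no longer independent, because the first toss carries information about theta.

import numpy as np

theta = np.linspace(0.0, 1.0, 1_000_001)   # grid over the unknown bias
p_h1 = theta.mean()                        # P(first toss = heads)     ~ 1/2
p_hh = (theta ** 2).mean()                 # P(first two both heads)   ~ 1/3
print(p_h1, p_hh, p_hh / p_h1)             # P(H2 | H1) ~ 2/3, not 1/2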

Radford Neal

Carlos C. Rodriguez

Apr 5, 2002, 3:00:55 PM
rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr5.0...@jarvis.cs.toronto.edu>...
> >> Radford Neal:

>
> Why should I want a prior that is "most difficult to discriminate from
> an independent model"? Or one that is "most ignorant about the data"?
> And assuming I did want these things, why should I accept that your
> mathematical formulation of what it means to be "ignorant" is the
> correct one? These are not mathematical questions which can be
> settled by a proof.

Recall: X and Y independent iff:
1) P(X|Y) = P(X)
and
2) P(Y|X) = P(Y)
provided both conditionals exist; or, more conveniently but less
enlighteningly,
X and Y are independent iff

P(X and Y) = P(X) P(Y)

By "X is ignorant about Y" I mean X is independent of Y. PERIOD.
How much more ignorant of each other can X and Y be?
Are you suggesting changing the meaning of independence?

>
> >OK back to the specifics of your gedankenexperiment. All the prior is
> >saying is that, in the absence of all other information, the means of
> >the rare components should be considered more uncertain than the means
> >of the common components. You may not like that but you have to live
> >with it. It doesn't matter whether you or I or anyone likes it or not.
> >If you say, for example: "what the heck I feel intuitively that an
> >ignorant prior should assign equal uncertainties to all the means
> >independently of the weights". Then Theorem 1 will tell you that your
> >intuitive feeling is a superstition.
>
> No, what Theorem 1 tells ME is that your concept of "ignorance" is
> flawed. Mathematical formulations of such concepts have to be tested

There is no mysterious concept of "ignorance" anymore. It is JUST
INDEPENDENCE!
(see the above)

> >By the way, uncommon components
> >are observed less often than common ones so more a priori uncertainty
> >for the mean sounds good to me, again in the absence of all other
> >information.
>
> That is a good reason why the POSTERIOR uncertainty in the means of
> the rare components will be greater. Why you think that this natural
> effect of not having much data should be increased by also increasing
> the PRIOR uncertainty is a mystery to me.
>

BECAUSE: by assumption the only information available is the likelihood.
The ignorant prior is consistent only with the info explicitly
provided, in this case by the likelihood. The parameters for the
uncommon components obviously need to be more uncertain; otherwise you
would be claiming a source of information other than the likelihood.
Think about it this way. If you assume that you can ONLY learn about
the beetles by observing them, then you can only know more about the
ones that you can observe more. Whatever prior information you are
going to provide about the rare species of beetles would have to have
come from past observations, and by assumption these are scarcer;
ergo more prior uncertainty is simply consistent with that.

Radford Neal

Apr 5, 2002, 4:00:09 PM