# Explanation of Maximum Entropy

### John Uebersax

Aug 24, 2004, 10:12:52 AM
Could anyone kindly suggest online resources that give a simple,
clear, basic explanation of Maximum Entropy as it applies to statistical
estimation?

I understand that maxent may have advantages in certain applications,
like image processing. What is not clear to me is whether it has
practical implications for more routine applications. (In other
words, part of my question is whether this is a fundamental, broadly
applicable innovation--or revolution--within statistics).

If it helps, we can focus the question by posing the specific example
of coinflipping. If one observes 6 heads out of 10 flips of a
potentially biased coin, then one can use standard Bayesian methods to
infer the posterior probability distribution of Pr(heads)--i.e., the
probability density for each value of Pr(heads) in the range (0, 1).
If the prior distribution is uniform, then the posterior distribution
would have a specific beta distribution form with a mode of .6.
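
The arithmetic behind this example can be checked with a minimal sketch, assuming the uniform prior is written as Beta(1, 1) so that the posterior is Beta(1 + heads, 1 + tails). The variable names are illustrative only.

```python
# Sketch: Beta posterior for 6 heads in 10 flips under a uniform Beta(1, 1) prior.
heads, flips = 6, 10
prior_a, prior_b = 1.0, 1.0          # uniform prior on Pr(heads)

post_a = prior_a + heads             # add the observed heads
post_b = prior_b + (flips - heads)   # add the observed tails

# Mode and mean of a Beta(a, b) distribution with a, b > 1.
posterior_mode = (post_a - 1) / (post_a + post_b - 2)
posterior_mean = post_a / (post_a + post_b)

print(f"posterior: Beta({post_a:g}, {post_b:g})")
print(f"mode = {posterior_mode:.3f}, mean = {posterior_mean:.3f}")
```

The mode matches the observed frequency of heads, while the posterior mean is pulled slightly toward 0.5 by the prior.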

How would this be affected according to a Maximum Entropy approach?

p.s. I understand that the standard Bayesian approach might use a
better prior than a uniform distribution.

--
John Uebersax

### Ron Hardin

Aug 24, 2004, 11:20:52 AM

There's a long and enlightening essay by Edwin T. Jaynes,
"Where do we stand on Maximum Entropy?", in _The Maximum Entropy
Formalism_, Levine and Tribus, eds., MIT Press 1978

He derives dice geometry imperfections from R. Wolf's "random experiments"
data from dice tossing from 1850-1890, comparing its power with
chi-squared. It's very well written, though I haven't read it for
years. It predicts a bright future for Maximum Entropy.

You get chi-squared by dropping some terms from Maximum Entropy, as I recall.
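
One possible reading of this remark, offered as a sketch rather than as Jaynes's own derivation: the entropy-based statistic 2 Σ O ln(O/E) has the Pearson chi-squared statistic Σ (O − E)²/E as its leading term when expanded around O = E. The counts below are hypothetical and are not Wolf's data.

```python
import math

# Hypothetical counts for 600 tosses of a six-sided die (not Wolf's data).
observed = [105, 95, 98, 102, 110, 90]
n = sum(observed)
expected = [n / 6] * 6               # fair-die hypothesis

# Entropy-based statistic: 2 * N * KL(observed frequencies || hypothesis).
g_stat = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

# Pearson chi-squared: the second-order term in the expansion of g_stat.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"G = {g_stat:.4f}, chi-squared = {chi2:.4f}")
```

For counts this close to their expectations the two statistics nearly coincide; they drift apart as the deviations grow.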
--
Ron Hardin
rhha...@mindspring.com

On the internet, nobody knows you're a jerk.

### Bob Ehrlich

Aug 24, 2004, 10:07:30 PM
Given K samples each with l observations where l is large

Pool the K samples and so have K*l observations in the pooled sample

divide the sample into J class intervals where J<<Kl

define the class intervals such that each has K*l/J observations; the class
intervals will commonly be of variable width (condition of maxent)

use these class intervals for all samples.

The samples are now in a state of maximum unbiased contrast such that,
if there is a difference, you have the best chance to detect it
(using, say, the K-S test) and you have the least chance of seeing
differences that are artifacts of your class intervals.

The only fly in the ointment is defining J
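
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the sizes K, l and J, the simulated Gaussian data, and the helper `bin_counts` are all hypothetical, and the cut points are simply taken at the pooled quantiles.

```python
import random
from bisect import bisect_right

random.seed(0)
K, l, J = 3, 1000, 10                 # hypothetical sizes; J << K*l

# Hypothetical data: K samples of l observations each.
samples = [[random.gauss(0.1 * k, 1.0) for _ in range(l)] for k in range(K)]

# Pool everything and place J-1 interior cut points at the pooled quantiles,
# so each class interval holds about K*l/J pooled observations.
pooled = sorted(x for s in samples for x in s)
edges = [pooled[(j * len(pooled)) // J] for j in range(1, J)]

def bin_counts(sample):
    """Count how many observations of `sample` fall in each class interval."""
    counts = [0] * J
    for x in sample:
        counts[bisect_right(edges, x)] += 1
    return counts

pooled_counts = bin_counts(pooled)            # roughly equal by construction
per_sample = [bin_counts(s) for s in samples] # these are what get compared
print(pooled_counts)
for counts in per_sample:
    print(counts)
```

The intervals come out with unequal widths but equal pooled occupancy; the choice of J remains open, as noted above.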

### John Bailey

Aug 25, 2004, 1:48:24 PM
On 24 Aug 2004 07:12:52 -0700, jsueb...@yahoo.com (John Uebersax)
wrote:

>Could anyone kindly suggest online resources that give a simple,
>clear, basic explanation of Maximum Entropy as it applies statistical
>estimation?

http://omega.albany.edu:8008/JaynesBook makes E. T. Jaynes's book on
Bayesian probability available on-line. Chapter 11 deals with
entropy. Jaynes literally wrote the book on the subject.

http://xyz.lanl.gov/abs/hep-ph/9512295 by G. D'Agostini, "Probability
and Measurement Uncertainty in Physics - a Bayesian Primer", may be
the most accessible paper available on the internet.

http://omega.albany.edu:8008/maxent.html is Carlos Rodriguez' web
page, a collection of links on Maximum entropy.

Godambe's Paradox provides an example of the breakdown of the Maximum
Entropy principle, which leads to discussions such as
"On the Foundations of Likelihood Principle", a paper by De Cristofaro:
http://www.ds.unifi.it/ricerca/pagperson/docenti/varie_docenti/decristofaro/On%20the%20Foundations%20of%20Likelihood%20Principle.pdf

Lastly, http://xyz.lanl.gov/abs/quant-ph/0106125 is a sampling of
multiple efforts to derive all of physics from statistical entropic
concepts.

### Radford Neal

Aug 25, 2004, 2:39:03 PM
John Uebersax <jsueb...@yahoo.com> wrote:

>Could anyone kindly suggest online resources that give a simple,
>clear, basic explanation of Maximum Entropy as it applies statistical
>estimation?

No. It's not possible to explain this method because it doesn't make
any sense. The early papers of Jaynes may seem superficially
persuasive, but as soon as you try to pin things down, they make no sense.

The standard maximum entropy approach is to find the distribution
with maximum entropy subject to the expectations of certain functions
having given values. Supposedly, one is supposed to have "observed"
that the expectations have these values. But the probability
distribution in question is also said to be subjective, in the sense that
it reflects the knowledge/beliefs of some particular person. You
can't "observe" expectations with respect to a subjective distribution,
especially since you haven't fixed this distribution yet (else you wouldn't
need maximum entropy).
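
For concreteness, the recipe described above (maximize entropy subject to a given expectation) can be sketched on the die example that often appears in the MaxEnt literature. The constraint value 4.5 is an assumption taken as given, which is exactly the step the objection above questions; the sketch only shows the mechanics. With a single mean constraint the solution has the form p(x) ∝ exp(λx), and λ is found here by bisection.

```python
import math

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5                      # the constraint value, taken as given

def maxent_pmf(lam):
    """Exponential-family form of the maximum entropy solution."""
    weights = [math.exp(lam * x) for x in faces]
    z = sum(weights)
    return [w / z for w in weights]

def mean(pmf):
    return sum(x * p for x, p in zip(faces, pmf))

# The mean increases monotonically with lam, so bisection finds the multiplier.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(maxent_pmf(mid)) < target_mean:
        lo = mid
    else:
        hi = mid

pmf = maxent_pmf((lo + hi) / 2)
entropy = -sum(p * math.log(p) for p in pmf)
print([round(p, 4) for p in pmf], round(entropy, 4))
```

With no constraint beyond normalization (λ = 0) the same form reduces to the uniform distribution, whose entropy log 6 is the upper bound.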

>If it helps, we can focus the question by posing the specific example
>of coinflipping. If one observes 6 heads out of 10 flips of a
>potentially biased coin, then one can use standard Bayesian methods to
>infer the posterior probability distribution of Pr(heads)--i.e., the
>probability density for each value of Pr(heads)in the range (0, 1).
>If the prior distribution is uniform, then the posterior distribution
>would have a specific beta distribution form with a mode of .6.
>
>How would this be affected according to a Maximum Entropy approach?

Maximum entropy is not consistent with Bayesian methods. The maximum
entropy people seem to have realized this about ten years ago, and have
therefore more-or-less abandoned the method (rather quietly, however).

----------------------------------------------------------------------------
Dept. of Statistics and Dept. of Computer Science rad...@utstat.utoronto.ca
----------------------------------------------------------------------------

### John Bailey

Aug 25, 2004, 7:37:56 PM
wrote:

>John Uebersax <jsueb...@yahoo.com> wrote:
>
>>Could anyone kindly suggest online resources that give a simple,
>>clear, basic explanation of Maximum Entropy as it applies statistical
>>estimation?
>
>No. It's not possible to explain this method because it doesn't make
>any sense. The early papers of Jaynes may seem superficially
>persuasive, but as soon as you try to pin things down, they make no sense.

a replay of a thread on this subject on this newsgroup to which Neal
contributed considerably along this line. The thread also contains
Rodriguez' rebuttal of these assertions.

I will leave the judgement of rigor to mathematicians. (Quis custodiet
ipsos custodes?) Judgments about usefulness are the prerogative of
engineers. As an engineer, I have found Jaynes and the works of those
building on his work to be enormously valuable. I am appalled at the
amount of religious fervor attending the criticism of Bayesian
methods. To me they are logical and extremely useful. If they are
flawed in rigor, perhaps a Laplace or Fourier to his Heaviside will
emerge.
John Bailey

### Aleks Jakulin

Aug 26, 2004, 12:17:22 PM
John Uebersax wrote:
> Could anyone kindly suggest online resources that give a simple,
> clear, basic explanation of Maximum Entropy as it applies
> statistical estimation?

I am fond of the interpretation in this paper:

P.D. Grünwald and A.P. Dawid. Game theory, maximum entropy, minimum
discrepancy, and robust Bayesian decision theory. Annals of Statistics
32(4), pages 1367-1433, 2004. (http://www.cwi.nl/~pdg/ftp/AOS231.pdf)

Basically, entropy is very much like a loss function: it is the
expected negative log-likelihood of a sample from a probability mass function;
differential entropy is the expected negative log-(likelihood-density) of a
sample from a PDF. Say you have a number of probabilistic models, and
you're unsure of which one to pick. The right thing to do is to pick
the most timid one, the one that has the highest entropy of all that
are consistent with the constraints. The Gaussian distribution is the
maximum entropy distribution if your constraints are the first two
moments, for example.
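
The claim about the Gaussian can be illustrated with a small sketch that compares closed-form differential entropies of three densities sharing the same variance. The choice of Laplace and uniform as comparison densities, and the value of `sigma`, are assumptions made purely for illustration.

```python
import math

sigma = 1.0   # common standard deviation (hypothetical value)

# Closed-form differential entropies (in nats) for densities with variance sigma**2.
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
h_laplace = 1 + math.log(2 * sigma / math.sqrt(2))   # Laplace scale b = sigma/sqrt(2)
h_uniform = math.log(sigma * math.sqrt(12))          # uniform width = sigma*sqrt(12)

for name, h in [("Gaussian", h_gauss), ("Laplace", h_laplace), ("uniform", h_uniform)]:
    print(f"{name:8s} {h:.4f}")
```

Among these three the Gaussian has the largest entropy, consistent with it being the most "timid" choice once only the mean and variance are fixed.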

You can view maximum entropy as the timid mirror face of the bold
maximum likelihood. Most models have both faces: we find maximum
likelihood parameters of maximum entropy distributions. Or, you can
maximize the entropy given likelihood-maximizing constraints. If you'll
excuse my use of esotericism: MaxEnt is yin, maximum likelihood is
yang.

> If it helps, we can focus the question by posing the specific example
> of coinflipping. If one observes 6 heads out of 10 flips of a
> potentially biased coin, then one can use standard Bayesian methods to
> infer the posterior probability distribution of Pr(heads)--i.e., the
> probability density for each value of Pr(heads) in the range (0, 1).
> If the prior distribution is uniform, then the posterior distribution
> would have a specific beta distribution form with a mode of .6.
>
> How would this be affected according to a Maximum Entropy approach?

With the constraint of having two outcomes, MaxEnt would yield you the
PMF of [0.5,0.5]. You could use this as a prior in a Bayesian
approach. There is a general habit of using maximum entropy priors in
Bayesian statistics.

A more interesting example is having two coins. The maximum entropy
model would assume the coins to be independent. As you may know,
loglinear models fitted with a GIS procedure have the property of
having maximum entropy given the marginals as constraints. E.g.:

Good, I. J. (1963). Maximum entropy for hypothesis formulation. The
Annals of Mathematical Statistics, 34, 911-934.
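
The two-coin statement can be checked numerically. The sketch below uses iterative proportional fitting, a close relative of GIS swapped in here because it is shorter to write, starting from the uniform joint table. The marginal values are hypothetical.

```python
# Marginal constraints for two coins (hypothetical values).
p_a = [0.6, 0.4]     # coin A: Pr(heads), Pr(tails)
p_b = [0.3, 0.7]     # coin B: Pr(heads), Pr(tails)

# Start from the uniform joint table and rescale rows/columns to match margins.
joint = [[0.25, 0.25], [0.25, 0.25]]
for _ in range(20):
    for i in range(2):                       # match coin A's margin (rows)
        row = sum(joint[i])
        joint[i] = [v * p_a[i] / row for v in joint[i]]
    for j in range(2):                       # match coin B's margin (columns)
        col = joint[0][j] + joint[1][j]
        for i in range(2):
            joint[i][j] *= p_b[j] / col

# The independence model: product of the two marginals.
independent = [[a * b for b in p_b] for a in p_a]
print(joint)
print(independent)
```

Starting from the uniform table, fitting only the marginals never introduces an interaction between the coins, so the fitted table agrees with the product of the marginals.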

--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana,
Slovenia.

### Aleks Jakulin

Aug 27, 2004, 3:26:12 AM

> Maximum entropy is not consistent with Bayesian methods.

You should interpret maximum entropy as decision-theoretic model/prior
selection, which follows or precedes Bayesian modelling. There are
several practical implementations that sample around the _constrained_
posterior space, picking the maximum entropy model found. The maximum
entropy model is interpreted as the most robust, the smoothest, the
worst-case optimal.

> The maximum
> entropy people seem to have realized this about ten years ago, and have
> therefore more-or-less abandoned the method (rather quietly,
> however).

It is my impression that MaxEnt has been making a comeback over the past
two years, judging from conferences such as ICML and UAI.

Aug 29, 2004, 11:13:35 AM

>> If it helps, we can focus the question by posing the specific example
>> of coinflipping. If one observes 6 heads out of 10 flips of a
>> potentially biased coin, then one can use standard Bayesian methods...
>>
>> How would this be affected according to a Maximum Entropy approach?

In article <cgl2ei$mso$1...@planja.arnes.si>,

Aleks Jakulin <a_jakulin@@hotmail.com> wrote:
>
>With the constraint of having two outcomes, MaxEnt would yield you the
>PMF of [0.5,0.5].

Why? The data is 6 heads out of 10, not 5 out of 10. Shouldn't the
best distribution pay at least a little attention to the data?

I suspect you have in mind some sort of frequentist test that fails to
reject the null hypothesis that the distribution is [0.5,0.5], and
since this distribution has maximum entropy, you go for it. Of
course, you WILL reject this null if you set the significance level
high enough. So the maximum entropy result, often touted as being an
"objective" solution to the problem, actually depends on a totally
arbitrary choice of significance level.
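
The dependence on the significance level can be made concrete with a sketch of an exact two-sided binomial test of the null Pr(heads) = 0.5 for 6 heads in 10 flips. The specific test and the significance levels tried are assumptions chosen for illustration, not something specified in the post.

```python
from math import comb

n, heads = 10, 6

# Exact two-sided p-value for H0: Pr(heads) = 0.5, summing the outcomes that
# are at least as far from n/2 as the observed count.
dist = abs(heads - n / 2)
p_value = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dist) / 2**n

print(f"p-value = {p_value:.4f}")
for alpha in (0.05, 0.5, 0.8):        # illustrative significance levels
    print(alpha, "reject" if p_value < alpha else "fail to reject")
```

At conventional levels the null survives, but a sufficiently large significance level rejects it, which is the arbitrariness described above.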

### Aleks Jakulin

Aug 29, 2004, 1:44:13 PM
> Aleks Jakulin <a_jakulin@@hotmail.com> wrote:
> >
> >With the constraint of having two outcomes, MaxEnt would yield you the
> >PMF of [0.5,0.5].

> Why? The data is 6 heads out of 10, not 5 out of 10. Shouldn't the
> best distribution pay at least a little attention to the data?

MaxEnt isn't very useful for the coin toss example. What would be a
non-trivial constraint derived from the data that would not determine
the PMF? All I can imagine MaxEnt to do in this case is

a) (maximum entropy prior) if you apply it to determining the
parameters of a Beta prior, it would prescribe you (any) _symmetric_
Beta prior (should your constraint be a Beta distribution with a
priors. Has anyone seen any practitioner seriously using those?

b) (maximum entropy model) [0.5,0.5] (in the absence of sensible
constraints) and

c) (maximum entropy posterior) of all the posterior PMF's, select the
one with the maximum entropy; So, under c) one wouldn't integrate over
the prior, and would not use the bold MAP

h' = argmax_h P(h|d)

but instead use the timid MEP as in

h' = argmax_h H(P(D|h)) where h is a particular hypothesis with a
non-zero posterior: P(h|d) > 0

here H(P(D|h)) = -Sum[ P(x|h) log P(x|h); x \in D ] (D is the input space, x
is a point in D, d is the given input sample {x1,x2,...,xn})

For this to be non-trivial, the prior should be zero for some
parameter settings.
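
Option c) can be sketched for the coin example as follows. The candidate hypotheses and the uniform prior over them are hypothetical; the prior is taken to be zero for every other value of Pr(heads), which is what makes the choice non-trivial.

```python
import math

heads, tails = 6, 4
# Hypothetical hypotheses with non-zero prior; the prior is zero everywhere else.
candidates = [0.2, 0.45, 0.6, 0.8]
prior = {h: 1 / len(candidates) for h in candidates}

def likelihood(h):
    return h**heads * (1 - h)**tails

def entropy(h):
    """Entropy (nats) of a single flip under hypothesis h."""
    return -(h * math.log(h) + (1 - h) * math.log(1 - h))

unnorm = {h: prior[h] * likelihood(h) for h in candidates}
total = sum(unnorm.values())
posterior = {h: w / total for h, w in unnorm.items()}

supported = [h for h in candidates if posterior[h] > 0]
h_map = max(supported, key=posterior.get)   # bold choice: most probable hypothesis
h_mep = max(supported, key=entropy)         # timid choice: highest entropy hypothesis
print("MAP:", h_map, "MEP:", h_mep)
```

MAP follows the data, while MEP picks whichever supported hypothesis is closest to a fair coin, regardless of how much posterior mass it carries.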

> I suspect you have in mind some sort of frequentist test that fails to
> reject the null hypothesis that the distribution is [0.5,0.5], and
> since this distribution has maximum entropy, you go for it.

Not this time. :)

Aleks

### Tom Loredo

Aug 31, 2004, 7:29:00 PM

Hi folks-

>
> No. It's not possible to explain this method because it doesn't make
> any sense.

Actually, I think it makes plenty of sense. It simply solves a class
of problems that never arises in practice. (Well, maybe not quite
"never.")

> The standard maximum entropy approach is to find the distribution
> with maximum entropy subject to the expectations of certain functions
> having given values. Supposedly, one is supposed to have "observed"
> that the expectations have these values. But the probability
> distribution in question is also said to be subjective, in the sense that
> it reflects the knowledge/beliefs of some particular person. You
> can't "observe" expectations with respect to a subjective distribution,
> especially since you haven't fixed this distribution yet (else you wouldn't
> need maximum entropy).

Yes, this is the crux of the issue. I think that if somehow one
has nontrivial "testable information" (as Jaynes called it), the
MaxEnt approach to assigning a prior based on this info is sound
(my favorite derivation is that of Shore & Johnson, in part because
of how explicitly it notes that one must still have a prior
dist'n as input to MaxEnt). It seems to me that in some settings
it might be possible to have useful testable information (perhaps
from a symmetry argument or a theoretical calculation in a
physics setting). But unfortunately, none of the examples
commonly given in a MaxEnt analysis really have testable information.

> Maximum entropy is not consistent with Bayesian methods. The maximum
> entropy people seem to have realized this about ten years ago, and have
> therefore more-or-less abandoned the method (rather quietly, however).

Well, I think Radford is here referring to MaxEnt image analysis.
I think in that setting entropy-based priors have no fundamental
justification (which is not to say that they might not prove
useful in some settings), and have been largely abandoned by their
earliest and most vocal advocates. But this is different from
using the principle of maximum entropy to assign probability
distributions using testable information. (One of the problems
with "MaxEnt methods" is that multiple methods share the same
name!) This latter application appears valid to me and
provides a missing piece of the Bayesian puzzle in a particular
class of problems. I've just never seen a real problem arise
in that class.

Cheers,
Tom

--

To respond by email, replace "somewhere" with "astro" in the