Aug 24, 2004, 10:12:52 AM8/24/04

Could anyone kindly suggest online resources that give a simple,

clear, basic explanation of Maximum Entropy as it applies to statistical

estimation?

I understand that maxent may have advantages in certain applications,

like image processing. What is not clear to me is whether it has

practical implications for more routine applications. (In other

words, part of my question is whether this is a fundamental, broadly

applicable innovation--or revolution--within statistics).

If it helps, we can focus the question by posing the specific example

of coinflipping. If one observes 6 heads out of 10 flips of a

potentially biased coin, then one can use standard Bayesian methods to

infer the posterior probability distribution of Pr(heads)--i.e., the

probability density for each value of Pr(heads) in the range (0, 1).

If the prior distribution is uniform, then the posterior distribution

would have a specific beta distribution form with a mode of .6.

How would this be affected according to a Maximum Entropy approach?

p.s. I understand that the standard Bayesian approach might use a

better prior than a uniform distribution.
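
The Bayesian side of the coin example can be checked with a short sketch. It assumes the uniform prior is written as Beta(1, 1), so that 6 heads and 4 tails give a Beta(7, 5) posterior; the function name `beta_pdf` and the grid check are illustrative choices, not part of the original question.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

heads, tails = 6, 4

# Uniform prior = Beta(1, 1); conjugate update adds the observed counts.
a, b = 1 + heads, 1 + tails              # posterior is Beta(7, 5)

posterior_mode = (a - 1) / (a + b - 2)   # closed form, valid for a, b > 1
posterior_mean = a / (a + b)

# Crude numerical cross-check of the mode on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
grid_mode = max(grid, key=lambda x: beta_pdf(x, a, b))

print("posterior mode:", posterior_mode)
print("posterior mean:", posterior_mean)
print("grid mode:     ", grid_mode)
```

The mode sits at 6/10 as stated in the question, while the posterior mean is 7/12, slightly pulled toward 0.5 by the prior.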

--

John Uebersax

Aug 24, 2004, 11:20:52 AM8/24/04

There's a long and enlightening essay by Edwin T. Jaynes

``Where do we stand on Maximum Entropy?'' in _The Maximum Entropy

Formalism_ Levine and Tribus, eds., MIT Press 1978

He derives dice geometry imperfections from R. Wolf's ``random experiments''

data from dice tossing from 1850-1890, comparing its power with

chi-squared. It's very well written, though I haven't read it for

years. It predicts a bright future for Maximum Entropy.

You get chi-squared by dropping some terms from Maximum Entropy, as I recall.

--

Ron Hardin

rhha...@mindspring.com

On the internet, nobody knows you're a jerk.

Aug 24, 2004, 10:07:30 PM8/24/04

Pool the K samples, each of size l, so that the pooled sample has K*l observations.

Divide the pooled sample into J class intervals, where J << K*l.

Define the class intervals such that each has K*l/J observations; the class intervals

will commonly be of variable width (a condition of maxent).

Use these class intervals for all samples.

The samples are now in a state of maximum unbiased contrast such that,

if there is a difference you have the best chance to detect it

(using, say, the K-S test) and you have the least chance of seeing

differences that are artifacts of your class intervals.

The only fly in the ointment is defining J
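
The pooling recipe above can be sketched as follows. This is only an illustration under stated assumptions: the two toy samples, the choice J = 4, and the helper names `equal_count_edges` and `bin_counts` are all hypothetical, and cut points are placed midway between neighbouring pooled observations (ties in the data could make the counts unequal).

```python
def equal_count_edges(pooled, n_classes):
    """Cut points splitting the pooled data into n_classes bins of equal count."""
    data = sorted(pooled)
    n = len(data)
    edges = []
    for j in range(1, n_classes):
        k = j * n // n_classes                      # first index of the next bin
        edges.append((data[k - 1] + data[k]) / 2)   # midpoint between neighbours
    return edges

def bin_counts(sample, edges):
    """Count observations of one sample in each class interval."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        idx = sum(1 for e in edges if x > e)        # number of cut points below x
        counts[idx] += 1
    return counts

# Toy data: K = 2 samples of size l = 8, pooled into J = 4 class intervals.
sample_a = [1, 2, 3, 4, 5, 6, 7, 8]
sample_b = [5, 6, 7, 8, 9, 10, 11, 12]
pooled = sample_a + sample_b

edges = equal_count_edges(pooled, 4)
print("cut points:   ", edges)
print("pooled counts:", bin_counts(pooled, edges))
print("sample a:     ", bin_counts(sample_a, edges))
print("sample b:     ", bin_counts(sample_b, edges))
```

The pooled sample is spread evenly over the intervals, which are of unequal width, and the per-sample counts can then be compared with whatever test is preferred.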

Aug 25, 2004, 1:48:24 PM8/25/04

On 24 Aug 2004 07:12:52 -0700, jsueb...@yahoo.com (John Uebersax)

wrote:

>Could anyone kindly suggest online resources that give a simple,

>clear, basic explanation of Maximum Entropy as it applies statistical

>estimation?

http://omega.albany.edu:8008/JaynesBook makes E. T. Jaynes' book on

Bayesian probability available on-line. Chapter 11 deals with

entropy. Jaynes literally wrote the book on the subject.

http://xyz.lanl.gov/abs/hep-ph/9512295 by G. D'Agostini, "Probability

and Measurement Uncertainty in Physics: a Bayesian Primer", may be

the most accessible paper available on the internet.

http://omega.albany.edu:8008/maxent.html is Carlos Rodriguez' web

page, a collection of links on Maximum entropy.

Godambe's Paradox provides an example of the breakdown of the Maximum

Entropy principle, which leads to discussions such as

"On the Foundations of Likelihood Principle", a paper by de Cristofaro:

http://www.ds.unifi.it/ricerca/pagperson/docenti/varie_docenti/decristofaro/On%20the%20Foundations%20of%20Likelihood%20Principle.pdf

Lastly, http://xyz.lanl.gov/abs/quant-ph/0106125 is a sampling of

multiple efforts to derive all of physics from statistical entropic

concepts.

John Bailey

http://home.rochester.rr.com/jbxroads/mailto.html

Aug 25, 2004, 2:39:03 PM8/25/04

In article <f6f3f967.04082...@posting.google.com>,

John Uebersax <jsueb...@yahoo.com> wrote:

>Could anyone kindly suggest online resources that give a simple,

>clear, basic explanation of Maximum Entropy as it applies statistical

>estimation?

No. It's not possible to explain this method because it doesn't make

any sense. The early papers of Jaynes may seem superficially

persuasive, but as soon as you try to pin things down, they make no sense.

The standard maximum entropy approach is to find the distribution

with maximum entropy subject to the expectations of certain functions

having given values. Supposedly, one has "observed"

that the expectations have these values. But the probability

distribution in question is also said to be subjective, in the sense that

it reflects the knowledge/beliefs of some particular person. You

can't "observe" expectations with respect to a subjective distribution,

especially since you haven't fixed this distribution yet (else you wouldn't

need maximum entropy).
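
For readers who want to see the computation being criticized, here is a minimal sketch of the "standard" procedure: maximize entropy subject to a fixed expectation. The example (a six-sided die constrained to have mean 4.5) is an assumption chosen for illustration and does not appear in this post. It relies on the Lagrange-multiplier result that the maximizer has the form p_i proportional to exp(lam * i), and solves for lam by bisection; the function names are hypothetical.

```python
from math import exp

faces = [1, 2, 3, 4, 5, 6]

def maxent_pmf(lam):
    """Exponential-family PMF p_i proportional to exp(lam * i) over the faces."""
    weights = [exp(lam * x) for x in faces]
    z = sum(weights)
    return [w / z for w in weights]

def mean(pmf):
    return sum(x * p for x, p in zip(faces, pmf))

def solve_lambda(target_mean, lo=-10.0, hi=10.0, iters=200):
    """Bisection on lam; the mean of maxent_pmf(lam) is increasing in lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(maxent_pmf(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(4.5)       # assumed constraint: E[face] = 4.5
pmf = maxent_pmf(lam)

print("lambda:", lam)
print("pmf:   ", [round(p, 4) for p in pmf])
print("mean:  ", mean(pmf))
```

The sketch only shows the mechanics. It does not address the objection above, namely where the "observed" expectation of 4.5 is supposed to come from.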

>If it helps, we can focus the question by posing the specific example

>of coinflipping. If one observes 6 heads out of 10 flips of a

>potentially biased coin, then one can use standard Bayesian methods to

>infer the posterior probability distribution of Pr(heads)--i.e., the

>probability density for each value of Pr(heads)in the range (0, 1).

>If the prior distribution is uniform, then the posterior distribution

>would have a specific beta distribution form with a mode of .6.

>

>How would this be affected according to a Maximum Entropy approach?

Maximum entropy is not consistent with Bayesian methods. The maximum

entropy people seem to have realized this about ten years ago, and have

therefore more-or-less abandoned the method (rather quietly, however).

Radford Neal

----------------------------------------------------------------------------

Radford M. Neal rad...@cs.utoronto.ca

Dept. of Statistics and Dept. of Computer Science rad...@utstat.utoronto.ca

University of Toronto http://www.cs.utoronto.ca/~radford

----------------------------------------------------------------------------

Aug 25, 2004, 7:37:56 PM8/25/04

On 25 Aug 2004 18:39:03 GMT, rad...@cs.toronto.edu (Radford Neal)

wrote:

>In article <f6f3f967.04082...@posting.google.com>,

>John Uebersax <jsueb...@yahoo.com> wrote:

>

>>Could anyone kindly suggest online resources that give a simple,

>>clear, basic explanation of Maximum Entropy as it applies statistical

>>estimation?

>

>No. It's not possible to explain this method because it doesn't make

>any sense. The early early papers of Jaynes may seem superficially

>persuasive, but as soon as you try to pin things down, they make no sense.

See http://home.rochester.rr.com/jbxroads/interests/sci.stat.math/ for

a replay of a thread on this subject on this newsgroup, to which Neal

contributed considerably along these lines. The thread also contains

Rodriguez' rebuttal of these assertions.

I will leave the judgement of rigor to mathematicians. (Quis custodiet

ipsos custodes?) Judgments about usefulness are the prerogative of

engineers. As an engineer, I have found Jaynes and the works of those

building on his work to be enormously valuable. I am appalled at the

amount of religious fervor attending the criticism of Bayesian

methods. To me they are logical and extremely useful. If they are

flawed in rigor, perhaps a Laplace or Fourier to his Heaviside will

emerge.

John Bailey

http://home.rochester.rr.com/jbxroads/mailto.html

Aug 26, 2004, 12:17:22 PM8/26/04

John Uebersax wrote:

> Could anyone kindly suggest online resources that give a simple,

> clear, basic explanation of Maximum Entropy as it applies

> statistical estimation?

I am fond of the interpretation in this paper:

P.D. Grünwald and A.P. Dawid. Game theory, maximum entropy, minimum

discrepancy, and robust Bayesian decision theory. Annals of Statistics

32(4), pages 1367-1433, 2004. (http://www.cwi.nl/~pdg/ftp/AOS231.pdf)

Basically, entropy is very much like a loss function: it is the

expected negative log-likelihood of a sample from a probability mass function;

differential entropy is the expected negative log-density of a

sample from a PDF. Say you have a number of probabilistic models, and

you're unsure of which one to pick. The right thing to do is to pick

the most timid one, the one that has the highest entropy of all that

are consistent with the constraints. The Gaussian distribution is the

maximum entropy distribution if your constraints are the first two

moments, for example.
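
The Gaussian claim can be illustrated with closed-form differential entropies. The sketch below compares three distributions with the same variance; the choice of the uniform and Laplace distributions as competitors, and sigma = 1, are assumptions made for illustration. Entropies are in nats.

```python
from math import e, log, pi, sqrt

sigma = 1.0   # common standard deviation for all three distributions

# Normal(mu, sigma^2): h = 0.5 * ln(2 * pi * e * sigma^2)
h_normal = 0.5 * log(2 * pi * e * sigma ** 2)

# Uniform of width w has variance w^2 / 12, so w = sigma * sqrt(12); h = ln(w)
h_uniform = log(sigma * sqrt(12))

# Laplace with scale b has variance 2 * b^2, so b = sigma / sqrt(2); h = 1 + ln(2b)
h_laplace = 1 + log(2 * sigma / sqrt(2))

print("normal: ", h_normal)
print("laplace:", h_laplace)
print("uniform:", h_uniform)
```

With the mean and variance fixed, the normal has the largest entropy of the three, which is the sense in which it is the "most timid" choice under those constraints.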

You can view maximum entropy as the timid mirror face of the bold

maximum likelihood. Most models have both faces: we find maximum

likelihood parameters of maximum entropy distributions. Or, you can

maximize the entropy given likelihood-maximizing constraints. If you

will excuse my use of esotericism: MaxEnt is yin, maximum likelihood is

yang.

> If it helps, we can focus the question by posing the specific example

> of coinflipping. If one observes 6 heads out of 10 flips of a

> potentially biased coin, then one can use standard Bayesian methods to

> infer the posterior probability distribution of Pr(heads)--i.e., the

> probability density for each value of Pr(heads) in the range (0, 1).

> If the prior distribution is uniform, then the posterior distribution

> would have a specific beta distribution form with a mode of .6.

>

> How would this be affected according to a Maximum Entropy approach?

With the constraint of having two outcomes, MaxEnt would yield you the

PMF of [0.5,0.5]. You could use this as a prior in a Bayesian

approach. There is a general habit of using maximum entropy priors in

Bayesian statistics.

A more interesting example is having two coins. The maximum entropy

model would assume the coins to be independent. As you may know,

loglinear models fitted with a GIS procedure have the property of

having maximum entropy given the marginals as constraints. E.g.:

Good, I. J. (1963). Maximum entropy for hypothesis formulation. The

Annals of Mathematical Statistics, 34, 911-934.
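
The two-coin statement can be checked numerically with a rough sketch. Assume the marginals P(coin 1 = H) = p and P(coin 2 = H) = q are fixed; the 2x2 joint then has one free parameter a = P(both heads). The values p = 0.6 and q = 0.3, the grid search, and the helper names are illustrative assumptions (this is not the GIS procedure mentioned above).

```python
from math import log

def entropy(pmf):
    """Shannon entropy in nats, skipping zero-probability cells."""
    return -sum(p * log(p) for p in pmf if p > 0)

def joint(a, p, q):
    """2x2 joint PMF [HH, HT, TH, TT] with marginals p, q and P(HH) = a."""
    return [a, p - a, q - a, 1 - p - q + a]

p, q = 0.6, 0.3

# Feasible range of a that keeps all four cells non-negative.
lo = max(0.0, p + q - 1)
hi = min(p, q)

steps = 10000
candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
best_a = max(candidates, key=lambda a: entropy(joint(a, p, q)))

print("entropy-maximizing P(HH):", best_a)
print("independence value p*q:  ", p * q)
```

Up to the grid resolution, the entropy-maximizing joint is the product of the marginals, i.e. the independent model.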

--

mag. Aleks Jakulin

http://www.ailab.si/aleks/

Artificial Intelligence Laboratory,

Faculty of Computer and Information Science,

University of Ljubljana,

Slovenia.

Aug 27, 2004, 3:26:12 AM8/27/04

Radford Neal wrote:

> Maximum entropy is not consistent with Bayesian methods.

You should interpret maximum entropy as decision-theoretic model/prior

selection, which follows or precedes Bayesian modelling. There are

several practical implementations that sample around the _constrained_

posterior space, picking the maximum entropy model found. The maximum

entropy model is interpreted as the most robust, the smoothest, the

worst-case optimal.

> The maximum

> entropy people seem to have realized this about ten years ago, and have

> therefore more-or-less abandoned the method (rather quietly,

> however).

It is my impression that MaxEnt has been making a comeback over the past two

years, judging from conferences such as ICML and UAI.

Aug 29, 2004, 11:13:35 AM8/29/04

>> If it helps, we can focus the question by posing the specific example

>> of coinflipping. If one observes 6 heads out of 10 flips of a

>>

>> How would this be affected according to a Maximum Entropy approach?

In article <cgl2ei$mso$1...@planja.arnes.si>,

Aleks Jakulin <a_jakulin@@hotmail.com> wrote:

>

>With the constraint of having two outcomes, MaxEnt would yield you the

>PMF of [0.5,0.5].

Why? The data is 6 heads out of 10, not 5 out of 10. Shouldn't the

best distribution pay at least a little attention to the data?

I suspect you have in mind some sort of frequentist test that fails to

reject the null hypothesis that the distribution is [0.5,0.5], and

since this distribution has maximum entropy, you go for it. Of

course, you WILL reject this null if you set the significance level

high enough. So the maximum entropy result, often touted as being an

"objective" solution to the problem, actually depends on a totally

arbitrary choice of significance level.
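
The dependence on the significance level can be made concrete with a small sketch. It assumes an exact two-sided binomial test of the null Pr(heads) = 0.5, with the p-value defined as the total probability of outcomes no more likely than the observed one; that definition, the function name, and the three levels tried are illustrative assumptions rather than anything specified in this thread.

```python
from math import comb

def two_sided_p_value(heads, flips, p_null=0.5):
    """Exact binomial p-value: mass of outcomes no more likely than the observed one."""
    probs = [
        comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
    ]
    observed = probs[heads]
    # Small relative tolerance so that equally likely outcomes are included.
    return sum(pr for pr in probs if pr <= observed * (1 + 1e-12))

p_value = two_sided_p_value(6, 10)
print("p-value for 6 heads in 10 flips:", p_value)

for alpha in (0.05, 0.5, 0.8):
    decision = "reject" if p_value < alpha else "do not reject"
    print(f"alpha = {alpha}: {decision} [0.5, 0.5]")
```

Here the p-value is 772/1024, about 0.754, so the null survives at conventional levels and is rejected only for a significance level above that, which is the arbitrariness being pointed out.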

Radford Neal

Aug 29, 2004, 1:44:13 PM8/29/04

> Aleks Jakulin <a_jakulin@@hotmail.com> wrote:

> >

> >With the constraint of having two outcomes, MaxEnt would yield you the

> >PMF of [0.5,0.5].

Radford Neal responded:

> Why? The data is 6 heads out of 10, not 5 out of 10. Shouldn't the

> best distribution pay at least a little attention to the data?

MaxEnt isn't very useful for the coin toss example. What would be a

non-trivial constraint derived from the data that would not determine

the PMF? All I can imagine MaxEnt to do in this case is

a) (maximum entropy prior) if you apply it to determining the

parameters of a Beta prior, it would prescribe (any) _symmetric_

Beta prior (should your constraint be a Beta distribution with a

certain count). I've had rather bad results with Bernardo's reference

priors. Has anyone seen any practitioner seriously using those?

b) (maximum entropy model) [0.5,0.5] (in the absence of sensible

constraints) and

c) (maximum entropy posterior) of all the posterior PMFs, select the

one with the maximum entropy. So, under c) one wouldn't integrate over

the prior, and would not use the bold MAP

h' = argmax_h P(h|d)

but instead use the timid MEP as in

h' = argmax_h H(P(D|h)), where h is a particular hypothesis with a

non-zero posterior: P(h|d) > 0.

Here H(P(D|h)) = -Sum[ P(x|h) log P(x|h); x \in D ] (D is the input space, x

is a point in D, d is the given input sample {x1,x2,...,xn}).

For this to be non-trivial, the prior should be zero for some

parameter settings.
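
A toy sketch of option c) on the coin data, to show how MAP and MEP can differ. Everything specific here is an assumption for illustration: the four candidate values of Pr(heads), a prior that puts zero mass on 0.5 (so the result is non-trivial, as noted above), and H taken as the entropy of a single flip under each hypothesis.

```python
from math import log

def bernoulli_entropy(theta):
    """Entropy (nats) of one coin flip with Pr(heads) = theta."""
    return -sum(p * log(p) for p in (theta, 1 - theta) if p > 0)

# Hypothetical hypothesis set; the prior rules out theta = 0.5.
thetas = [0.5, 0.55, 0.6, 0.9]
prior = [0.0, 1 / 3, 1 / 3, 1 / 3]
heads, tails = 6, 4

# Posterior P(h|d) proportional to prior times binomial likelihood.
unnorm = [pr * th ** heads * (1 - th) ** tails for th, pr in zip(thetas, prior)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# Bold choice: the hypothesis with the highest posterior probability.
map_h = max(zip(thetas, posterior), key=lambda tp: tp[1])[0]

# Timid choice: the highest-entropy hypothesis among those with P(h|d) > 0.
supported = [th for th, post in zip(thetas, posterior) if post > 0]
mep_h = max(supported, key=bernoulli_entropy)

print("MAP hypothesis:", map_h)
print("MEP hypothesis:", mep_h)
```

With these assumed numbers MAP selects 0.6, matching the data, while MEP selects 0.55, the allowed hypothesis closest to a fair coin.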

> I suspect you have in mind some sort of frequentist test that fails to

> reject the null hypothesis that the distribution is [0.5,0.5], and

> since this distribution has maximum entropy, you go for it.

Not this time. :)

Aleks

Aug 31, 2004, 7:29:00 PM8/31/04

Hi folks-

Radford Neal wrote:

>

> No. It's not possible to explain this method because it doesn't make

> any sense.

Actually, I think it makes plenty of sense. It simply solves a class

of problems that never arises in practice. (Well, maybe not quite

"never.")

> They standard maximum entropy approach is to find the distribution

> with maximum entropy subject to the expectations of certain functions

> having given values. Supposedly, one is supposed to have "observed"

> that the expectations have these values. But the probability

> distribution in question is also said to subjective, in the sense that

> it reflects the knowledge/beliefs of some particular person. You

> can't "observe" expectations with respect to a subjective distribution,

> especially since you haven't fixed this distribution yet (else you wouldn't

> need maximum entropy).

Yes, this is the crux of the issue. I think that if somehow one

has nontrivial "testable information" (as Jaynes called it), the

MaxEnt approach to assigning a prior based on this info is sound

(my favorite derivation is that of Shore & Johnson, in part because

of how explicitly it notes that one must still have a prior

dist'n as input to MaxEnt). It seems to me that in some settings

it might be possible to have useful testable information (perhaps

from a symmetry argument or a theoretical calculation in a

physics setting). But unfortunately, none of the examples

commonly given in a MaxEnt analysis really have testable information.

> Maximum entropy is not consistent with Bayesian methods. The maximum

> entropy people seem to have realized this about ten years ago, and have

> therefore more-or-less abandoned the method (rather quietly, however).

Well, I think Radford is here referring to MaxEnt image analysis.

I think in that setting entropy-based priors have no fundamental

justification (which is not to say that they might not prove

useful in some settings), and have been largely abandoned by their

earliest and most vocal advocates. But this is different from

using the principle of maximum entropy to assign probability

distributions using testable information. (One of the problems

with "MaxEnt methods" is that multiple methods share the same

name!) This latter application appears valid to me and

provides a missing piece of the Bayesian puzzle in a particular

class of problems. I've just never seen a real problem arise

in that class.

Cheers,

Tom

--

To respond by email, replace "somewhere" with "astro" in the

return address.
