
Mar 28, 2002, 10:33:24 PM

I'm interested in locating fundamental work in maximum entropy imputation

for simple data tables.

I've done a Google search for papers on imputation via maximum entropy but

found virtually nothing except a paper from Russia which seems to be

concerned primarily with longitudinal statistics or process data.

Mar 29, 2002, 7:51:29 AM

Maximum entropy is implied for any technique using Bayesian Inference,

nicht wahr?

I somewhat casually selected these by screening a search using the

keyword *imputation*

Missing Data, Censored Data, and Multiple Imputation

http://cm.bell-labs.com/cm/ms/departments/sia/project/mi/index.html

Bayesian Statistics

http://cm.bell-labs.com/cm/ms/departments/sia/project/bayes/index.html

Multiple Imputation

http://www.stat.ucla.edu/~mhu/impute.html

"Multiple Imputation for Missing Data: Concepts and New Development"

http://www.sas.com/rnd/app/papers/multipleimputation.pdf

Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys,

New York: John Wiley & Sons, Inc.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, New

York: Chapman and Hall

http://www.sas.com/rnd/app/da/new/pdf/dami.pdf

Multiple Imputation References

http://www.statsol.ie/solas/sorefer.htm

John

Mar 29, 2002, 10:14:36 AM

Perhaps something called the "Berg" method of maximum entropy might fill the

bill. It is used in signal processing.

Mar 29, 2002, 3:27:04 PM

John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,

> nicht wahr?

No, I don't think so. Why would you say that? -- Mike Hardy

Mar 29, 2002, 6:18:08 PM

On 29 Mar 2002 20:27:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:

>

>> Maximum entropy is implied for any technique using Bayesian Inference,

>> nicht wahr?

>

>

> No, I don't think so. Why would you say that? -- Mike Hardy

Except for the "No, I don't think so," I would have thought you were

being sarcastic, that I had stated the obvious.

Let me say it a different way. Are there statistical techniques for

obtaining maximum entropy estimates which are not Bayesian?

Are they sufficiently well known as to be suitable in a general search

of the web for references?

John

Mar 29, 2002, 6:46:33 PM

On Fri, 29 Mar 2002 23:18:08 GMT, jmb...@frontiernet.net (John Bailey)

wrote:

>Let me say it a different way. Are there statistical techniques for

>obtaining maximum entropy estimates which are not Bayesian?

>

>Are they sufficiently well known as to be suitable in a general search

>of the web for references?


http://www-2.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html

barely mentions Bayesian techniques, except for a throwaway line that

Bayesians use "fuzzy maximum entropy".

I don't fully follow how maximum entropy works, but might it be

possible to use maximum likelihood techniques in some cases?
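For what it's worth, the finite-support case that Jaynes popularized (the Brandeis dice problem) is simple enough to sketch, and it shows what the machinery does: the maximum-entropy distribution subject to a mean constraint has the exponential-family form p_i proportional to exp(lambda * x_i), where lambda is the Lagrange multiplier, found numerically. A minimal illustration (the function name and the bisection approach are illustrative, not taken from the tutorial above):

```python
import math

def maxent_dist(support, target_mean, tol=1e-12):
    """Maximum-entropy distribution on a finite support subject to a
    mean constraint.  The solution is p_i proportional to
    exp(lam * x_i); lam is found by bisection, since the implied
    mean is monotone increasing in lam."""
    support = list(support)

    def mean_for(lam):
        w = [math.exp(lam * x) for x in support]
        z = sum(w)
        return sum(x * wi for x, wi in zip(support, w)) / z

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * x) for x in support]
    z = sum(w)
    return [wi / z for wi in w]

# Brandeis dice: faces 1..6 constrained to average 4.5 -- the
# maxent probabilities increase monotonically with the face value
p = maxent_dist(range(1, 7), 4.5)
```

Note that no prior and no use of Bayes' theorem appears anywhere in this calculation, which bears on the question above: the constrained optimization stands on its own.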

Mar 29, 2002, 9:33:09 PM

On Fri, 29 Mar 2002 23:46:33 +0000 (UTC), se...@btinternet.com (Henry)

wrote:


from http://www.aas.org/publications/baas/v32n4/aas197/352.htm

A Maximum-Entropy Approach to Hypothesis Testing: An

Alternative to the p-Value Approach

P.A. Sturrock (Stanford University)

(quoting)

In problems of the Bernoulli type, an experiment or observation yields

a count of the number of occurrences of an event, and this count is

compared with what is to be expected on the basis of a specified and

unremarkable hypothesis. The goal is to determine whether the results

support the specified hypothesis, or whether they indicate that some

extraordinary process is at work. This evaluation is often based on

the ``p-value" test according to which one calculates, on the basis of

the specific hypothesis, the probability of obtaining the actual

result or a ``more extreme" result. Textbooks caution that the p-value

does not give the probability that the specific hypothesis is true,

and one recent textbook asserts ``Although that might be a more

interesting question to answer, there is no way to answer it."

The Bayesian approach does make it possible to answer this question.

As in any Bayesian analysis, it requires that we consider not just one

hypothesis but a complete set of hypotheses. This may be achieved very

simply by supplementing the specific hypothesis with the

maximum-entropy hypothesis that covers all other possibilities in

a way that is maximally non-committal. This procedure yields an

estimate of the probability that the specific hypothesis is true. This

estimate is found to be more conservative than that which one might

infer from the p-value test.

(end quote)
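Sturrock's comparison is easy to reproduce in miniature for coin tossing. Supplement the specific hypothesis theta = 1/2 with the uniform alternative on [0,1] (the maximally non-committal choice for a single proportion), give the two hypotheses even prior odds (an assumption here, for illustration only), and compare the resulting posterior probability of the specific hypothesis with the one-sided p-value:

```python
from math import comb

def pvalue_vs_posterior(n, c):
    """One-sided p-value for c successes in n Bernoulli(1/2) trials,
    versus the posterior probability of theta = 1/2 when the
    alternative is theta ~ Uniform(0, 1), at even prior odds."""
    # p-value: probability of c or more successes under theta = 1/2
    p_value = sum(comb(n, k) for k in range(c, n + 1)) / 2.0 ** n
    # marginal likelihood of the observed count under each hypothesis
    m0 = comb(n, c) / 2.0 ** n        # theta = 1/2 exactly
    m1 = 1.0 / (n + 1)                # binomial integrated over theta
    posterior_h0 = m0 / (m0 + m1)     # Bayes' rule at even prior odds
    return p_value, posterior_h0

p_value, posterior_h0 = pvalue_vs_posterior(100, 60)
```

For 60 heads in 100 tosses the p-value is about 0.03, nominally significant at the 5% level, while the specific hypothesis still keeps roughly half its posterior probability: exactly the conservatism the abstract describes.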

Mar 30, 2002, 5:19:02 PM

John Bailey (jmb...@frontiernet.net) wrote:

> Maximum entropy is implied for any technique using Bayesian Inference,

> nicht wahr?

I answered:

> No, I don't think so. Why would you say that? -- Mike Hardy

He replied:

> Except for the "No, I don't think so," I would have thought you were

> being sarcastic, that I had stated the obvious.

I think you're very confused.

> Let me say it a different way. Are there statistical techniques

> for obtaining maximum entropy estimates which are not Bayesian?

Oh God. *First* you said Maximum entropy is "implied for" any

technique using Bayesian inference; now you seem to be saying the

exact converse of that --- the other way around. I seldom see anyone

write less clearly.

Bayesianism is the belief in, or use of, a degree-of-belief

interpretation of probability. Your *first* posting seemed to say

for any technique using Bayesian inference, i.e., using a degree-of-

belief interpretation of probability, "maximum entropy is implied."

In fact it's perfectly routine to do Bayesian inference without ever

thinking about entropy at all. Your *next* posting seems to say,

*not* that Bayesian inference implies maximum entropy, but the

reverse: that maximum entropy implies Bayesian inference. Which

did you mean? Or did you mean both? Why can't you be clear about

that?

To answer your question: algorithms for obtaining estimates do

not require degree-of-belief interpretations of probability; they

don't require frequency interpretations; they don't require any

interpretations at all. So they don't have to be Bayesian. And I

doubt that any of them are in any way Bayesian, even if they are

relied on in doing Bayesian inference.

Mike Hardy

Mar 30, 2002, 5:24:15 PM

> John Bailey (jmb...@frontiernet.net) wrote:

>

>> Maximum entropy is implied for any technique using Bayesian Inference,

>> nicht wahr?

No. Bayesian inference and maximum entropy methods as originally

defined are in fact incompatible. This is hardly surprising - if

you're setting probabilities by maximum entropy, it would be a big

coincidence if they turned out to be the same as what one would get by

a completely different method.

You're probably confused by the tendency of many of the old maximum

entropy advocates to claim that they were Bayesians. This just shows

that words and reality don't necessarily match.

More recently, the maximum entropy folks have pretty much abandoned

the old version of maximum entropy in favour of Bayesian methods using

priors that are defined in terms of entropy functions. This is

incompatible with the old maximum entropy methods. These priors may

be useful now and then, but there's no reason to limit yourself to them.

Radford Neal

Mar 30, 2002, 8:30:15 PM

On 30 Mar 2002 22:19:02 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:

>> Let me say it a different way. Are there statistical techniques

>> for obtaining maximum entropy estimates which are not Bayesian?

> Oh God. *First* you said Maximum entropy is "implied for" any

>technique using Bayesian inference; now you seem to be saying the

>exact converse of that --- the other way around. I seldom see anyone

>write less clearly.

(That coming from someone whose first post was so obscure everyone

missed his point.)

(snipped)

> To answer your question: algorithms for obtaining estimates do

>not require degree-of-belief interpretations of probability; they

>don't require frequency interpretations; they don't require any

>interpretations at all. So they don't have to be Bayesian. And I

>doubt that any of them are in any way Bayesian, even if they are

>relied on in doing Bayesian inference.

I think the cause of the confusion is that my focus was the pragmatic

challenge of finding appropriate key words or phrases for an effective

web search and you are hung up on the religious aspects of Bayesian

vs. frequentist theology.

Just in case anyone listening in has doubts about this aspect of a

perfectly innocent estimation technique I commend the following

presentation:

http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933

Too bad it's in PowerPoint, but I recommend it anyway.

Here is an excerpt:

Bayesian/Frequentist results approach mathematical identity only if:

BPT uses priors with high degree of ignorance,

there are sufficient statistics, and

FPT distribution depends only on the sufficient statistic, and that it

is randomly distributed about the true value.

This convergence is seen as coincidental. (end quote)

John

Mar 31, 2002, 3:18:04 PM

John Bailey (jmb...@frontiernet.net) wrote:

> I think the cause of the confusion is that my focus was the pragmatic

> challenge of finding appropriate key words or phrases for an effective

> web search and you are hung up on the religious aspects of Bayesian

> vs. frequentist theology.

>

> Just in case anyone listening in has doubts about this aspect of a

> perfectly innocent estimation technique I commend the following

I don't think anyone in this thread questioned any aspects of

any estimation technique.

Look: Most practitioners of Bayesian inference probably do not

know what entropy is. That appears to contradict what you said in

your posting that I first answered. Can you dispute that?

> presentation:

> http://www.google.com/url?sa=U&start=22&q=http://umaxp1.physics.lsa.umich.edu/~kelly/bayes/intro_talk.ppt&e=933

> Too bad it's in PowerPoint, but I recommend it anyway.

> Here is an excerpt:

> Bayesian/Frequentist results approach mathematical identity only if:

> BPT uses priors with high degree of ignorance,

> there are sufficient statistics, and

> FPT distribution depends only on the sufficient statistic, and that it

> is randomly distributed about the true value.

> This convergence is seen as coincidental. (end quote)

I have no idea what kind of software would be needed to read this

document, so at this point it's entirely illegible to me. What do you

mean by "BPT" and "FPT"?

Mike Hardy

Mar 31, 2002, 5:32:51 PM

On 31 Mar 2002 20:18:04 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

> John Bailey (jmb...@frontiernet.net) wrote:

>

>> I think the cause of the confusion is that my focus was the pragmatic

>> challenge of finding appropriate key words or phrases for an effective

>> web search and you are hung up on the religious aspects of Bayesian

>> vs. frequentist theology.

> Look: Most practitioners of Bayesian inference probably do not

>know what entropy is. That appears to contradict what you said in

>your posting that I first answered. Can you dispute that?

>

I will definitely dispute the first part. My first professional use

of Bayesian methodology was in 1960 using seminal work of C. K. Chow,

where it was indispensable for the final design of an Optical

Character Reader for RCA. My understanding of theory was updated in

the 80s by working with Myron Tribus, of Dartmouth fame and needing to

assimilate his use of maximum entropy methods as defined in his book

Rational Descriptions, Decisions and Designs. In that period we made

extensive use of Bayesian statistics in test design and interpretation

for high end Xerox reprographic machines. Ron Howard and Howard

Raiffa of Stanford were big guns who kept us on track in our

application of theory. I suppose there may be *practitioners of

Bayesian inference who are weak on the concept of entropy* but it is

clearly and unambiguously a part of the theory of its use.

Another worthwhile web reference I uncovered recently is:

http://xyz.lanl.gov/abs/hep-ph/9512295

Probability and Measurement Uncertainty in Physics - a Bayesian

Primer by G. D'Agostini (quoting from the abstract:)

The approach, although little known and usually misunderstood among

the High Energy Physics community, has become the standard way of

reasoning in several fields of research and has recently been adopted

by the international metrology organizations in their recommendations

for assessing measurement uncertainty. (end quote)

> I have no idea what kind of software would be needed to read this

>document, so at this point it's entirely illegible to me. What do you

>mean by "BPT" and "FPT"?

The document was posted from Microsoft Office presentation software

called PowerPoint. It's unfortunate that the document is not available

in a more neutral format, but I am sending you a print version

rendered by processing his presentation through adobe acrobat, pdf

format.

BPT is the author's shorthand for Bayesian Probability Theory and FPT

is shorthand for Frequentist Probability Theory.

John

Apr 1, 2002, 1:21:18 PM

> > Look: Most practitioners of Bayesian inference probably do not

> >know what entropy is. That appears to contradict what you said in

> >your posting that I first answered. Can you dispute that?

>

>

> I will definitely dispute the first part. My first professional use

> of Bayesian methodology was in 1960 using seminal work of C. K. Chow,

> where it was indispensable for the final design of an Optical Character

> Reader for RCA. My understanding of theory was updated in the 80s by

> working with Myron Tribus, of Dartmouth fame and needing to assimilate

> his use of maximum entropy methods as defined in his book Rational

> Descriptions, Decisions and Designs. In that period we made extensive

> use of Bayesian statistics in test design and interpretation for high

> end Xerox reprographic machines. Ron Howard and Howard Raiffa of

> Stanford were big guns who kept us on track in our application of

> theory. I suppose there may be *practitioners of Bayesian inference

> who are weak on the concept of entropy* but it is clearly and

> unambiguously a part of the theory of its use.


I don't doubt that people you worked with are familiar with

entropy, nor that some people who do Bayesian inference use entropy,

but it is perfectly obvious that such familiarity is not needed in

order to do Bayesian inference. Why do you call it "clearly and

unambiguously a part of the theory of its use"?

Mike Hardy

Apr 1, 2002, 7:04:58 PM

Sorry. In a recent post on this subject I mentioned "Berg's maximum entropy

method"

I was incorrect it is "Burg's maximum entropy method". This makes a

difference in that Berg is involved in the entropy / Bayes arguments but Burg

is not. Burg's insight concerns estimation of the amplitude of poorly

sampled low frequency phenomena and is used a lot in signal processing. It

has turned out in practice to be reasonably useful and robust even though the

assumptions are merely "plausible" rather than proven to be necessary and

sufficient. I have not kept up with the evolution of Burg's insights over the

past decade and would appreciate some comments on where it has all led.

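For readers who have not met it: Burg's method fits an autoregressive model to a record by choosing, stage by stage, the reflection coefficient that minimizes the summed forward and backward prediction-error power; by Cauchy-Schwarz the coefficient satisfies |k| <= 1, so the model is stable even for short records. A rough sketch of the recursion (variable names mine; this is the textbook form, not any particular published implementation):

```python
def burg(x, order):
    """Burg maximum-entropy (autoregressive) fit.  Returns the AR
    coefficients a_1..a_p of the model
        x[t] + a_1*x[t-1] + ... + a_p*x[t-p] = e[t]
    and the final prediction-error power."""
    x = [float(v) for v in x]
    n = len(x)
    f = x[1:]                        # forward prediction errors
    b = x[:-1]                       # backward prediction errors
    a = []                           # AR coefficients so far
    e = sum(v * v for v in x) / n    # error power at order 0
    for _ in range(order):
        num = -2.0 * sum(fi * bi for fi, bi in zip(f, b))
        den = sum(fi * fi for fi in f) + sum(bi * bi for bi in b)
        k = num / den                # reflection coefficient, |k| <= 1
        # Levinson update of the AR polynomial
        a = [ai + k * aj for ai, aj in zip(a, reversed(a))] + [k]
        e *= (1.0 - k * k)
        # update and shorten the error sequences
        f, b = ([fi + k * bi for fi, bi in zip(f[1:], b[1:])],
                [bi + k * fi for fi, bi in zip(f[:-1], b[:-1])])
    return a, e

# demo: recover an AR(1) process x[t] = 0.9*x[t-1] + noise
import random
random.seed(1)
x = [0.0]
for _ in range(1000):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))
a, e = burg(x, 2)    # a[0] should land near -0.9, a[1] near 0
```

The "maximum entropy" label attaches because, among all spectra consistent with the first p+1 autocorrelation lags, the fitted AR(p) spectrum maximizes the entropy rate of a Gaussian process, which is Burg's original result.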

Apr 1, 2002, 7:14:35 PM

On 01 Apr 2002 18:21:18 GMT, mjh...@mit.edu (Michael J Hardy) wrote:

>> > Look: Most practitioners of Bayesian inference probably do not

>> >know what entropy is. That appears to contradict what you said in

>> >your posting that I first answered. Can you dispute that?

In an earlier post, John Bailey's response to Hardy's statement was:

>> I will definitely dispute the first part.

>> I suppose there may be *practitioners of Bayesian inference

>> who are weak on the concept of entropy* but it is clearly and

>> unambiguously a part of the theory of its use.

>

Mike Hardy then replied:

> I don't doubt that people you worked with are familiar with

>entropy, nor that some people who do Bayesian inference use entropy,

>but it is perfectly obvious that such familiarity is not needed in

>order to do Bayesian inference. Why do you call it "clearly and

>unambiguously a part of the theory of its use"?

My exposures to Bayesian methodology have all included a discussion

of how to determine a neutral Bayesian prior, and of the use of maximum

entropy as a means to that end.

John

Apr 1, 2002, 10:53:55 PM

mike:

since this thread seemed unusually aggressive and defensive, and since i am

a practitioner of bayesian inference who had never heard "entropy" associated

with that practice, i found it sufficiently interesting to do a little

checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.

it was, at least, a relief to know that i had not simply slept through a key

topic.

however, since that absence seemed strange--for a line of inquiry that could

be described by someone else as "clearly and unambiguously . . ."--i checked

a little more and found dozens of references to entropy in some of

my--regrettably ill-used--books on digital signals processing. entropy seems

particularly well-associated with optical signals compression,

decompression, reading, and reproduction specifically because there is a

high value assigned to maximum loss. for example, if i didn't have to

compress everything, i could potentially save a lot. there would be an

associated cost at decompression. that sounds like a field where one might

find some bayesians.

then i stopped to think, "none of my textbooks are called anything like

Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i

was just thinking in the wrong part of the box. unfortunately, the book is

out of print, and sells used at amazon for $176. (makes me wonder what the

original price was. maybe i'll buy it anyway. i don't know of many used

textbooks that appreciate in price.)

it's hard to be sure, but i suspect that if you think in terms of rational

decision making, you will realize that there was a lot of merit, albeit

sensitive to context, in the other position. you may also find that you are

the perfect person to write the next "bridge" text on the use of bayesian

inference in decision making.

Michael J Hardy <mjh...@mit.edu> wrote in message

news:3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu...

Apr 2, 2002, 8:58:43 AM

On Tue, 02 Apr 2002 03:53:55 GMT, "James Beck"

<james....@verizon.net> wrote:


>mike:

>

>since this thread seemed unusually aggressive and defensive, and since i am

>a practioner of bayesian inference who had never heard "entropy" associated

>with that practice, i found it sufficiently interesting to do a little

>checking. none of my bayesian textbooks refer to entropy, at all. . . .huh.

>it was, at least, a relief to know that i had not simply slept through a key

>topic.

It's chapter 11 of Jaynes' book.

http://omega.albany.edu:8008/ETJ-PS/cc11g.ps

>then i stopped to think, "none of my textbooks are called anything like

>Rational Descriptions, Decisions, and Designs (Tribus)," either, so maybe i

>was just thinking in the wrong part of the box. unfortunately, the book is

>out of print, and sells used at amazon for $176. (makes me wonder what the

>original price was. maybe i'll buy it anyway. i don't know of many used

>textbooks that appreciate in price.)

>

The price of Tribus' text, published in 1969 (!) may be an indication

of how far ahead his thinking was or how little work went on in the

field until recently.

>it's hard to be sure, but i suspect that if you think in terms of rational

>decision making, you will realize that there was a lot of merit, albeit

>sensitive to context, in the other position. you may also find that you are

>the perfect person to write the next "bridge" text on the use of bayesian

>inference in decision making.

It does appear there is an information gap here. Information

arbitrage required?

Between Tribus' book (my copy of which I went to some lengths to

acquire after my first copy was borrowed and never returned), Ron

Howard's book (Dynamic Programming and Markov Processes) and Howard

Raiffa's book (Decision Analysis), it would be a lot of work to push

ahead into anything new. A quick review of

http://www-zeus.roma1.infn.it/~agostini/prob+stat.html including some

of his reprints at:

http://lanl.arXiv.org/find/physics/1/au:+DAgostini_G/0/1/0/all/0/1

suggests that D'Agostini might be a good author for such a book.

Finally, I need to credit Carlos Rodriguez <car...@math.albany.edu>

for his

Maximum Entropy Online Resources

http://omega.albany.edu:8008/maxent.html

Apr 3, 2002, 10:17:53 AM

rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Mar30.1...@jarvis.cs.toronto.edu>...

Let me add some more heat, uncertainty, entropy and time to this

discussion...

I can easily envision myself wasting a google amount of time fighting

windmills over the meaning of probability and entropy... so I'll be

brief.

Please go ahead, make my day and click me!....

http://omega.albany.edu:8008/

I know that Radford is a wff (well-(in)formed-fellow): Just look at

his 93 review of MCMC (e.g. http://omega.albany.edu:8008/neal.pdf).

BUT I TOTALLY disagree with his last paragraph:

> More recently, the maximum entropy folks have pretty much abandoned

> the old version of maximum entropy in favour of Bayesian methods using

> priors that are defined in terms of entropy functions. This is

> incompatible with the old maximum entropy methods. These priors may

> be useful now and then, but there's no reason to limit yourself to them.

>

> Radford Neal

By ME folks, he means it literally. By "pretty much

abandoned...functions.", he means

http://omega.albany.edu:8008/0201016.pdf

This is NOT incompatible with the old maximum entropy methods,

(just take alpha LARGE and maximum a posteriori becomes maximum entropy

the old-fashioned way).

Entropic priors are not only Re-volutionary, they are E-volutionary!

By "These priors may... to them". He means,

I want to be free to continue using my convenience priors, so I will

continue ignoring the fact that entropic priors are maximally

non-committal with respect to missing information (thanks Ed!) but

just in case I'm missing something and

entropic priors are really as cool as you claim they are, I'll keep

them around.

As Jaynes discovered:

"First they'll say that it is wrong. Then they'll say that it is not

wrong but irrelevant. And finally they'll say that it is right and

useful, but that they knew it a long time ago."

Apr 3, 2002, 11:54:29 AM

There are several ways to design uninformed priors, and maximum entropy

prior is one of them. So is maximum entropy prior superior to all other

kinds of uninformed priors in all applications?


Actually I know very little about maximum entropy. I have only glanced

through one book on maximum entropy method, and my former boss, whom I

regard as a Bayesian, never talked to me about maximum entropy.... and

I learn the Bayesian paradigm from him.

It would be nice if you can post some pointers (of tutorial type)

on ME method. Thank you.

Apr 3, 2002, 11:57:53 AM

"James A. Bowery" <jim_b...@hotmail.com> wrote in message news:<ua7oas4...@corp.supernews.com>...

The book, "Rational Descriptions, Decisions and Designs" describes how

the principle of maximum entropy should be used in connection with

Bayes' Equation for a variety of problems in several fields. This

book was originally published in 1969 by Pergamon Press. It went out

of print some time after. A couple of years ago Expira of Sweden

issued a reprint. Amazon indicates that a used version may be

purchased for $175. Expira sells the reprinted version for less than

$50. Write to Hakan Sodersved <in...@expira.se> for detailed

information.

Myron Tribus mtr...@earthlink.net

350 Britto Terrace, Fremont, CA 94539

Ph: (510) 651 3641 Fax: (510) 656 9875

The establishment always rejects new ideas for it is

composed of people who, having found some of the truth yesterday

believe they possess all of it today. (E. T. Jaynes)

Apr 3, 2002, 12:58:23 PM

In article <3ca8a51e$0$3940$b45e...@senator-bedfellow.mit.edu>,

> Mike Hardy

I agree with Mike. I consider the use of maximum entropy

to be an attempt to remove the prior from consideration,

and as such, it is only good if the results it gives are

similar.

Like other such methods as "non-informative" priors, etc.,

it is really anti-Bayesian. That something uses a formal

measure as a prior probability for reasons other than that

that measure is the user's prior, or gives good results for

the user's prior and loss function, does not justify it as

being a reasonable procedure.

--

This address is for information only. I do not claim that these views

are those of the Statistics Department or of Purdue University.

Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399

hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Apr 3, 2002, 1:32:09 PM

In article <3ca8f7b5...@news.frontiernet.net>,

>John

Statistics is not methodology. Treating it as such causes

people to use totally inappropriate procedures.

The first thing is to state the problem, and stating a

mathematically convenient formulation can be worse than

useless. Bayesian reasoning requires that the USER be

the provider of the loss-prior combination. Now one

might want to use something simpler if it can be proved

to be reasonably good.

So we can use least squares without normality, as the

Gauss-Markov Theorem tells us that the results are just

about as good without normality as with. This is not

true for using mathematically convenient but inappropriate

priors. Also, it is not how well the prior is approximated,

but how well the solution is.

Bayesian priors should not be "neutral", unless it can

be shown that not much is lost by using such a prior.

Conjugate priors, "uninformative" priors, maximum entropy

priors, as such are unjustified computational copouts.

Apr 3, 2002, 5:49:58 PM

In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:


>> More recently, the maximum entropy folks have pretty much abandoned

>> the old version of maximum entropy in favour of Bayesian methods using

>> priors that are defined in terms of entropy functions. This is

>> incompatible with the old maximum entropy methods. These priors may

>> be useful now and then, but there's no reason to limit yourself to them.

>>

>> Radford Neal

>By ME folks, he means it literally. By "pretty much

>abandoned...functions.", he means

>http://omega.albany.edu:8008/0201016.pdf

I've had a glance at this, though I can't say I've absorbed it all.

It does seem, however, that the example application to mixtures of

Gaussians produces rather strange results. According to equation

(78), the prior for the mean of a mixture component is more spread out

for rare components than for common components. Why would one want

this? Presumably, there's a problem somewhere where it's just the

right thing to do, but I don't think it's the right thing for most

problems. The argument that one should use this prior despite its

peculiar features because it is "maximally non-committal" in some sense

does not seem to me to be persuasive.

>This is NOT incompatible with the old maximum entropy methods,

>(just take alpha LARGE and maximum aposteriori becomes maximum entropy

>the old fashion way).

If I understand correctly, letting alpha go to infinity results in the

prior for the parameter being concentrated at a point. It was of

course always the case that if you found the maximum entropy

distribution and then specified your prior to be a point mass on this

distribution, then the methods were trivially "compatible". Once you

get into the details of how old "maximum entropy" methods actually

worked, however - such as how constraints on expectations were

obtained from sample means - it's clear that the way they produced a

result from the observed data is not compatible with the way a

Bayesian would produce a result by starting with a prior and

conditioning on observations.

Radford Neal

Apr 4, 2002, 8:50:45 AM

rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr3.1...@jarvis.cs.toronto.edu>...

> In article <c54f89f.02040...@posting.google.com>,

> Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>

> >> More recently, the maximum entropy folks have pretty much abandoned

> >> the old version of maximum entropy in favour of Bayesian methods using

> >> priors that are defined in terms of entropy functions. This is

> >> incompatible with the old maximum entropy methods. These priors may

> >> be useful now and then, but there's no reason to limit yourself to them.

> >>

> >> Radford Neal

>

> >By ME folks, he means it literally. By "pretty much

> >abandoned...functions.", he means

> >http://omega.albany.edu:8008/0201016.pdf

>

> I've had a glance at this, though I can't say I've absorbed it all.

> It does seem, however, that the example application to mixtures of

> Gaussians produces rather strange results. According to equation

> (78), the prior for the mean of a mixture component is more spread out

> for rare components than for common components. Why would one want

> this? Presumably, there's a problem somewhere where it's just the

> right thing to do, but I don't think it's the right thing for most

> problems. The argument that one should use this prior despite its

> peculiar features because it is "maximally non-committal" in some sense

> does not seem to me to be persuasive.

>

Radford, that's not a bug. That's a feature!

Unlike Microsoft, I can prove it (Theorem 1).

In fact whatever property this prior has, it is a product of your own

ignorance. I don't mean that pejoratively; I mean it logically. It is

that way because it is most difficult to discriminate from the

independent model between the data and the parameters. It is as blind

of the data as it can possibly be. It lets the data values speak for

themselves as much as it is mathematically possible.

When you say: "Presumably… for most problems" you are changing the

state of ignorance. If you realize that you do have more precise

information for the mean of the rare components in your particular

problem THEN you also realize that you FORGOT to include that

information either into h() or as a side condition. NOW, in the

absence of that information your best guess is to do as the entropic

prior says. That is a tautology, yes. Maximum Entropy is tautological,

yes. But that again is not a bug. That's a feature not only of MaxEnt

but of mathematics in general.

By the way what I am saying is not new. Ed Jaynes lost his voice

screaming at the windmills about it. Don't you agree, Myron?

I know it sounds like religion and snake oil, like getting something

(The prior) from nothing (Ignorance) and for that reason many,

otherwise fine minds, have rejected the whole thing as they reject

biblical fundamentalists. As a friend of mine says: If you don't see

it I can not explain it to you!

> >This is NOT incompatible with the old maximum entropy methods,

> >(just take alpha LARGE and maximum aposteriori becomes maximum entropy

> >the old fashion way).

>

> If I understand correctly, letting alpha go to infinity results in the

> prior for the parameter being concentrated at a point. It was of

> course always the case that if you found the maximum entropy

> distribution and then specified your prior to be a point mass on this

> distribution, then the methods were trivially "compatible". Once you

> get into the details of how old "maximum entropy" methods actually

> worked, however - such as how constraints on expectations were

> obtained from sample means - it's clear that the way they produced a

> result from the observed data is not compatible with the way a

> Bayesian would produce a result by starting with a prior and

> conditioning on observations.

>

> Radford Neal

Again an old confusion… there is even a "Theorem" by a student of

Isaac Levi, the philosopher from Columbia University.

Take alpha very large (not just infinity) or very few data or no data

then the posterior is still not a point but completely dominated by

entropy so maximum a posteriori equals maximum entropy.
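
The "old fashioned" maximum entropy being contrasted here — maximize
entropy subject to expectation constraints — has a standard toy case,
Jaynes' die with a constrained mean. A minimal Python sketch (the
mean-4.5 die is the textbook example, not code from any poster; the
solution has the exponential-family form p_i proportional to
exp(lam*i), with lam found by bisection):

```python
import math

FACES = [1, 2, 3, 4, 5, 6]

def maxent_die(target_mean, faces=FACES, tol=1e-12):
    """Maximum entropy distribution on die faces subject to a mean
    constraint.  The solution is p_i proportional to exp(lam * i);
    we solve for lam by bisection, since the constrained mean is
    increasing in lam."""
    def mean_for(lam):
        w = [math.exp(lam * f) for f in faces]
        z = sum(w)
        return sum(f * wi for f, wi in zip(faces, w)) / z
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(lam * f) for f in faces]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_die(4.5)  # Jaynes' example: a die whose average roll is 4.5
```

With the constraint at 3.5 (the unconstrained mean) the result is
uniform, as the maximum entropy principle requires.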

Apr 4, 2002, 2:57:55 PM4/4/02

to

Radford Neal:

>> >http://omega.albany.edu:8008/0201016.pdf

>>

>> I've had a glance at this, though I can't say I've absorbed it all.

>> It does seem, however, that the example application to mixtures of

>> Gaussians produces rather strange results. According to equation

>> (78), the prior for the mean of a mixture component is more spread out

>> for rare components than for common components. Why would one want

>> this? Presumably, there's a problem somewhere where it's just the

>> right thing to do, but I don't think it's the right thing for most

>> problems. The argument that one should use this prior despite its

>> peculiar features because it is "maximally non-committal" in some sense

>> does not seem to me to be persuasive.

>>

Carlos C. Rodriguez <car...@math.albany.edu>:

>Radford, that's not a bug. That's a feature!

>Unlike Microsoft, I can prove it (Theorem 1).

>In fact whatever property this prior has, it is a product of your own

>ignorance. I don't mean that pejoratively I mean it logically. It is

>that way because it is most difficult to discriminate from the

>independent model between the data and the parameters. It is as blind

>of the data as it can possibly be. It lets the data values speak for

>themselves as much as it is mathematically possible.

Consider the problem in an example context: You are interested in

how far beetles travel during a day. With really advanced satellite

observation, you can track beetles flying around, but you can't

identify the species of beetle. You know there are five species of

beetle in a certain forest for which you have data. You therefore

model the distribution of distance travelled in a day as a mixture

of five normal distributions.
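
The mixture model just described can be sketched directly. A minimal
Python version — the weights, means, and spreads for the five
hypothetical species are invented for illustration, not taken from
the thread:

```python
import math
import random

def mixture_pdf(x, weights, means, sds):
    """Density of a Gaussian mixture: sum_k w_k * Normal(x; mu_k, sd_k)."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, sds))

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n distances: pick a species by its weight, then draw a
    normal travel distance for that species."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        out.append(rng.gauss(means[k], sds[k]))
    return out

# Invented numbers for five species (weight, mean daily distance, spread):
w = [0.4, 0.3, 0.15, 0.1, 0.05]
mu = [50.0, 120.0, 200.0, 350.0, 600.0]
sd = [10.0, 25.0, 40.0, 60.0, 120.0]
xs = sample_mixture(2000, w, mu, sd)
```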

Suppose we don't know much about how common the different species are,

or how much the beetles travel in a day - the situation to which you

say your method applies.

The result of your method is a prior which says that the less common

beetles are likely to travel very far in a day, or not very far at

all, whereas the more common beetles are likely to travel a more

moderate distance. This seems to drastically depart from a prior that

embodies no precise information. It seems to correspond to a very

specific biological theory claiming that rare species have to either

travel a lot in a day (to avoid being set upon by gangs of competing

beatles?), or alternatively, to stay put. In no way can I accept that

this is a prior that will "let the data values speak for themselves".

>Again an old confusion. There is even a "Theorem" by a student of

>Isaac Levi, the philosopher from Columbia University.

>Take alpha very large (not just infinity) or very few data or no data

>then the posterior is still not a point but completely dominated by

>entropy so maximum a posteriori equals maximum entropy.

Maximum a posteriori estimation is not Bayesian.

Radford Neal

Apr 4, 2002, 5:06:01 PM4/4/02

to

Herman Rubin (hru...@odds.stat.purdue.edu) wrote:

> I agree with Mike. I consider the use of maximum entropy

> to be an attempt to remove the prior from consideration,

> and as such, it is only good if the results it gives are

> similar.

>

> Like other such methods as "non-informative" priors, etc.,

> it is really anti-Bayesian.

Actually, I don't think a non-informative prior is inappropriate

to a situation in which the person doing inference actually lacks

information. -- Mike Hardy

Apr 4, 2002, 11:39:03 PM4/4/02

to

rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr4.1...@jarvis.cs.toronto.edu>...

Nice example. Wrong interpretation.

First of all, you can't quarrel with a theorem. The entropic prior for

the parameters of the mixture i.e. for the means, sds and weights is

proven to be the most difficult to discriminate from an independent

model on the space (data,parameters). Thus, in the absence of all

other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY

THAT IT HAS TO HAVE in order to be the most ignorant about the data.

That's the beauty of mathematics. Once you accept the proof of Theorem

1 you are stuck with it. But that's not bad. That's the power of math.

Now you can go ahead and use the prior in 14 dimensional space without

having to worry about biasing the inferences with unjustified

assumptions. That's essentially the same reason why statistical

mechanics is so successful, as discovered a long time ago by our beloved

guru E.T. (phone home) Jaynes and still, after all these years, unable

to be understood even by so reputable a wiff (well-in-formed-fellow)

as yourself who by the way even presented the problem of estimation of

mixtures with an infinite number of components at one of the MaxEnt

workshops.

OK back to the specifics of your gedankenexperiment. All the prior is

saying is that, in the absence of all other information, the means of

the rare components should be considered more uncertain than the means

of the common components. You may not like that but you have to live

with it. It doesn't matter whether you or I or anyone likes it or not.

If you say, for example: "what the heck I feel intuitively that an

ignorant prior should assign equal uncertainties to all the means

independently of the weights". Then Theorem 1 will tell you that your

intuitive feeling is a superstition. By the way, uncommon components

are observed less often than common ones so more a priori uncertainty

for the mean sounds good to me, again in the absence of all other

information.

Apr 5, 2002, 9:48:03 AM4/5/02

to

>> Radford Neal:

>> Consider the problem in an example context: You are interested in

>> how far beetles travel during a day. With really advanced satellite

>> observation, you can track beetles flying around, but you can't

>> identify the species of beetle. You know there are five species of

>> beetle in a certain forest for which you have data. You therefore

>> model the distribution of distance travelled in a day as a mixture

>> of five normal distributions.

>>

>> Suppose we don't know much about how common the different species are,

>> or how much the beetles travel in a day - the situation to which you

>> say your method applies.

>>

>> The result of your method is a prior which says that the less common

>> beetles are likely to travel very far in a day, or not very far at

>> all, whereas the more common beetles are likely to travel a more

>> moderate distance. This seems to drastically depart from a prior that

>> embodies no precise information. It seems to correspond to a very

>> specific biological theory claiming that rare species have to either

>> travel a lot in a day (to avoid being set upon by gangs of competing

>> beatles?), or alternatively, to stay put. In no way can I accept that

>> this is a prior that will "let the data values speak for themselves".

>>

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>Nice example. Wrong interpretation.

>First of all, you can't quarrel with a theorem. The entropic prior for

>the parameters of the mixture i.e. for the means, sds and weights is

>proven to be the most difficult to discriminate from an independent

>model on the space (data,parameters). Thus, in the absence of all

>other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY

>THAT IT HAS TO HAVE in order to be the most ignorant about the data.

>That's the beauty of mathematics. Once you accept the proof of Theorem

>1 you are stuck with it.

Why should I want a prior that is "most difficult to discriminate from

an independent model"? Or one that is "most ignorant about the data"?

And assuming I did want these things, why should I accept that your

mathematical formulation of what it means to be "ignorant" is the

correct one? These are not mathematical questions which can be

settled by a proof.

>OK back to the specifics of your gedankenexperiment. All the prior is

>saying is that, in the absence of all other information, the means of

>the rare components should be considered more uncertain than the means

>of the common components. You may not like that but you have to live

>with it. It doesn't matter weather you or I or anyone likes it or not.

>If you say, for example: "what the heck I feel intuitively that an

>ignorant prior should assign equal uncertainties to all the means

>independently of the weights". Then Theorem 1 will tell you that your

>intuitive feeling is a superstition.

No, what Theorem 1 tells ME is that your concept of "ignorance" is

flawed. Mathematical formulations of such concepts have to be tested

by checking their consequences in situations where intuitions are

clear. (After all, how else could one verify that the formulation is

correct?) Your formulation fails this test in this example.

>By the way, uncommon components

>are observed less often than common ones so more a priori uncertainty

>for the mean sounds good to me, again in the absence of all other

>information.

That is a good reason why the POSTERIOR uncertainty in the means of

the rare components will be greater. Why you think that this natural

effect of not having much data should be increased by also increasing

the PRIOR uncertainty is a mystery to me.

Radford Neal
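
The point here — that scarce data naturally inflates POSTERIOR
uncertainty, with no need to inflate the prior — is visible in the
standard conjugate normal update: give both components the same broad
prior, and the one with fewer observations simply ends up with a
wider posterior. A minimal sketch with invented numbers:

```python
import math

def normal_posterior(data, prior_mean, prior_sd, noise_sd):
    """Conjugate update for a normal mean with known noise sd and a
    normal prior.  Precisions (inverse variances) add; the posterior
    mean is the precision-weighted average.  Returns (mean, sd)."""
    n = len(data)
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean +
                            data_prec * (sum(data) / n))
    return post_mean, math.sqrt(post_var)

# Both "species" get the SAME broad prior; only the data counts differ.
common = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0]
rare = [24.0, 26.1]
mean_common, sd_common = normal_posterior(common, 0.0, 100.0, 1.0)
mean_rare, sd_rare = normal_posterior(rare, 0.0, 100.0, 1.0)
```

The rare component's posterior sd is wider purely because n is
smaller; the prior sd was identical for both.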

Apr 5, 2002, 10:21:44 AM4/5/02

to

Let me summarize the discussion. We have:

1) John Bailey: http://www.frontiernet.net/~jmb184/

2) Mike Hardy: http://www-math.mit.edu/~hardy/

3) Herman Rubin: http://www.stat.purdue.edu/people/hrubin/

Bailey: Entropy is an important concept in Bayesian Inference.

Hardy: Few people working in Bayesian Inference care about Entropy.

Rubin: The people that use entropy or whatever other so called

"neutral" priors are using unjustified computational copouts.

My position:

1) Hurray for Bailey!

2) Sure Mike but they should know better.

3) I disagree with Rubin's position with all the energy in my

reproductive system.

First of all, as far as it is known today, Entropy, Probability, and

(more recently discovered) Codes (as in binary codes) are pretty much

aspects of the same thing. At a fundamental level, Entropy is just the

number of available distinguishable possibilities in the (neg)-log

scale so that exp(-Entropy) = 1/N = the uniform probability over the space of

distinguishable states. Moreover, there is a one-to-one correspondence

between probability distributions and codes (or rather code lengths of

prefix-free codes)

(e.g. see Grunwald's tutorial

http://quantrm2.psy.ohio-state.edu/injae/workshop.htm). Thus, any

one caring about the meaning and use of Probability theory (Bayesians

or members of the National Rifle Association alike) ought to care

about Entropy and Codes.
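
The correspondence between distributions and prefix-free code lengths
mentioned above is the Kraft inequality plus Shannon coding: lengths
l_i = ceil(-log2 p_i) always satisfy sum(2^-l_i) <= 1, so a
prefix-free code with those lengths exists, and its expected length
is within one bit of the entropy. A minimal sketch:

```python
import math

def shannon_code_lengths(p):
    """Shannon code lengths l_i = ceil(-log2 p_i).  These satisfy the
    Kraft inequality sum(2**-l_i) <= 1, so a prefix-free code with
    these lengths exists, and the expected length is within one bit
    of the entropy H(p)."""
    return [math.ceil(-math.log2(q)) for q in p]

p = [0.5, 0.25, 0.125, 0.125]      # a dyadic distribution, for clarity
lengths = shannon_code_lengths(p)
kraft = sum(2.0 ** -l for l in lengths)
entropy = -sum(q * math.log2(q) for q in p)
avg_len = sum(q * l for q, l in zip(p, lengths))
```

For a dyadic distribution like this one the expected code length
equals the entropy exactly; in general it exceeds it by less than one
bit.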

Second. More than seventy (70) years of de Finetti/Savage subjectivism

have produced ZIP beyond beautiful suntans from the coasts of Spain!

Third. Current action in fundamental statistical inference (aside from

computational issues) is about objective (or as objective as possible)

quantifications of prior information. Information geometry, MDL

principle, Entropic Priors, Bayesian Networks and Statistical Learning

Theory are pushing the envelope.

hru...@odds.stat.purdue.edu (Herman Rubin) wrote in message news:<a8fhr9$1q...@odds.stat.purdue.edu>...

Apr 5, 2002, 12:20:38 PM4/5/02

to

Radford Neal (rad...@cs.toronto.edu) wrote:

> Why should I want a prior that is "most difficult to discriminate from

> an independent model"? Or one that is "most ignorant about the data"?

Your prior needs to incorporate your ignorance if you are ignorant.

Tomorrow's weather and the outcome of a coin toss are _conditionally_

_independent_given_my_knowledge_ if I have no knowledge of any connection

between them.

Mike Hardy

Apr 5, 2002, 2:14:50 PM4/5/02

to

> Radford Neal (rad...@cs.toronto.edu) wrote:

>

>> Why should I want a prior that is "most difficult to discriminate from

>> an independent model"? Or one that is "most ignorant about the data"?

>

Michael J Hardy <mjh...@mit.edu> wrote:

>

> Your prior needs to incorporate your ignorance if you are ignorant.

There's a logical gap between saying "this prior expresses ignorance

about the data" and "I'm ignorant, therefore I should use this prior".

The first statement implicitly assumes that there's only one possible

"state of ignorance". But it's not clear that real people can be

ignorant in only one way.

As evidence for this logical gap, one need only see that "objective"

Bayesians have come up with numerous priors that all supposedly

express ignorance. It's like the joke about standards for programming

languages - "If one standard is good, then three standards must be

even better!".

>Tomorrow's weather and the outcome of a coin toss are _conditionally_

>_independent_given_my_knowledge_ if I have no knowledge of any connection

>between them.

If you're SURE that there's no connection, then you're not ignorant at

all about the relationship (however ignorant you may be about

individual coin tosses and thunderstorms). If you're NOT sure that

there's no relationship, then the independence applies only to the

FIRST coin toss and thunderstorm. Once you are dealing with more than

one toss, you need to use a prior that expresses how likely the

various possible relationships are. This is related to the fallacy

behind Jaynes' contention that the laws of statistical mechanics can be

derived from the maximum entropy principle, without the need for any

input of physical information.

Radford Neal
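
The point that independence can hold only for the FIRST toss has a
textbook illustration: put a uniform prior on the coin's bias and the
marginal sequence of tosses is exchangeable but not independent
(Laplace's rule of succession). A small sketch, not taken from the
thread:

```python
from fractions import Fraction

def predictive_prob(heads_so_far, tosses_so_far):
    """Posterior predictive P(next toss = H | data) under a uniform
    Beta(1, 1) prior on the coin's bias: Laplace's rule of
    succession, (heads + 1) / (tosses + 2)."""
    return Fraction(heads_so_far + 1, tosses_so_far + 2)

p_h = predictive_prob(0, 0)           # P(first toss = H) = 1/2
p_hh = p_h * predictive_prob(1, 1)    # P(H then H) = 1/2 * 2/3 = 1/3
# If the tosses were marginally independent, P(HH) would be 1/4.
```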

Apr 5, 2002, 3:00:55 PM4/5/02

to

rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr5.0...@jarvis.cs.toronto.edu>...

> >> Radford Neal:

>

> Why should I want a prior that is "most difficult to discriminate from

> an independent model"? Or one that is "most ignorant about the data"?

> And assuming I did want these things, why should I accept that your

> mathematical formulation of what it means to be "ignorant" is the

> correct one? These are not mathematical questions which can be

> settled by a proof.

Recall: X and Y independent iff:

1) P(X|Y) = P(X)

and

2) P(Y|X) = P(Y)

provided both conditionals exist or more conveniently, but less

enlightening,

X and Y independent iff

P(X and Y) = P(X) P(Y)

By "X is ignorant about Y" I mean X is independent of Y. PERIOD.

How much more ignorant of each other can X and Y be?

Are you suggesting changing the meaning of independence?
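
The factorization definition above is easy to check mechanically on a
finite joint table. A minimal sketch (the coin/weather numbers are
invented for illustration):

```python
def is_independent(joint, tol=1e-9):
    """Check P(X=x and Y=y) == P(X=x) * P(Y=y) for every cell of a
    joint distribution given as a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal P(X=x)
        py[y] = py.get(y, 0.0) + p   # marginal P(Y=y)
    return all(abs(p - px[x] * py[y]) <= tol
               for (x, y), p in joint.items())

# A coin toss and tomorrow's weather, modelled as independent:
indep = {('H', 'rain'): 0.15, ('H', 'dry'): 0.35,
         ('T', 'rain'): 0.15, ('T', 'dry'): 0.35}
# A dependent pair: heads goes with rain more often than chance:
dep = {('H', 'rain'): 0.25, ('H', 'dry'): 0.25,
       ('T', 'rain'): 0.05, ('T', 'dry'): 0.45}
```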

>

> >OK back to the specifics of your gedankenexperiment. All the prior is

> >saying is that, in the absence of all other information, the means of

> >the rare components should be considered more uncertain than the means

> >of the common components. You may not like that but you have to live

> >with it. It doesn't matter weather you or I or anyone likes it or not.

> >If you say, for example: "what the heck I feel intuitively that an

> >ignorant prior should assign equal uncertainties to all the means

> >independently of the weights". Then Theorem 1 will tell you that your

> >intuitive feeling is a superstition.

>

> No, what Theorem 1 tells ME is that your concept of "ignorance" is

> flawed. Mathematical formulations of such concepts have to be tested

There is no mysterious concept of "ignorance" anymore. It is JUST

INDEPENDENCE!

(see the above)

> >By the way, uncommon components

> >are observed less often than common ones so more a priori uncertainty

> >for the mean sounds good to me, again in the absence of all other

> >information.

>

> That is a good reason why the POSTERIOR uncertainty in the means of

> the rare components will be greater. Why you think that this natural

> effect of not having much data should be increased by also increasing

> the PRIOR uncertainty is a mystery to me.

>

BECAUSE: by assumption the only information assumed is the likelihood.

The ignorant prior is only consistent with the info explicitly

provided, in this case by the likelihood. The parameters for the

uncommon components need to be obviously more uncertain otherwise you

would be claiming a source of information other than the likelihood.

Think about it this way. If you assume that you can ONLY learn about

the beetles by observing them, then you can only know more about the

ones that you can observe more. Whatever prior information you are

going to provide about the rare species of beetles would have to have

come from past observations, and by assumption these are more scarce;

ergo, more prior uncertainty is just compatible with that.

Apr 5, 2002, 4:00:09 PM4/5/02

to

In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>Recall: X and Y independent iff:

>1) P(X|Y) = P(X)

>and

>2) P(Y|X) = P(Y)

>provided both conditionals exist or more conveniently, but less

>enlightening,

>X and Y independent iff

>

> P(X and Y) = P(X) P(Y)

>

>By "X is ignorant about Y" I mean X is independent of Y. PERIOD.

>How much more ignorant of each other can X and Y be?

>Are you suggesting changing the meaning of independence?

No. I'm suggesting that "independence" and "ignorance" may not be the

same thing. For one thing, independence is a relationship between

random variables, whereas ignorance is a relationship between a person

and a situation (perhaps described by a set of random variables). So

your phrase "X is ignorant about Y", in which X is a random variable

really makes no sense.

>> >By the way, uncommon components

>> >are observed less often than common ones so more a priori uncertainty

>> >for the mean sounds good to me, again in the absence of all other

>> >information.

>>

>> That is a good reason why the POSTERIOR uncertainty in the means of

>> the rare components will be greater. Why you think that this natural

>> effect of not having much data should be increased by also increasing

>> the PRIOR uncertainty is a mystery to me.

>

>BECAUSE: by assumption the only information assumed is the likelihood.

>The ignorant prior is only consistent with the info explicitly

>provided, in this case by the likelihood. The parameters for the

>uncommon components need to be obviously more uncertain otherwise you

>would be claiming a source of information other than the likelihood.

>Think about it this way. If you assume that you can ONLY learn about

>the beetles by observing them, then you can only know more about the

>ones that you can observe more.

But you're claiming to know more about the more common beetles even

BEFORE you observe them, just because you're ANTICIPATING observing

them later on. This is irrational.

>Whatever prior information you are

>going to provide about the rare species of beetles would have to have

>come from past observations and by assumption these are more scarce

>ergo more prior uncertainty is just compatible with that.

Why can't I have prior information about beetles based on my general

knowledge of biology, rather than based on having run the EXACT same

experiment previously, as you seem to be assuming? Note that ALL

humans have quite a bit of general knowledge about biology (being

biological entities themselves).

Radford Neal

Apr 5, 2002, 4:14:39 PM4/5/02

to

In article <3cacce49$0$3930$b45e...@senator-bedfellow.mit.edu>,

> Herman Rubin (hru...@odds.stat.purdue.edu) wrote:

For one thing, does such a situation exist?

For another, what is non-informative?

In some cases, one can show that certain types of priors

give good approximations, and even that procedures which

do not compute posteriors can be good.

For example, in testing a point null against a finite

dimensional composite alternative, placing a point mass

at the null and a constant density on the alternative

yields robust results for moderately large samples, and

this works even if one has to use some asymptotic theory

for the test statistics allowed, such as requiring that

the Kolmogorov-Smirnov test be used at the Bayesian

level for it.

Apr 5, 2002, 4:25:04 PM4/5/02

to

In article <2002Apr5.0...@jarvis.cs.toronto.edu>,

Radford Neal <rad...@cs.toronto.edu> wrote:

>>> Radford Neal:

..................

>Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>>Nice example. Wrong interpretation.

>>First of all, you can't quarrel with a theorem. The entropic prior for

>>the parameters of the mixture i.e. for the means, sds and weights is

>>proven to be the most difficult to discriminate from an independent

>>model on the space (data,parameters). Thus, in the absence of all

>>other information, WHATEVER PROPERTY THIS PRIOR HAS IS THE PROPERTY

>>THAT IT HAS TO HAVE in order to be the most ignorant about the data.

>>That's the beauty of mathematics. Once you accept the proof of Theorem

>>1 you are stuck with it.

Why should anyone want to consider this as a criterion? It

is a pure mathematics criterion, and what does it have to

do with the problem of statistical inference?

>Why should I want a prior that is "most difficult to discriminate from

>an independent model"? Or one that is "most ignorant about the data"?

>And assuming I did want these things, why should I accept that your

>mathematical formulation of what it means to be "ignorant" is the

>correct one? These are not mathematical questions which can be

>settled by a proof.

The attempt to avoid input from the one with the statistical

problem violates several of my "Commandments". Here they

are, and this does show what one needs to consider. In

particular, Mr. Rodriguez is violating either #3 or #5.

It is religious ritual, rather than good statistics, to

let the prior or loss (and it is only their product which

matters in any case) come from anything other than consideration

of the problem. Now one might be able to show that using

maximum entropy does a good job of approximating the

results wanted, in which case #4 can be used to justify it.

But, without doing this, there is no justification for the

use of maximum entropy.

I am often requested to repost my five commandments. These are

posted here without exegesis.

For the client:

1. Thou shalt know that thou must make assumptions.

2. Thou shalt not believe thy assumptions.

For the consultant:

3. Thou shalt not make thy client's assumptions for him.

4. Thou shalt inform thy client of the consequences

of his assumptions.

For the person who is both (e. g., a biostatistician or psychometrician):

5. Thou shalt keep thy roles distinct, lest thou violate

some of the other commandments.

The consultant is obligated to point out how their assumptions affect

their views of their domain; this is in the 4-th commandment. But the

consultant should be very careful in the assumption-making process not

to intrude beyond possibly pointing out that certain assumptions make

large differences, while others do not. A good example here is regression

analysis, where often normality has little effect, but the linearity of

the model is of great importance. Thus, it is very important for the

client to have to justify transformations.

There are, unfortunately, many fields in which much of the activity

consists of using statistical procedures without regard for any assumptions.

Apr 6, 2002, 2:05:39 PM4/6/02

to

rad...@cs.toronto.edu (Radford Neal) wrote in message news:<2002Apr5.1...@jarvis.cs.toronto.edu>...

> In article <c54f89f.02040...@posting.google.com>,

> Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>

> >Recall: X and Y independent iff:

> >1) P(X|Y) = P(X)

> >and

> >2) P(Y|X) = P(Y)

> >provided both conditionals exist or more conveniently, but less

> >enlightening,

> >X and Y independent iff

> >

> > P(X and Y) = P(X) P(Y)

> >

> >By "X is ignorant about Y" I mean X is independent of Y. PERIOD.

> >How much more ignorant of each other can X and Y be?

> >Are you suggesting changing the meaning of independence?

>

> No. I'm suggesting that "independence" and "ignorance" may not be the

> same thing. For one thing, independence is a relationship between

> random variables, whereas ignorance is a relationship between a person

> and a situation (perhaps described by a set of random variables). So

> your phrase "X is ignorant about Y", in which X is a random variable

> really makes no sense.

>

This sounds like a desperate kick to me.

Just model "person" by another set of rvs that specify its state i.e.

the parameters of the likelihood.

> >> >By the way, uncommon components

> >> >are observed less often than common ones so more a priori uncertainty

> >> >for the mean sounds good to me, again in the absence of all other

> >> >information.

> >>

> >> That is a good reason why the POSTERIOR uncertainty in the means of

> >> the rare components will be greater. Why you think that this natural

> >> effect of not having much data should be increased by also increasing

> >> the PRIOR uncertainty is a mystery to me.

> >

> >BECAUSE: by assumption the only information assumed is the likelihood.

> >The ignorant prior is only consistent with the info explicitly

> >provided, in this case by the likelihood. The parameters for the

> >uncommon components need to be obviously more uncertain otherwise you

> >would be claiming a source of information other than the likelihood.

> >Think about it this way. If you assume that you can ONLY learn about

> >the beetles by observing them, then you can only know more about the

> >ones that you can observe more.

>

> But you're claiming to know more about the more common beetles even

> BEFORE you observe them, just because you're ANTICIPATING observing

> them later on. This is irrational.

>

Behind your bravado I sense that you are about to get the point.

Think about it this way: Suppose that you assume equal uncertainty

about the parameters of all the components. If all the components are

assumed identical then it makes sense BUT if you assume that one

component is more rare than the others THEN the assumption of equal

uncertainty can not be done WITHOUT claiming extra knowledge beyond

the likelihood. (see below for more… )

> >Whatever prior information you are

> >going to provide about the rare species of beetles would have to have

> >come from past observations and by assumption these are more scarce

> >ergo more prior uncertainty is just compatible with that.

>

> Why can't I have prior information about beetles based on my general

> knowledge of biology, rather than based on having run the EXACT same

> experiment previously, as you seem to be assuming? Note that ALL

> humans have quite a bit of general knowledge about biology (being

> biological entities themselves).

>

> Radford Neal

This last paragraph of yours (above) clearly shows why we keep barking

at two quite different trees.

Why can't I have prior info…?

Sure. You can have all kinds of prior info about beetles. The more the

better.

But if you do YOU HAVE TO ADD THAT PRIOR INFO EXPLICITLY. Either

directly to the model, to the initial guess h() or as a constraint to

the variational problem. Once you have explicitly accounted for all

the prior info that you claim you have THEN you want to find the prior

distribution that uses that prior info AND NOTHING ELSE. That's as

honest and as objective as anyone can be.

Encore:

There is nothing wrong with using convenience priors, especially if you

are already getting useful answers with them.

(If it works… it's true! Isn't that the American way? Hmm there are

"issues"…)

With convenience priors either the data swamps the prior assumptions

OR you hit the gold by a bit of luck and clever design (always

useful).

But today we offer our customers a new alternative…. Encode

what you claim you know EXPLICITLY and then maximize honesty to get

THE ENTROPIC PRIOR, then sit back, relax and enjoy the show!

Only one problem: It may not be cheap! You may need to build a new

computer or just settle for a cheap approximation in some cases.

Apr 6, 2002, 2:54:57 PM

to

rad...@cs.toronto.edu (Radford Neal) wrote:

>> No. I'm suggesting that "independence" and "ignorance" may not be the

>> same thing. For one thing, independence is a relationship between

>> random variables, whereas ignorance is a relationship between a person

>> and a situation (perhaps described by a set of random variables). So

>> your phrase "X is ignorant about Y", in which X is a random variable

>> really makes no sense.

>>

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>This sounds like a desperate kick to me.

>Just model "person" by another set of rvs that specify its state i.e.

>the parameters of the likelihood.

This makes even less sense. I had thought that, like almost all

Bayesians, you viewed probability as a representation of beliefs.

(Our disagreement, I thought, was over whether there is any such thing

as completely "objective" beliefs). So whose beliefs are being

modeled by this joint distribution over the random variables

describing the world and the random variables describing the person's

beliefs?

>Behind your bravado I sense that you are about to get the point.

>Think about it this way: Suppose that you assume equal uncertainty

>about the parameters of all the components. If all the components are

>assumed identical then it makes sense BUT if you assume that one

>component is more rare than the others THEN the assumption of equal

>uncertainty can not be done WITHOUT claiming extra knowledge beyond

>the likelihood. (see below for more).

Once you realize that the species may differ in abundance, then you

might indeed wonder whether your prior beliefs about other

characteristics should depend on the abundance. You have to think

about it. But it seems pretty bizarre to me to take the position

that believing these other characteristics vary in a rather peculiar

way with abundance is the DEFAULT, which you should adopt as your

belief if you haven't any reason not to.

>Sure. You can have all kinds of prior info about beetles. The more the

>better.

>But if you do YOU HAVE TO ADD THAT PRIOR INFO EXPLICITLY. Either

>directly to the model, to the initial guess h() or as a constraint to

>the variational problem. Once you have explicitly accounted for all

>the prior info that you claim you have THEN you want to find the prior

>distribution that uses that prior info AND NOTHING ELSE. That's as

>honest and as objective as anyone can be.

This sounds attractive. The problem is that it just doesn't work.

The attempts to formalize the idea of using the explicit information

"and nothing else" produce results that are neither unique nor

(in some cases) sensible.

Radford Neal

Apr 6, 2002, 8:47:04 PM

to

On Fri, 29 Mar 2002 12:51:29 GMT, jmb...@frontiernet.net (John Bailey)

wrote:

>On Thu, 28 Mar 2002 19:33:24 -0800, "James A. Bowery"

><jim_b...@hotmail.com> wrote:

>

>>I'm interested in locating fundamental work in maximum entropy imputation

>>for simple data tables.

Given the original poster's question, does anyone have better

suggestions for references to fundamental work on maximum entropy

imputation than these?

>

>Missing Data, Censored Data, and Multiple Imputation

>http://cm.bell-labs.com/cm/ms/departments/sia/project/mi/index.html

>

>Bayesian Statistics

>http://cm.bell-labs.com/cm/ms/departments/sia/project/bayes/index.html

>

>Multiple Imputation

>http://www.stat.ucla.edu/~mhu/impute.html

>

>"Multiple Imputation for Missing Data: Concepts and New Development"

>http://www.sas.com/rnd/app/papers/multipleimputation.pdf

>

>Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys,

>New York: John Wiley & Sons, Inc.

>

>Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, New

>York: Chapman and Hall

>

>http://www.sas.com/rnd/app/da/new/pdf/dami.pdf

>

>Multiple Imputation References

>http://www.statsol.ie/solas/sorefer.htm

>

>John

Apr 7, 2002, 10:28:33 PM

to

I think we have reached the point of diminishing returns.

We have barked at each other our points several times and we may

finally agree to disagree.

To make a last attempt at convincing you (and other present and future

watchers out there) of the importance of Theorem1 in,

http://omega.albany.edu:8008/0201016.pdf

let's clean the blackboard and summarize.

============================================

General Fact1:

Among all possible distributions on the parameters of a regular

parametric model, the one that is most difficult to discriminate from

an independent model on (parameters,data) is the Entropic Prior.

Specific Fact2:

The Entropic Prior for the parameters of a Gaussian mixture turns out

to be very similar to the popular conjugate prior except that the

uncertainty on the parameters of each component depends on the weight

assigned to that component. The smaller the weight the larger the

uncertainty.

==============================================

Your (RN) position:

Forget Fact1. I find your Specific Fact2 counter-intuitive and even

irrational. Ergo, we can happily forget about the Entropic Prior

business.

My (CR) position:

Whatever intuition you may have about how a most ignorant prior about

the data should look like, if your intuition doesn't agree with the

Facts above then, provided the Facts above are correct, we can happily

forget about your intuition.

==============================================

Your Argument:

1) Why should anyone care about the prior that YOU say is the most

difficult to discriminate from an independent model meaning something

that looks to me like a cooked-up manipulation of symbols to get what

you want?

2) And look, your specific Fact2 is clearly crazy for I can find lots

of real life examples where it appears as encoding prior information

that doesn't exist or, even worse, runs contrary to what we know

for that problem.

Here is an example: Suppose we want to study how far the

population of beetles from the north of Sri Lanka are able to travel.

Suppose that we know that the beetles from Sri Lanka are of one of two

kinds: one populous species and one rare species. We naturally model

the observed data of traveled distances as a two-component mixture of

Gaussians. Your entropic prior will assign A PRIORI more uncertainty to

the average distance traveled by the rare species. That's SILLY, I

could have all kinds of bio info against that!

My Argument:

1) The math behind Fact1 is standard and (subtle but once understood)

trivial.

The Kullback number between two probability measures P and Q, denoted

I(P:Q)

(*** where for us:

P=f(data|params)*p(params) (i.e. likelihood times prior) and

Q=h(data)*g(params) (i.e. an independent model, some arbitrary but

fix density h() for data and the local uniform g() on the (manifold)

parameters)

defined on the same measurable space (data,parameters)

****)

is the universally accepted information-theoretic-probabilistic

measure of how easy it is to discriminate Q from P. It is nothing but

the mean information for discrimination in favor or P and against Q

when sampling from P. Look at the first chapter of Kullback's book or

ask your gurus or search the net or whatever. Just in case you still

have issues with I(P:Q) let me remind anyone watching that a simple

monotone increasing function of I(P:Q) is an upper bound to the total

variation distance between P and Q (Bretagnolle-Huber inequality).

TRANSLATION: If I(P:Q) is small (close to 0) then P and Q are close in

total variation i.e. close in the most natural way for probability

measures.
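
For anyone who wants to check that TRANSLATION numerically, here is a tiny
sketch (the two distributions are made up purely for illustration; only the
inequality itself is Bretagnolle-Huber):

```python
import math

# Two made-up discrete distributions on the same 3-point support.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Kullback number I(P:Q) = sum_i p_i * log(p_i / q_i)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Total variation distance = (1/2) * sum_i |p_i - q_i|
tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Bretagnolle-Huber: TV(P,Q) <= sqrt(1 - exp(-I(P:Q)))
bound = math.sqrt(1.0 - math.exp(-kl))

assert tv <= bound  # small I(P:Q) forces P and Q close in total variation
```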

FACT1 (again):

The proper prior p(params) that minimizes I(P:Q), when data consists

of alpha independent observations of the model, is the Entropic Prior

with parameters h and alpha. It is only natural to call this prior

most ignorant about the data since Q is an independent (product) model

"h(data)*g(params)" where params are statistically independent of the

data.
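
In symbols (my transcription, using the notation of the parenthetical above;
see the linked paper for the exact regularity conditions and the proof):

```latex
% Sketch of Fact1: the proper prior minimizing I(P:Q), with
% P = f(data|params) * p(params) and Q = h(data) * g(params),
% is the entropic prior
p^{*}(\theta) \;\propto\; g(\theta)\,\exp\bigl(-\alpha\, I(\theta)\bigr),
\qquad
I(\theta) \;=\; \int f(x \mid \theta)\,\log\frac{f(x \mid \theta)}{h(x)}\,dx .
```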

There is nothing fishy or unnatural about Fact1. The true power of

Fact1 comes from its generality. It holds for ANY regular hypothesis

space in any number of dimensions. Even in infinite dimensional (i.e.

for stochastic processes) hypothesis spaces… but there is no room in

the electronic margins of this note to show you the proof… (ok I am

pushing it a little…).

*** I remind whoever is listening that once you allow Fact1 to get

"jiggy" (as in Austin Powers' "Get jiggy with it") with your mind,

you become pregnant and there is no need to bother with answering (2).

Your baby-to-be will give you the answer! For those virtuous minds

still out there, here is a way:

2) The only prior information that we assume that we have about the

beetles is the one in the likelihood and the parameters of the

entropic prior (h and alpha). NOTHING ELSE. If there is extra prior

info, biological or whatever, that info must be EXPLICITLY included in

the problem. Either in the likelihood, in h and alpha, or as a constraint for

the minimization of I(P:Q). Only after including ALL the info that we

want to consider, only after that, we maximize honesty and take the

most ignorant prior consistent with what we know. Fact2, as it is

presented here, applies only to that state of ignorance.

When all we assume we know is the likelihood, Fact2 is not only sane

but obvious. Of course the parameters of the rare components of the

mixture are A PRIORI more uncertain. There is always less info coming

from there and we know that A PRIORI even BEFORE we collect any data.

Another way to state this is:

THE ONLY way to be able to assume equal uncertainty for all the

components regardless of their mixture weights is to ASSUME a source

of information OTHER than the likelihood. Q.E.D.
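
If it helps, here is a back-of-the-envelope sketch of the same point. It only
shows the sqrt(n*w) effect for sample means, NOT the entropic-prior derivation
itself, and the numbers are invented:

```python
import math

# Rough illustration: with n total observations from the mixture, component k
# contributes about n * w_k of them, so the standard error of its sample mean
# scales like sigma / sqrt(n * w_k): smaller weight => larger uncertainty.
def mean_std_error(sigma, n, weight):
    return sigma / math.sqrt(n * weight)

n, sigma = 1000, 1.0
common = mean_std_error(sigma, n, weight=0.9)  # populous component
rare = mean_std_error(sigma, n, weight=0.1)    # rare component

assert rare > common  # the rare component's mean is known less precisely
```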

Extra bonus: The above argument opens the gates of uncertainty to all

the MCMC simulations based on the standard conjugate prior for

mixtures of Gaussians.

P.S.

I am willing to spend another google amount of time because I do find

you one of the coolest MCMC guys around. Your exposition of the hybrid

Monte Carlo method in http://omega.albany.edu:8008/neal.pdf was an

eye opener for me, and I believe there is still a diamond mine to be

discovered along those directions. Now that you know that I know that

I think you are so cool let me tell you that you are nevertheless

human. But that's OK. Isn't it?

(It would be great if you go to Moscow, Id. this summer for

MaxEnt2002)

Apr 8, 2002, 12:16:53 AM

to

In article <c54f89f.02040...@posting.google.com>,

Carlos C. Rodriguez <car...@math.albany.edu> wrote:

>I think we have reached the point of diminishing returns.

>We have barked at each other our points several times and we may

>finally agree to disagree.

I think you're right. Your summary of the two positions is reasonably

accurate. When you get to arguing that yours is the correct position,

I of course disagree, and I could explain why I think so - but I've

already explained in previous posts, so we should probably let readers

of this thread (assuming there still are any) ponder the matter on

their own.

Radford Neal

Apr 12, 2002, 9:17:27 AM

to

In article <c54f89f.02040...@posting.google.com>,

>I think we have reached the point of diminishing returns.

>We have barked at each other our points several times and we may

>finally agree to disagree.

>To make a last attempt at convincing you (and other present and future

>watchers out there) of the importance of Theorem1 in,

>http://omega.albany.edu:8008/0201016.pdf

>let's clean the blackboard and summarize.

>General Fact1:

>Among all possible distributions on the parameters of a regular

>parametric model, the one that is most difficult to discriminate from

>an independent model on (parameters,data) is the Entropic Prior.

This is from the standpoint of Wiener-Shannon information,

not that of statistical inference.

>Specific Fact2:

>The Entropic Prior for the parameters of a Gaussian mixture turns out

>to be very similar to the popular conjugate prior except that the

>uncertainty on the parameters of each component depends on the weight

>assigned to that component. The smaller the weight the larger the

>uncertainty.

I am strongly opposed to the anti-Bayesian use of the

conjugate prior, preferring instead to look at robustness

of the procedure. If estimating a normal mean with the

prior being not too concentrated, a normal prior is the

one I would be least likely to use, as it is far too

sensitive. If the prior is concentrated, it does not

make much difference if it is normal or not, as long as

it has a small variance; if it does not have a small

variance, robustness is very difficult.

In fact, if one assumes a normal prior, the Bayes risk

is at most doubled if one replaces a prior whose variance

is at least the data variance by an infinite variance, and

if the variance is at most the data variance by a one-point

distribution.
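
A small numerical check of the first claim in the standard normal-normal
setup (x ~ N(theta, s2), true prior theta ~ N(0, t2)); the code is my sketch,
not the poster's:

```python
# Posterior-mean Bayes risk under the true prior: r_opt = t2*s2/(t2+s2).
# Pretending the prior has infinite variance gives the estimator x,
# whose Bayes risk is s2 for every theta, hence s2 under the true prior too.
def bayes_risk_posterior_mean(t2, s2):
    return t2 * s2 / (t2 + s2)

s2 = 1.0
for t2 in [1.0, 2.0, 10.0, 100.0]:  # prior variance >= data variance
    r_opt = bayes_risk_posterior_mean(t2, s2)
    assert s2 <= 2.0 * r_opt  # infinite-variance prior at most doubles the risk
```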

The prior should come from the user's assumptions, not

from mathematical convenience. One can use robustness

theorems to approximate procedures, but it is the effect

on the risk, not the closeness of the prior, which is

the relevant consideration, and these are quite different.

In testing a point or local hypothesis, the prior

probability of the hypothesis is often totally irrelevant

if there is any data.
