444 views

Skip to first unread message

May 28, 2009, 7:56:54 AM5/28/09

to one-...@googlegroups.com

This is a reply to a part of Abram's "join post" on the everything-list

(http://dragonlogic-ai.blogspot.com/2008/11/interesting-mailing-list-i-recently.html).

I'm the owner of that mailing list but haven't been following it closely in

recent times, so I only just noticed Abram's post. I'm replying here since

the topic might be more suitable for this list.

(http://dragonlogic-ai.blogspot.com/2008/11/interesting-mailing-list-i-recently.html).

I'm the owner of that mailing list but haven't been following it closely in

recent times, so I only just noticed Abram's post. I'm replying here since

the topic might be more suitable for this list.

Abram Demski wrote:

> As I said, I'm also interested in the notion of probability. I

> disagree with Solomonoff's universal distribution

> (http://en.wikipedia.org/wiki/Ray_Solomonoff), because it assumes that

> the universe is computable. I cannot say whether the universe we

> actually live in is computable or not; however, I argue that,

> regardless, an uncomputable universe is at least conceivable, even if

> it has a low credibility. So, a universal probability distribution

> should include that possibility.

This is my position as well: there is no reason to assume that the universe

must be computable, as Solomonoff's universal distribution does. For a while

I tried to learn enough logic to see what a truly universal distribution

would look like. But then I found an argument, related to Berry's Paradox,

that seems to show that any attempt at a truly universal distribution must

fail. (From Wikipedia: The Berry paradox is a self-referential paradox

arising from the expression "the smallest possible integer not definable by

a given number of words.")

The argument is very simple. Consider an arbitrary probability distribution

P, and the smallest integer (or the lexicographically least object) x such

that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short

description, a universal distribution shouldn't assign it such a low

probability, but P does, so P can't be a universal distribution.

(I've stated variants of this argument before at [1] and [2], but this

version seems to be more general.)

Is there a way around this? If not, how should we (or an AI) deal with

uncertainty without being able to use a universal distribution as the prior?

[1] "is induction unformalizable?"

http://groups.google.com/group/everything-list/browse_thread/thread/c7442c13ff1396ec

[2] http://lesswrong.com/lw/f9/a_request_for_open_problems/bws

May 28, 2009, 11:21:46 PM5/28/09

to One Logic

Solomonoff's universal distribution isn't trying to assume the

universe is computable. It's trying to beat any computable

probability distribution you could put over the universe. In other

words, if your attempt to *calculate*, to *talk about* and *reason

about* the universe is computable - even if you are talking *about*

systems that you think are uncomputable - then your ratiocinations

appear somewhere in Solomonoff induction.

On May 28, 4:56 am, "Wei Dai" <wei...@weidai.com> wrote:

> This is a reply to a part of Abram's "join post" on the everything-list

> (http://dragonlogic-ai.blogspot.com/2008/11/interesting-mailing-list-i...).

> [1] "is induction unformalizable?"http://groups.google.com/group/everything-list/browse_thread/thread/c...

> [2]http://lesswrong.com/lw/f9/a_request_for_open_problems/bws

universe is computable. It's trying to beat any computable

probability distribution you could put over the universe. In other

words, if your attempt to *calculate*, to *talk about* and *reason

about* the universe is computable - even if you are talking *about*

systems that you think are uncomputable - then your ratiocinations

appear somewhere in Solomonoff induction.

On May 28, 4:56 am, "Wei Dai" <wei...@weidai.com> wrote:

> This is a reply to a part of Abram's "join post" on the everything-list

> [2]http://lesswrong.com/lw/f9/a_request_for_open_problems/bws

May 29, 2009, 3:31:29 PM5/29/09

to one-...@googlegroups.com

Eliezer,

Good point. Although, since there are other uncomputable distributions

that dominate the Solomonoff distribution, that description does not

uniquely recommend the Solomonoff distribution. One alternative would

be Jurgen Schmidhuber's universal distribution over what he called

"descriptions". But, that distribution is not even computable in the

limit, unlike the Solomonoff distribution. This could be used to argue

that the Solomonoff distribution is the "best we can do" in some

sense... but, even though we cannot approximate Jurgen's distribution

to arbitrary precision given enough processing power, I suppose we

could *partially* approximate it, using a theorem prover to prove

convergence for as many descriptions as possible... and, a similar

argument holds for even more difficult distributions.

The question is, is there a significant sense in which this is

different from approximating the Solomonoff distribution?

(For those on this list not familiar with these issues, I recommend

starting by looking up "algorithmic information theory".)

--Abram

--

Abram Demski

http://dragonlogic-ai.blogspot.com/

Good point. Although, since there are other uncomputable distributions

that dominate the Solomonoff distribution, that description does not

uniquely recommend the Solomonoff distribution. One alternative would

be Jurgen Schmidhuber's universal distribution over what he called

"descriptions". But, that distribution is not even computable in the

limit, unlike the Solomonoff distribution. This could be used to argue

that the Solomonoff distribution is the "best we can do" in some

sense... but, even though we cannot approximate Jurgen's distribution

to arbitrary precision given enough processing power, I suppose we

could *partially* approximate it, using a theorem prover to prove

convergence for as many descriptions as possible... and, a similar

argument holds for even more difficult distributions.

The question is, is there a significant sense in which this is

different from approximating the Solomonoff distribution?

(For those on this list not familiar with these issues, I recommend

starting by looking up "algorithmic information theory".)

--Abram

Abram Demski

http://dragonlogic-ai.blogspot.com/

May 29, 2009, 3:34:55 PM5/29/09

to One Logic

Eliezer Yudkowsky wrote:

> Solomonoff's universal distribution isn't trying to assume the

> universe is computable. It's trying to beat any computable

> probability distribution you could put over the universe. In other

> words, if your attempt to *calculate*, to *talk about* and *reason

> about* the universe is computable - even if you are talking *about*

> systems that you think are uncomputable - then your ratiocinations

> appear somewhere in Solomonoff induction.

> Solomonoff's universal distribution isn't trying to assume the

> universe is computable. It's trying to beat any computable

> probability distribution you could put over the universe. In other

> words, if your attempt to *calculate*, to *talk about* and *reason

> about* the universe is computable - even if you are talking *about*

> systems that you think are uncomputable - then your ratiocinations

> appear somewhere in Solomonoff induction.

That's an interesting perspective. I agree that a human being, assuming he

is computable, couldn't beat Solomonoff's universal distribution in a

prediction game. He might come to believe that the universe has uncomputable

physics, but that doesn't seem to help him gain an advantage over

Solomonoff.

But what if the human could augment himself with uncomputable physics, and

start thinking uncomputable thoughts? He could then have an uncomputable

prior, which could possibly beat Solomonoff's universal distribution. Can a

Bayesian agent do the same thing? My initial thought is "no", since a

Bayesian isn't supposed to change his prior. But a Bayesian agent can

perhaps build external calculating devices (e.g. hypercomputers) to help

himself, and maybe the combined system of agent+hypercomputer can be modeled

as an extended agent with an uncomputable prior. I'll have to think about

this some more...

May 29, 2009, 3:40:59 PM5/29/09

to one-...@googlegroups.com

Wei Dai,

Glad to hear from you. I am struck anew by the simplicity of this

paradox every time I look at it.

For those on this list not familiar with the background material for

this paradox, it essentially relies on a formalization of Occam's

Razor. The basic assumption is that, in absence of evidence, the only

deciding factor between two theories is their length written down--

probability decreases as length increases. So, anything with a short

description length is fairly probable.

I've attempted to reply to this paradox before, but I only sort of

waved my hands and pointed to assumptions that might be questionable.

Right now, my first thought is that the lexographically least object

such that P(x) < 1/3 ^^^ 3 is ill-defined. (For those not familiar

with knuth's up-arrow notation, I recommend looking it up, just

because it is fun. Although, really, all you need to know is that

1/3^^^3 is a very small probability with a fairly short description.)

So, what plausible arguments make such an entity ill-defined?

Clearly it will not have a place on the Tarski hierarchy, since it

refers to all levels.

In a loopy theory, I'd think it would turn out to be one of those

things about which we must remain silent. (That is, a a loopy theory

like Kripke's theory of truth, but applied to probability as well...)

Of course, these strategies are (imo) not "referentially complete",

which is the whole point...

Overall, my feeling is that any treatment of the more standard

paradoxes tends to carry over to a treatment of this paradox.

--Abram

2009/5/28 Wei Dai <wei...@weidai.com>:

Glad to hear from you. I am struck anew by the simplicity of this

paradox every time I look at it.

For those on this list not familiar with the background material for

this paradox, it essentially relies on a formalization of Occam's

Razor. The basic assumption is that, in absence of evidence, the only

deciding factor between two theories is their length written down--

probability decreases as length increases. So, anything with a short

description length is fairly probable.

I've attempted to reply to this paradox before, but I only sort of

waved my hands and pointed to assumptions that might be questionable.

Right now, my first thought is that the lexographically least object

such that P(x) < 1/3 ^^^ 3 is ill-defined. (For those not familiar

with knuth's up-arrow notation, I recommend looking it up, just

because it is fun. Although, really, all you need to know is that

1/3^^^3 is a very small probability with a fairly short description.)

So, what plausible arguments make such an entity ill-defined?

Clearly it will not have a place on the Tarski hierarchy, since it

refers to all levels.

In a loopy theory, I'd think it would turn out to be one of those

things about which we must remain silent. (That is, a a loopy theory

like Kripke's theory of truth, but applied to probability as well...)

Of course, these strategies are (imo) not "referentially complete",

which is the whole point...

Overall, my feeling is that any treatment of the more standard

paradoxes tends to carry over to a treatment of this paradox.

--Abram

2009/5/28 Wei Dai <wei...@weidai.com>:

May 29, 2009, 9:20:42 PM5/29/09

to one-...@googlegroups.com

Abram Demski wrote:

> I've attempted to reply to this paradox before, but I only sort of

> waved my hands and pointed to assumptions that might be questionable.

> I've attempted to reply to this paradox before, but I only sort of

> waved my hands and pointed to assumptions that might be questionable.

Did you post your previous reply somewhere?

> Right now, my first thought is that the lexographically least object

> such that P(x) < 1/3 ^^^ 3 is ill-defined.

But that object is well-defined for any particular P that's well-defined,

isn't it?

> Clearly it will not have a place on the Tarski hierarchy, since it

> refers to all levels.

I might be out of my depth here, but it seems to me that the

lexicographically least object x such that P(x) < 1/3 ^^^ 3 belongs to at

most one level above P on the Tarski hierarchy. Does the trouble start if we

try to quantify over all well-defined P and conclude that a well-defined

universal distribution doesn't exist? If so, we're still left with an easy

way to shoot down any particular well-defined candidate someone might

propose for a universal distribution.

BTW, I've found some academic research into whether it's ok to quantify over

"indefinitely extensible totalities". See section 11 of

http://www.st-andrews.ac.uk/~arche/old/pages/papers/wright-shapiro%201.pdf

for example. I'm not sure if any of that helps here though.

May 30, 2009, 7:34:10 PM5/30/09

to one-...@googlegroups.com

Wei Dai,

> Did you post your previous reply somewhere?

I sent it to you, if I recall correctly.

All I did was question the assumption that description length should

be the only factor in determining probability. I still find this

objection plausible, but perhaps not the best objection. After all, I

do not take your paradox as *proof* that description length should not

be the only factor.

Does the paradox rely critically on an assumption that description

length should be the only factor? Or can you derive the paradox just

from the assumption that the prior should dominate other priors?

> Does the trouble start if we

> try to quantify over all well-defined P and conclude that a well-defined

> universal distribution doesn't exist?

That was my intention, yes.

> universal distribution doesn't exist? If so, we're still left with an easy

> way to shoot down any particular well-defined candidate someone might

> propose for a universal distribution.

Yes, sounds right. In such an approach the prior would fail to be

well-defined. The semantics of the logic would fail to be well-defined

according to itself, too, so that is not surprising.

So, what I am saying is that I do not agree with *current* solutions

to these problems, but for those solutions, it looks like everything

carries over just fine to your paradox. So it seems as if any *actual*

solution will have the same property.

On the other hand, a solution might still fall prey to your paradox,

but give a good story as to why the situation is acceptable. This

seems fairly plausible, too.

--Abram

> Did you post your previous reply somewhere?

All I did was question the assumption that description length should

be the only factor in determining probability. I still find this

objection plausible, but perhaps not the best objection. After all, I

do not take your paradox as *proof* that description length should not

be the only factor.

Does the paradox rely critically on an assumption that description

length should be the only factor? Or can you derive the paradox just

from the assumption that the prior should dominate other priors?

> Does the trouble start if we

> try to quantify over all well-defined P and conclude that a well-defined

> universal distribution doesn't exist?

> universal distribution doesn't exist? If so, we're still left with an easy

> way to shoot down any particular well-defined candidate someone might

> propose for a universal distribution.

well-defined. The semantics of the logic would fail to be well-defined

according to itself, too, so that is not surprising.

So, what I am saying is that I do not agree with *current* solutions

to these problems, but for those solutions, it looks like everything

carries over just fine to your paradox. So it seems as if any *actual*

solution will have the same property.

On the other hand, a solution might still fall prey to your paradox,

but give a good story as to why the situation is acceptable. This

seems fairly plausible, too.

--Abram

May 30, 2009, 10:46:06 PM5/30/09

to one-...@googlegroups.com

> I sent it to you, if I recall correctly.

Yes, I see that we had an email conversation on this topic last June. I

apologize for my poor memory.

> Does the paradox rely critically on an assumption that description

> length should be the only factor? Or can you derive the paradox just

> from the assumption that the prior should dominate other priors?

Ok, here's a version based on dominance: for any probability distribution P

definable at level L of the Tarski hierarchy, there exists a P' at level at

most L+1 which it doesn't dominate. Let P'(x) = 1/3^n if x is the

lexicographically least object such that P(x) < 1/3^^^n, otherwise P'(x) =

c*P(x), with c chosen so that P' sums to 1.

> So, what I am saying is that I do not agree with *current* solutions

> to these problems, but for those solutions, it looks like everything

> carries over just fine to your paradox. So it seems as if any *actual*

> solution will have the same property.

Yes, I'm hoping that's the case too, which is why I'm here. :)

> On the other hand, a solution might still fall prey to your paradox,

> but give a good story as to why the situation is acceptable. This

> seems fairly plausible, too.

I'm still thinking through implications of Eliezer's response, which seems

to say that the situation is acceptable. Does his response carry over to the

other paradoxes?

Jun 2, 2009, 9:58:40 AM6/2/09

to one-...@googlegroups.com

Wei,

Hmmm. With respect to inductive priors, I understand Eliezer's

response as saying that what *I'm* doing, in asking for a better

prior, is (at best) only making *more specific* requirements on a

prior-- since it will be computably approximated, there is no way it

can strictly dominate the solomonoff prior. In other words: any

description-length-based prior over some computable deduction system

*is* a solomonoff prior, if the deduction system is turing-complete;

but I'm hoping it may also have some additional desirable properties.

(Such properties include explicit recognition of a halting oracle in

the environment, if there is sufficient evidence for it.)

How would Eliezer's response carry over to other paradoxes?

"First-order logic does not necessarily assume that the domain of

discourse is computable, but rather that the logic must be run on a

computer." Something like that?

--Abram

Hmmm. With respect to inductive priors, I understand Eliezer's

response as saying that what *I'm* doing, in asking for a better

prior, is (at best) only making *more specific* requirements on a

prior-- since it will be computably approximated, there is no way it

can strictly dominate the solomonoff prior. In other words: any

description-length-based prior over some computable deduction system

*is* a solomonoff prior, if the deduction system is turing-complete;

but I'm hoping it may also have some additional desirable properties.

(Such properties include explicit recognition of a halting oracle in

the environment, if there is sufficient evidence for it.)

How would Eliezer's response carry over to other paradoxes?

"First-order logic does not necessarily assume that the domain of

discourse is computable, but rather that the logic must be run on a

computer." Something like that?

--Abram

Jun 10, 2009, 2:05:42 AM6/10/09

to one-...@googlegroups.com

> (Such properties include explicit recognition of a halting oracle in

> the environment, if there is sufficient evidence for it.)

> the environment, if there is sufficient evidence for it.)

I have found some discussion about how human beings might be able to

recognize a halting oracle:

http://www.cs.nyu.edu/pipermail/fom/2004-March/008010.html

> "First-order logic does not necessarily assume that the domain of

> discourse is computable, but rather that the logic must be run on a

> computer." Something like that?

But what's the counterpart of Solomonoff's Universal Distribution in this

analogy? How about this instead: "Consider a language L that contains all

truths that can be proved by computable minds. We know that there are truths

that L does not contain, but we can never prove any of them so there is no

point trying to extend L any further (assuming we are computable)." Can we

define this L formally?

Jun 10, 2009, 2:58:15 PM6/10/09

to one-...@googlegroups.com

Wei Dai,

This proposal seems wrong to me. A language that contains all truths

that can be proved by computable logics will simply contain all truths

that can be stated finitely! For each truth, there is a computable

logic that just spits out that one statement. So, your logic would

just contain all true finite sentences. Practically, what this means

is, we can always usefully extend a computable logic by adding more

true axioms. (The trouble, of course, is knowing which axioms are

true.)

For me, this is an indication that finding the widest set of knowable

mathematical truths isn't the real problem (though it is still an

interesting problem). The more basic question is what the broadest set

of meaningful mathematical *statements* is. IE, what is the universal

language? Set theory? Kripke's theory of truth? Etc.

--Abram

This proposal seems wrong to me. A language that contains all truths

that can be proved by computable logics will simply contain all truths

that can be stated finitely! For each truth, there is a computable

logic that just spits out that one statement. So, your logic would

just contain all true finite sentences. Practically, what this means

is, we can always usefully extend a computable logic by adding more

true axioms. (The trouble, of course, is knowing which axioms are

true.)

For me, this is an indication that finding the widest set of knowable

mathematical truths isn't the real problem (though it is still an

interesting problem). The more basic question is what the broadest set

of meaningful mathematical *statements* is. IE, what is the universal

language? Set theory? Kripke's theory of truth? Etc.

--Abram

Jun 10, 2009, 11:34:15 PM6/10/09

to One Logic

Eliezer Yudkowsky wrote:

> Solomonoff's universal distribution isn't trying to assume the

> universe is computable. It's trying to beat any computable

> probability distribution you could put over the universe. In other

> words, if your attempt to *calculate*, to *talk about* and *reason

> about* the universe is computable - even if you are talking *about*

> systems that you think are uncomputable - then your ratiocinations

> appear somewhere in Solomonoff induction.

> Solomonoff's universal distribution isn't trying to assume the

> universe is computable. It's trying to beat any computable

> probability distribution you could put over the universe. In other

> words, if your attempt to *calculate*, to *talk about* and *reason

> about* the universe is computable - even if you are talking *about*

> systems that you think are uncomputable - then your ratiocinations

> appear somewhere in Solomonoff induction.

Damn my poor memory again. I just remembered that I already came up with an

example where a human being can beat Solomonoff's universal distribution.

It's at http://www.overcomingbias.com/2008/11/complexity-and/. But Eliezer

wasn't convinced that the game was fair.

Let me try a different example here. Consider a hyperintelligent being (i.e.

someone who has access to a hypercomputer) who wants to find a

(hypercomputable) input sequence and a computable algorithm that will

humiliate Solomonoff's universal distribution: the computable algorithm will

predict the sequence better than the UD, by an infinite margin. The game is,

in each round each agent will output a single prediction for the next

symbol. UD loses the game if the computable algorithm's number of correct

predictions minus its number of correct predictions goes to positive

infinity. Does this seem fair?

To make things simple, the HI chooses the computable algorithm to be "always

predict 0". And it chooses the input sequence by simulating the UD-based

predictor: if the predictor predicts 0 as the next symbol, then it chooses

1, otherwise it chooses 0. Now consider what happens. The computable

algorithm gains a point whenever the UD predicts a 1. And that will occur an

infinite number of times, because every time it predicts a 0 it will see a

1, and it will eventually predict a 1 after seeing enough 1s. (I can try to

prove this if it's not intuitively obvious to you.)

So, Solomonoff's universal distribution is guaranteed to beat (technically,

not lose by more than a finite margin) any computable distribution, *if the

input distribution is computable*, otherwise it can be beaten by a

computable distribution. Maybe this explains why we failed to "carry over"

Eliezer's response to the other paradoxes.

Jun 11, 2009, 12:51:49 AM6/11/09

to One Logic

On Jun 10, 8:34 pm, "Wei Dai" <wei...@weidai.com> wrote:

>

> Let me try a different example here. Consider a hyperintelligent being (i.e.

> someone who has access to a hypercomputer) who wants to find a

> (hypercomputable) input sequence and a computable algorithm that will

> humiliate Solomonoff's universal distribution: the computable algorithm will

> predict the sequence better than the UD, by an infinite margin. The game is,

> in each round each agent will output a single prediction for the next

> symbol. UD loses the game if the computable algorithm's number of correct

> predictions minus its number of correct predictions goes to positive

> infinity. Does this seem fair?

Actually, no. I believe the proofs are over probabilities, and the
>

> Let me try a different example here. Consider a hyperintelligent being (i.e.

> someone who has access to a hypercomputer) who wants to find a

> (hypercomputable) input sequence and a computable algorithm that will

> humiliate Solomonoff's universal distribution: the computable algorithm will

> predict the sequence better than the UD, by an infinite margin. The game is,

> in each round each agent will output a single prediction for the next

> symbol. UD loses the game if the computable algorithm's number of correct

> predictions minus its number of correct predictions goes to positive

> infinity. Does this seem fair?

most elegant proofs are over accumulated log probabilities (Bayes-

score). So what we want to do is establish that in the limit, and

regardless of the environment, whether the environment is computable

or not, there is no computable algorithm whose accumulated log-score

beats the Solomonoff distribution - an (uncomputable) weighted mixture

of all computable mixtures. And then the proof seems relatively

trivial, since any computable algorithm has a finite weight in this

mixture. Indeed the difference in their scores is bounded by (the log

of) the initial finite weight, unless I've missed something.

So it's possible that you could put forth some sort of theorem about a

computable decision process being defeated by an uncomputable

environment. In fact this is quite easy - let a sequence of ladies

and tigers always be placed by the environment to make a decision

process perfectly wrong. Then it will be dominated by *any* other

process that is not perfectly identical to it, including, probably,

"always choose the door on the right", and so on.

But *epistemically* speaking, Solomonoff induction cannot do

infinitely worse than anything computable. In the lady and the tiger

case, I expect it would assign probabilities closer and closer to

exactly 50%, so that despite being "wrong" on every occasion - despite

assigning greater than 50% probability to the wrong answer on each

occasion - the margin of greaterness would become so tiny that it

would lose by only a finite amount to an algorithm that treated the

ladies and tigers as a fair coin. (And of course there is no

computable algorithm that actually makes the correct predictions!)

Jun 11, 2009, 5:31:15 PM6/11/09

to One Logic

Eliezer Yudkowsky wrote:

> Actually, no. I believe the proofs are over probabilities, and the

> most elegant proofs are over accumulated log probabilities (Bayes-

> score).

> Actually, no. I believe the proofs are over probabilities, and the

> most elegant proofs are over accumulated log probabilities (Bayes-

> score).

Solomonoff's original proof was over probabilities, but other proofs are

over the number of errors. From http://www.hutter1.net/ai/perrbnd.pdf:

Another natural question is to ask for relations between the total number of

expected

errors E_ξ in Solomonoff prediction and the total number of prediction

errors Eμ in the

informed scheme. Unfortunately [9] does not bound E_ξ in terms of Eμ in a

satisfactory

way. For example it does not exclude the possibility of an infinite E_ξ

even if Eμ is finite.

Here we want to prove upper bounds to E_ξ in terms of Eμ ensuring as a

corollary that

the above case cannot happen.

> So what we want to do is establish that in the limit, and

> regardless of the environment, whether the environment is computable

> or not, there is no computable algorithm whose accumulated log-score

> beats the Solomonoff distribution - an (uncomputable) weighted mixture

> of all computable mixtures. And then the proof seems relatively

> trivial, since any computable algorithm has a finite weight in this

> mixture. Indeed the difference in their scores is bounded by (the log

> of) the initial finite weight, unless I've missed something.

I think you're right, but only if the computable algorithm isn't allowed to

use noncomputable notation to represent its beliefs. Suppose a human

believes that the next symbol is 0 with probability 1/BB(100), and the

payoff for guessing 0 correctly is BB(101), then he can decide to bet on 0

without actually being able to compute BB(100). So I see no reason why such

beliefs should not be allowed. But if you do allow such beliefs, then the

Solomonoff distribution is no longer guaranteed to be the best

epistemically.

So, given that the Solomonoff distribution is not guaranteed to be best

either epistemically or in decision making, there seems to be no reason to

stop looking for better alternatives.

Jun 11, 2009, 8:23:00 PM6/11/09

to One Logic

On Jun 11, 2:31 pm, "Wei Dai" <wei...@weidai.com> wrote:

> Eliezer Yudkowsky wrote:

> > Actually, no. I believe the proofs are over probabilities, and the

> > most elegant proofs are over accumulated log probabilities (Bayes-

> > score).

>

> Solomonoff's original proof was over probabilities, but other proofs are

> over the number of errors.

Hey, well, more the worse for them, I'd say. If you're dealing with
> Eliezer Yudkowsky wrote:

> > Actually, no. I believe the proofs are over probabilities, and the

> > most elegant proofs are over accumulated log probabilities (Bayes-

> > score).

>

> Solomonoff's original proof was over probabilities, but other proofs are

> over the number of errors.

non-exceptional situations - non-devilish environments - then

shouldn't a proof of *epistemic* error-boundedness generally carry

over to a proof of decision error-boundedness? In other words, are

you sure you're not assuming that the environment is a

superintelligent adversary which is strictly more superintelligent

than you, which is the sort of reasoning that leads people to adopt

randomized algorithms? In general, even though I see epistemic

rationality as ultimately deriving from instrumental rationality, I

think that epistemics are simpler than decisions, and so I would trust

more a proof that began with epistemic correctness and moved from

there to instrumental correctness.

> > So what we want to do is establish that in the limit, and

> > regardless of the environment, whether the environment is computable

> > or not, there is no computable algorithm whose accumulated log-score

> > beats the Solomonoff distribution - an (uncomputable) weighted mixture

> > of all computable mixtures. And then the proof seems relatively

> > trivial, since any computable algorithm has a finite weight in this

> > mixture. Indeed the difference in their scores is bounded by (the log

> > of) the initial finite weight, unless I've missed something.

>

> I think you're right, but only if the computable algorithm isn't allowed to

> use noncomputable notation to represent its beliefs. Suppose a human

> believes that the next symbol is 0 with probability 1/BB(100), and the

> payoff for guessing 0 correctly is BB(101), then he can decide to bet on 0

> without actually being able to compute BB(100). So I see no reason why such

> beliefs should not be allowed. But if you do allow such beliefs, then the

> Solomonoff distribution is no longer guaranteed to be the best

> epistemically.

contains all computable and consistent probability mixtures. Is the

idea that there are concrete decision problems where we can do better

by admitting computable yet inconsistent probability mixtures? This,

to say the least, is a non-obvious idea, and it needs to be cashed out

a bit more clearly in terms of the admissible language of the problem

and its solution.

Any concrete sequence of sensory data can be predicted by Solomonoff

as well as by any computable consistent mixture, up to a bounded total

probabilistic error. So what sort of decision problem are we talking

about, with a computable decision algorithm, which goes outside this

formalism, in a way that enables it to exhibit decisive instrumental

superiority despite its lack of decisive epistemic superiority? Is it

only pathological cases with adversarial superior superintelligences

that can would be expected to produce an infinite series of lady-and-

tiger bad bets, or is there some more natural case?

> So, given that the Solomonoff distribution is not guaranteed to be best

> either epistemically or in decision making, there seems to be no reason to

> stop looking for better alternatives.

here than I am in my blog posts - be sure to let me know.

Jun 11, 2009, 9:10:53 PM6/11/09

to one-...@googlegroups.com

Eliezer, Wei Dai,

I am thinking that a better way of interpreting Wei Dai is to add

*incompletely specified* probability distributions, not inconsistent

ones. So, for example, a probability distribution that is described in

terms of the busy beaver function would be incompletely specified,

because the logic we're using will only be able to prove some facts

about it.

In Wei Dai's example, the epistemology and the decision theory

interact in an interesting way-- so that although we cannot compute

the utility, we know something about how it compares to the

probability.

It is hard to see exactly how this would lead to action, however,

since a specific utility value is not known.

--Abram

I am thinking that a better way of interpreting Wei Dai is to add

*incompletely specified* probability distributions, not inconsistent

ones. So, for example, a probability distribution that is described in

terms of the busy beaver function would be incompletely specified,

because the logic we're using will only be able to prove some facts

about it.

In Wei Dai's example, the epistemology and the decision theory

interact in an interesting way-- so that although we cannot compute

the utility, we know something about how it compares to the

probability.

It is hard to see exactly how this would lead to action, however,

since a specific utility value is not known.

--Abram

Jun 11, 2009, 11:00:48 PM6/11/09

to one-...@googlegroups.com

To restate my idea, a computable decision process isn't necessarily limited

to using a computable probability distribution. Suppose he has a formally

specified, consistent probability function that isn't computable. He may be

able to prove that EU(A) > EU(B) even if he can't compute EU(A) and EU(B),

where A and B are choices that he faces. (This can be the case whether or

not his utility function is computable.) This way, he can possibly make

better decisions than an agent using Solomonoff's distribution. If you ask

him to give you a prediction about something as a probability, he might have

to give you the probability in a non-algorithmic notation, like 1/BB(100).

Internally, he is also reasoning about probabilities and expected utilities

in these non-algorithmic forms, so it seems reasonable to say that he in

fact has P(0) = 1/BB(100) as an epistemic belief.

to using a computable probability distribution. Suppose he has a formally

specified, consistent probability function that isn't computable. He may be

able to prove that EU(A) > EU(B) even if he can't compute EU(A) and EU(B),

where A and B are choices that he faces. (This can be the case whether or

not his utility function is computable.) This way, he can possibly make

better decisions than an agent using Solomonoff's distribution. If you ask

him to give you a prediction about something as a probability, he might have

to give you the probability in a non-algorithmic notation, like 1/BB(100).

Internally, he is also reasoning about probabilities and expected utilities

in these non-algorithmic forms, so it seems reasonable to say that he in

fact has P(0) = 1/BB(100) as an epistemic belief.

Does that clear it up?

(Of course in general he's likely to have choices C and D such that he can't

prove either EU(C) > EU(D) or EU(D) > EU(C). How to make decisions in cases

like that would depend on a theory of how to handle mathematical

uncertainty, which I don't have...)

Eliezer asks for more natural examples. That's reasonable and something I'll

think about.

Jun 12, 2009, 3:02:23 PM6/12/09

to one-...@googlegroups.com

I've come up with a more natural example of a decision problem where

Solomonoff's distribution does infinitely worse than a human/computable

agent. It doesn't depend on an adversarial superintelligence. Consider the

sequence of unary encodings of BB(1), BB(2), BB(4), ... again. It occurs to

me that Solomonoff's distribution actually makes two kinds of mistakes. The

one I stated at http://www.overcomingbias.com/2008/11/complexity-and/ was

that after seeing 1s long enough, it eventually predicts a probability for 0

that's too low. The other mistake is, for a quite a while after seeing a 0,

it predicts a probability for 0 that's too high. It's easier to construct an

example using the latter error instead of the former.

Solomonoff's distribution does infinitely worse than a human/computable

agent. It doesn't depend on an adversarial superintelligence. Consider the

sequence of unary encodings of BB(1), BB(2), BB(4), ... again. It occurs to

me that Solomonoff's distribution actually makes two kinds of mistakes. The

one I stated at http://www.overcomingbias.com/2008/11/complexity-and/ was

that after seeing 1s long enough, it eventually predicts a probability for 0

that's too low. The other mistake is, for a quite a while after seeing a 0,

it predicts a probability for 0 that's too high. It's easier to construct an

example using the latter error instead of the former.

So consider a game where in each round the player is asked to bet on 0 or 1.

The payoff for 0 in round i is equal to 1 minus the i-th bit of the sequence

of unary encodings of BB(1), BB(2), BB(4), ..., and the payoff for 1 is

always 1/2^100.

It seems clear that for a human, except for the first few rounds where he

has a reasonable chance of guessing BB(2^n) correctly, the best strategy is

to always bet on 1. Consider the situation after he sees BB(2^n). He knows

that if he keeps betting on 0, by the time he gets a payoff of 1, he could

have gotten a total payoff of BB(2^(n+1))/2^100 by betting on 1 instead.

The uncomputable probability distribution that the human is using here is

P_i(1) = i-th bit of BB(1), BB(2), BB(4), ... (This is different from what I

suggested before.) He doesn't know how to compute this, obviously, nor the

expected utility of betting on 0 in any particular round, but he can prove

that the strategy "always bet on 1" has a higher expected payoff than

"always bet on 0" and it is in fact better than any other computable

strategy.

What about the agent using Solomonoff's distribution? After seeing

BB(1),...,BB(2^n), the algorithmic complexity of BB(1),...,BB(2^n) is sunk,

so to speak. It will predict a higher expected payoff for playing 0 in any

round i where the conditional complexity K(i | BB(1),...,BB(2^n)) < 100.

This includes for example 2*BB(2^n), 2*BB(2^n)+1, BB(2^n)^2 * 3 + 4,

BB(2^n)^^^3, etc. It will bet on 0 in these rounds (erroneously, since

K(BB(2^(n+1)) | BB(2^n)) > 100 for large n), and therefore lose relative to

a human.

Is this example more convincing?

Oh, in case there's objection that the frequency of Solomonoff's

distribution losing goes to 0 too quickly in this game, we can replace the

sequence BB(2^n) by some other mathematically well-defined sequence S(n)

such that K(S(n+1) | S(1), ..., S(n)) > 100, but S(n) doesn't increase so

quickly. For example, let S(n+1) be the smallest integer such that K(S(n+1)

| S(1), ..., S(n)) > 100. This sequence increases slowly, since S(n) + a

random 101-bit number has conditional complexity more than 100, so S(n+1)

can't be bigger than S(n)+2^101.

Jun 19, 2009, 3:26:00 AM6/19/09

to one-...@googlegroups.com

I wrote earlier:

> To restate my idea, a computable decision process isn't necessarily

> limited

> to using a computable probability distribution. Suppose he has a formally

> specified, consistent probability function that isn't computable. He may

> be

> able to prove that EU(A) > EU(B) even if he can't compute EU(A) and EU(B),

> where A and B are choices that he faces.

> To restate my idea, a computable decision process isn't necessarily

> limited

> to using a computable probability distribution. Suppose he has a formally

> specified, consistent probability function that isn't computable. He may

> be

> able to prove that EU(A) > EU(B) even if he can't compute EU(A) and EU(B),

> where A and B are choices that he faces.

I recalled that Jürgen Schmidhuber's Gödel Machine already contains ideas

that are very similar to the above. See specifically question 4, page 22 in

ftp://ftp.idsia.ch/pub/juergen/gm6.pdf. So I'll recap how my position

differs from his:

1. A Gödel machine starts with an initial strategy and tries to find

provable improvements to it. Since humans can handle mathematical

uncertainty, I think machines should be able to as well (even if we don't

know how yet). Therefore I suggest that a machine should look for *likely*

improvements that are not necessarily provable.

2. A Gödel machine has fixed "environment axioms", essentially a

generalization of the Bayesian prior. Schmidhuber writes "All such prior

assumptions are perfectly formalizable in an appropriate A (otherwise we

could not write scientific papers about them)." But as I pointed out earlier

in this thread, we don't know how to formalize Occam's Razor, if we

interpreted it as something like "the environment likely has a short

mathematical description".

Jun 19, 2009, 12:45:16 PM6/19/09

to one-...@googlegroups.com

Wei Dai,

Yesterday, I wrote up some thoughts on why uncomputable models might

be useful in everyday life:

http://dragonlogic-ai.blogspot.com/2009/06/importance-of-uncomputable-models-i.html

--Abram

Yesterday, I wrote up some thoughts on why uncomputable models might

be useful in everyday life:

http://dragonlogic-ai.blogspot.com/2009/06/importance-of-uncomputable-models-i.html

--Abram

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu