# Re: [MAGIC-list] Choice of the reference model [was: AIXI limitations]

96 views

### Laurent

Aug 10, 2011, 11:48:33 AM8/10/11
Hi all,

For the record, my first email was:
Up to now, here are my main concerns with AIXI:
• Reference Turing machine
• There is currently no best choice, but it seems intuitive that small TMs are better than any random TM
• Reinforcement Learning
•  RL as a utility function is very convenient for simplicity, but prone to self-delusion. Knowledge-seeking is better but it prefers true random number generators (not generated by itself). I'm currently thinking about a utility function that is like knowledge-seeking but, roughly, does not like things that are too complex, or more precisely, it doesn't like things that are predictably random.
• Incomputability
• Time dimension not taken into account. That is the current most important problem. If we can solve that in an optimal way, we will have real AGI. I'm also working on things in this corner, but right now I'd prefer to keep my ideas for myself :)
• Things I don't understand yet...

On Wed, Aug 10, 2011 at 17:22, Tom Everitt wrote:

Regarding limitations of the AIXI, I'd like to second Laurent on the point that the completely free choice of reference machine is a bit unsatisfying.

If I haven't missed anything, this means, among other things, that if we have two strings:

a = 111 b = 100110101010010111010101010101 001111010101

we can not objectively say that a is simpler than b, since according to some obscure languages (reference machines) b will actually be simpler than a.

Exactly.
But intuitively, a being simpler than b makes a lot more sense. Why?
In most programming languages, a would be far simpler than b, although both may turn out to be coded like "print 'aaa'"...

Now consider:
a = 111
b = 10101010

Which is simpler?
The question is more difficult, but it still seem plausible that a is simpler.
Even harder :
a = 111111...
b = 101010...

a still looks simpler, but if I had to bet my arm on it, I wouldn't risk it for less than ten million euros (not sure I would even risk it, my arm is quite useful).

As Laurent already said, shorter reference machine makes more sense intuitively. But appealing to intuition here is clearly not so much better than motivating a cognitive architecture intuitively - it works in practice, but is mathematically unsatisfying. Also, can one really say that one reference machine is shorter/simpler than another? doesn't that also depend on the language in which one describes the reference machines?

Yes absolutely.
*If* we choose a "reference class of models" (e.g., Turing machines), then we can choose the smallest reference machine.

I thought about another option though, just a while back:
Instead (but still considering the TMs class), why not choose the reference UTM that orders TMs in the exact same order as the "natural" order, which is to grow TMs depending on the number of states, of transitions, and some lexicographical order?
(my first problem then was that there was too many TMs of same complexity)

Now, there still also remains the problem of the reference class of models. Why should we go for TMs? Why not something else? How to decide?
TMs have a nice, very simple formulation given our human knowledge. In the end, does it all boils down to our own world? Should we choose the simplest formulation given the axioms of our world? Is that even possible?

This bothers me a lot. Does anyone know of any attempts to resolve this issue? It must at least have been attempted I feel.

Hutter pointed me to a failed attempt by Müller, IIRC. I don't have the exact reference on this computer.
The idea was to simulate TMs on other TMs, and do that in loop, to try to find some fixed point, or something in the genre, but that did not work.

Also, Hutter has some discussions on these matters in his 2005 book and in :
"A Philosophical Treatise of Universal Induction"
http://www.hutter1.net/official/bib.htm#uiphil

On last resort, we can set for a consensus on a  "intuitive" best choice, if we can prove that finding the best model is not feasible. But if we can avoid that, I'd be happier.

Laurent

### Tom Everitt

Aug 10, 2011, 1:37:14 PM8/10/11
Hi, thanks a lot for your input!

On Wed, Aug 10, 2011 at 5:48 PM, Laurent wrote:

As Laurent already said, shorter reference machine makes more sense intuitively. But appealing to intuition here is clearly not so much better than motivating a cognitive architecture intuitively - it works in practice, but is mathematically unsatisfying. Also, can one really say that one reference machine is shorter/simpler than another? doesn't that also depend on the language in which one describes the reference machines?

Yes absolutely.
*If* we choose a "reference class of models" (e.g., Turing machines), then we can choose the smallest reference machine.

I thought about another option though, just a while back:
Instead (but still considering the TMs class), why not choose the reference UTM that orders TMs in the exact same order as the "natural" order, which is to grow TMs depending on the number of states, of transitions, and some lexicographical order?
(my first problem then was that there was too many TMs of same complexity)

Now, there still also remains the problem of the reference class of models. Why should we go for TMs? Why not something else? How to decide?
TMs have a nice, very simple formulation given our human knowledge. In the end, does it all boils down to our own world? Should we choose the simplest formulation given the axioms of our world? Is that even possible?

This is good news. I hadn't actually realized that the structure of the TM is objective, and that some structures are arguably simpler than other. The smallest UTM found seems to have 22 states, but it doesn't seem to be proven minimal. http://en.wikipedia.org/wiki/Universal_Turing_machine#Smallest_machines

This bothers me a lot. Does anyone know of any attempts to resolve this issue? It must at least have been attempted I feel.

Hutter pointed me to a failed attempt by Müller, IIRC. I don't have the exact reference on this computer.
The idea was to simulate TMs on other TMs, and do that in loop, to try to find some fixed point, or something in the genre, but that did not work.

This is probably the article you mean http://arxiv.org/abs/cs/0608095. It sounds interesting, I will definitely read it. Unfortunately it seems that they conclude it is not really possible to find a simplest, universal computer:

"Moreover, we show that the reason for failure has a clear and interesting physical interpretation, suggesting that every other conceivable attempt to get rid of those additive constants must fail in principle, too."

Also, Hutter has some discussions on these matters in his 2005 book and in :
"A Philosophical Treatise of Universal Induction"
http://www.hutter1.net/official/bib.htm#uiphil

Looks interesting, too.

On last resort, we can set for a consensus on a  "intuitive" best choice, if we can prove that finding the best model is not feasible. But if we can avoid that, I'd be happier.

Yes, as I see it now, there are two "possible" paths to finding the ultimate reference machine. One is to try to order the UTM's "mathematically", but perhaps in a different way than Muller tried. The other, more philosophical approach, would be to find some structure for "turing-complete computing structures" in general, and apply a similar argument to the one you gave above on Turing machines.

But I'm definitely going to read Muller's argument about that this hole question is "hopeless", if he's convincing I guess we'll have to resort to that intuitive best choice of yours.

Tom

### Laurent

Aug 10, 2011, 2:03:36 PM8/10/11
Now, there still also remains the problem of the reference class of models. Why should we go for TMs? Why not something else? How to decide?
TMs have a nice, very simple formulation given our human knowledge. In the end, does it all boils down to our own world? Should we choose the simplest formulation given the axioms of our world? Is that even possible?

This is good news. I hadn't actually realized that the structure of the TM is objective, and that some structures are arguably simpler than other. The smallest UTM found seems to have 22 states, but it doesn't seem to be proven minimal. http://en.wikipedia.org/wiki/Universal_Turing_machine#Smallest_machines

Wolfram's 2,3 TM is even smaller, but the computation model is a bit limited, it seems.

This bothers me a lot. Does anyone know of any attempts to resolve this issue? It must at least have been attempted I feel.

Hutter pointed me to a failed attempt by Müller, IIRC. I don't have the exact reference on this computer.
The idea was to simulate TMs on other TMs, and do that in loop, to try to find some fixed point, or something in the genre, but that did not work.

This is probably the article you mean http://arxiv.org/abs/cs/0608095. It sounds interesting, I will definitely read it.

This is exactly the one, I'm impressed you found it!

Unfortunately it seems that they conclude it is not really possible to find a simplest, universal computer:

"Moreover, we show that the reason for failure has a clear and interesting physical interpretation, suggesting that every other conceivable attempt to get rid of those additive constants must fail in principle, too."

I would not be so categorical myself, but who knows.

On last resort, we can set for a consensus on a  "intuitive" best choice, if we can prove that finding the best model is not feasible. But if we can avoid that, I'd be happier.

Yes, as I see it now, there are two "possible" paths to finding the ultimate reference machine. One is to try to order the UTM's "mathematically", but perhaps in a different way than Muller tried. The other, more philosophical approach, would be to find some structure for "turing-complete computing structures" in general, and apply a similar argument to the one you gave above on Turing machines.

Yes. Actually, I'd like very much a computation model that has many symmetries. I'd call that the "spherical model of computation". Now that we have the name, we just need to define the model, that should be easy ;)

Laurent

### Tom Everitt

Aug 10, 2011, 2:27:16 PM8/10/11

Hehehe, I hope you're right :) Seriously though, I totally agree about that the philosophical solution would be the nicest by far. The ultimate insight on what computation is... I wonder if it's possible. I'll let you know if I get anywhere near an idea on what that would look like, and I don't think I need to tell you that I'm all ears if you come up with something :)

### Laurent

Aug 10, 2011, 2:37:48 PM8/10/11
Hehehe, I hope you're right :) Seriously though, I totally agree about that the philosophical solution would be the nicest by far. The ultimate insight on what computation is... I wonder if it's possible.

Maybe the smallest possible Turing-complete (save for infinite memory) computers in our universe?
The problem is that they don't look so "simple"...  But maybe that's because of our biased macro-view human perspective.

Laurent

### Abram Demski

Aug 10, 2011, 2:41:16 PM8/10/11
FYI, Tim Tyler is also interested in this question.

I think we can say that, in principle, the best UTM would be the one
which best predicts the environment of the agent. In other words, the
UTM with the most built-in knowledge. The update process is a descent
towards minimal error; in other words, inductive learning can be seen
as a search for a better prior. The posterior of a Bayesian update can
always be treated as a new prior for the next Bayesian update.

The update to a universal distribution remains universal, so we can
even regard the result as a new universal Turing machine, although it
is not explicitly represented as such.

So, I strongly sympathize with the idea that there's no single best
prior. However, that doesn't mean we can't find nice desirable
properties in the directions you guys are discussing. The equivalence
class "Turing Complete" is a very weak one, which doesn't require very
many properties to be preserved. Different models of computation can
change the complexity classes of specific problems, for example.

One property I mentioned to Tom is the ratio of always-meaningful
programs (total functions) to sometimes-meaningless ones (partial
functions). IE, we ask how easy it is to get into an infinite loop
which produces no output. This number has been called "Omega".

It's not possible to have a language which is always-meaningful while
being Turing complete, so we can work at this from "both
directions"... augmenting always-meaningful languages to increase
expressiveness, and partially restricting Turing-complete languages to
be more meaningful. However, there is no perfect middle point-- we can
only get closer in each direction.

--Abram

### Laurent

Aug 10, 2011, 5:08:49 PM8/10/11
On Wed, Aug 10, 2011 at 20:41, Abram Demski wrote:
FYI, Tim Tyler is also interested in this question.

He's just been invited.

I think we can say that, in principle, the best UTM would be the one
which best predicts the environment of the agent. In other words, the
UTM with the most built-in knowledge. The update process is a descent
towards minimal error; in other words, inductive learning can be seen
as a search for a better prior. The posterior of a Bayesian update can
always be treated as a new prior for the next Bayesian update.

The update to a universal distribution remains universal, so we can
even regard the result as a new universal Turing machine, although it
is not explicitly represented as such.

You're right that if we go down the road of quantum physics, we should simply put as much knowledge as possible, to tailor the agent as much as possible toward the real-world.

Let's assume for a moment that we don't have knowledge about the real-world.
The question of the ultimate prior is then: Is there a UTM that is better than any other?
I.e.: Is there a universal notion of simplicity?

This looks somewhat equivalent to: "Is the definition of Turing machines simple because they are tailored to our own world, or is there an elegant mathematical truth behind that?"

Laurent

### Tim Tyler

Aug 10, 2011, 6:14:54 PM8/10/11
to MAGIC
Hi, yes - I do have a page about the issue: http://matchingpennies.com/the_one_true_razor/

I also think it is not a huge deal. The "just use FORTRAN-77" is
reasonable. If you want to give your
machine a head start in the real world, you tell it all the things you
know - e.g. audio/video codecs.
--
__________
|im |yler http://timtyler.org/ t...@tt1lock.org Remove lock to

### Tom Everitt

Aug 11, 2011, 11:19:28 AM8/11/11

On Thu, Aug 11, 2011 at 12:14 AM, Tim Tyler wrote:
Hi, yes - I do have a page about the issue: http://matchingpennies.com/the_one_true_razor/

I also think it is not a huge deal.  The "just use FORTRAN-77" is
reasonable.  If you want to give your
machine a head start in the real world, you tell it all the things you
know - e.g. audio/video codecs.

So a question to you "no one true razor/prior" proponents:

If we have

a = 101,
b = 101101001000001010100101111111111010

do you not share the intuition that a is, in some sense, objectively simpler than b?

Another thing is that priors does matter, since your first prior will always affect your later posteriors... And to choose it so it fits the world well is not much of a solution either, since we want a general solution, no? But I do agree with Abrams point on that it makes sense to look for relatively "meaningful" languages.

On Wed, Aug 10, 2011 at 8:37 PM, Laurent wrote:

Maybe the smallest possible Turing-complete (save for infinite memory) computers in our universe?
The problem is that they don't look so "simple"...  But maybe that's because of our biased macro-view human perspective.

I don't want to say you're wrong, but I guess it would take a pretty argument on why to prefer them to anything else. As Abram points out, the class of turing-complete structures is quite diverse, and even though we can understand any member of it as a set of states and rules (like a Turing-machine), I guess we can also understand any Turing-machine through any of these other structures. But hopefully either simplicity is preserved between structures, or there is one "true" structure, the "spherical model of computation". I like the name :)

### Tim Tyler

Aug 11, 2011, 4:08:58 PM8/11/11
to MAGIC
On Aug 11, 4:19 pm, Tom Everitt <tom4ever...@gmail.com> wrote:

> So a question to you "no one true razor/prior" proponents:
>
> If we have
>
> a = 101,
> b = 101101001000001010100101111111111010
>
> do you not share the intuition that a is, in some sense,
> objectively simpler than b?

Sure, but if you consider:
a = 01010010101010010011010110101101011
...and...
b = 101101001000001010100101111111111010
...then that intuition is not so clear.

### Tom Everitt

Aug 11, 2011, 4:45:07 PM8/11/11
Sure, but my point is:

If the choice of reference machine can be made completely arbitrarily, then in some choices a will be simpler than b, and in other choices b will be simpler than a. No matter what a and b happens to be (as long as they are finite strings).

So to say that there is no "right" choice of reference machine (or class of reference machines), must therefore imply that a=1 is not (can not be!) objectively simpler than b=10101100111111100100100000110101.

Do you see my point?

### Wén Shào

Aug 11, 2011, 7:04:53 PM8/11/11
I agree with Tom Everitt and I had the same problem before, namely, for every string(finite length) \$x\$ there is an additively optimal recursive function \$f\$ such that the Kolmogorov complexity with respect to that particular \$f\$ is 0. I think this problem stems from the arbitrariness in enumeration of Turing machines, the example given by Tom best illustrates this problem. Consider an arbitrary enumeration of Turing machines, a Turing machine TMb that has a very complex string (like b= 10101100111111100100100000110101) "built-in" which happens to appear in the top several, then the complexity of string b with respect to that enumeration, though intuitively large, is low. This arbitrariness doesn't always matter, for example when we consider randomness of an infinite sequence, the additive constant can never be so overwhelming. Nevertheless, when we consider finite strings, it's indeed an annoying issue. There are basically two solutions to this problem as far as I know,

(1) To use some predetermined and universally agreed-upon reference machines, that is to use natural machines.
(2) A more mathematically formal solution is given in LV08 ~p.111.

Just my thought.
--

Wén

### Tom Everitt

Aug 11, 2011, 7:36:41 PM8/11/11
Thank you Wén Shào, that was most helpful! Took me a little while to realize LV08 referred to Li & Vitanyi - An Introduction to Kolmogorov Complexity, but then I found it. Now I just need some time to digest the argument, but it looks really promising.

### Tom Everitt

Aug 22, 2011, 9:33:40 AM8/22/11
Hi again, I finally had time to read Li & Vitanyi a bit more closely.

They start out promising:

"The complexity C(x) is invariant only up to a constant depending on the reference function phi_0. Thus, one may object, for every string x there is an additively optimal recursive function psi_0 such that C_psi_0(x) = 0 [psi_0 assigns complexity 0 to x]. So how can one claim that C(x) is an objective notion?"

They then set about giving a "mathematically clean solution" by defining some equivalence classes, claiming these equivalence classes had a single smallest element. This is all well, but the problem is that according to their definition of equivalence classes, all universal Turing machines end up in the smallest equivalence class. And, clearly, for every string x there is a universal Turing machine assigning complexity 0 to x. (Just take any universal Turing machine, and redefine it so it prints x on no input.)

So in my opinion their argument is not making C(x) any more objective.

Any thoughts?

PS. If anyone wants to see the argument for himself, but don't have the book at hand, I can just scan the page (it's only one page so it's no trouble). Just let me know.

### Wén Shào

Aug 23, 2011, 12:07:43 AM8/23/11
Hi Tom,

I don't quite see why all universal Turing machines have to necessarily end up in the smallest equivalence class. An apt example would be: given a universal Turing machine T, I can effectively construct another UTM T' such that the complexity with respective to T' is uniformly larger than the complexity with respective to T by some large number C. Then T' won't be in the equivalence class of T if I choose c sufficiently small.

Just my thought,

Cheers,

Wén

--
To post to this group, send email to magic...@googlegroups.com
To unsubscribe from this group, send email to
For more options, visit this group at

### Laurent

Aug 23, 2011, 3:09:25 AM8/23/11
Wén, do you feel the proposed solution is appealing, or satisfying? (and why, if possible)
It's strange this solution is rarely quoted, though I suppose Marcus (for example) is well aware of it?
I'll need to have a closer look sometime soon.

Laurent

### Wén Shào

Aug 23, 2011, 5:32:09 AM8/23/11
Hi Tom and Laurent,

Please ignore my comment made before, obviously it didn't' make any sense. Should think more carefully before I speak...:P

I will spend some time looking at their solution more closely, and let you know what I feel then.

Cheers,

Wén

### Wén Shào

Aug 23, 2011, 7:44:18 AM8/23/11
Hi Tom and Laurent,

I agree with Tom, their solution is not quite satisfying. I believe Marcus is well aware of this, and I personally don't think Marcus accepts their argument as well, as whenever he mentions (has to mention) this issue in his papers, he uses "universally agreed-upon enumeration" to "get rid of" arbitrariness.

As far as I know, there isn't a completely satisfying solution yet (I don't believe any of my colleague knows one). Nevertheless, the arbitrariness in choosing universal Turing machine only matters when we examine the complexity of a particular object, which we seldom do.

Just my thought,

Wén

### Laurent

Aug 23, 2011, 8:13:18 AM8/23/11
On Tue, Aug 23, 2011 at 13:44, Wén Shào wrote:
Hi Tom and Laurent,

I agree with Tom, their solution is not quite satisfying. I believe Marcus is well aware of this, and I personally don't think Marcus accepts their argument as well, as whenever he mentions (has to mention) this issue in his papers, he uses "universally agreed-upon enumeration" to "get rid of" arbitrariness.

Thanks Wén, good to know.

As far as I know, there isn't a completely satisfying solution yet (I don't believe any of my colleague knows one). Nevertheless, the arbitrariness in choosing universal Turing machine only matters when we examine the complexity of a particular object, which we seldom do.

Hmm, I'm not so sure.
The reference UTM is also of importance when considering small strings, which we often do.
One could however argue that our experience in the world (genetic and cultural) has made "our reference machine" less arbitrary, and that it would go similarly for an AGI with more than 20 years of age.

Laurent

### Sun Yi

Aug 23, 2011, 2:50:56 PM8/23/11

Hello Guys,

I just finished reading Shane’s paper on the negative results on making Solomonoff induction practical, and a question pop out:

What is the precise linkage between the theory of Solomonoff induction and the reality?

Take probability theory for example: The only linkage between the theory of probability and the reality is ‘the fundamental interpretative hypothesis’, which states ‘that events with zero or low probability are unlikely to occur’ (Shafer and Vovk, ‘Probability and finance: It’s only a game’, pp.5) This makes clear that conclusions drawn from probabilistic inference about the reality make sense only if we accept this hypothesis, and as a indirect consequence, provides theoretical basis for the generalization results in the statistical learning context. (E.g., A is better than B on a set of i.i.d. examples => A is likely to be better than B on all cases otherwise something with small probability will happen.)

When looking at the theory of Solomonoff induction (or the RL extension AIXI), I found such clearly stated linkage is somehow missing. Shane’s negative results show that we cannot say much mathematically about the performance of a complex predictor. However, in the original paper, it is suggested that the creation of AGI would be more or less an experimental science. However, I also found this point problematic. One of the reason is that almost all experimental study requires the comparison between two agents A and B. And if A and B are complex algorithmic objects, and if we adopt Shane’s definition that a predictor is universal if it ‘eventually’ correctly predict a sequence, then there is practically no way to draw any conclusion by looking at the performance of the agents on finite data. As a result, to make an experimental study useful, we have to answer precisely the following question:

What is the fundamental hypothesis we must accept to say that ‘Predictor A is better than B on the given set of finite strings => A is likely to be better than B on all sequences’?

In other word, we must state very clearly under what hypothesis should we belief that the experimental result will generalize, before we start to do any experiment. I don’t have an answer to this problem, yet I think the answer is necessary if we want to make the universal theory practically relevant.

Regards,

Sun Yi

--

### Laurent

Aug 23, 2011, 3:50:59 PM8/23/11
Hi (btw, what's your first name, Yi or Sun?)

That's a very interesting question you raise.

On Tue, Aug 23, 2011 at 20:50, Sun Yi wrote:

Hello Guys,

I just finished reading Shane’s paper on the negative results on making Solomonoff induction practical,

Btw, note the negative result holds for agents too, not only for prediction:
http://www.hutter1.net/official/bib.htm#asyoptag

(And I will discuss Shane's results in another thread.)

and a question pop out:

What is the precise linkage between the theory of Solomonoff induction and the reality?

You may also want to read "A philosophical treatise of Universal induction":
http://www.hutter1.net/official/bib.htm#uiphil

It does not make a direct, irrefutable link with the real world, but it shows how Solomonoff's Induction solves some long-standing philosophical problems about induction, like the Raven's paradox .

Take probability theory for example: The only linkage between the theory of probability and the reality is ‘the fundamental interpretative hypothesis’, which states ‘that events with zero or low probability are unlikely to occur’ (Shafer and Vovk, ‘Probability and finance: It’s only a game’, pp.5) This makes clear that conclusions drawn from probabilistic inference about the reality make sense only if we accept this hypothesis, and as a indirect consequence, provides theoretical basis for the generalization results in the statistical learning context. (E.g., A is better than B on a set of i.i.d. examples => A is likely to be better than B on all cases otherwise something with small probability will happen.)

If you go this far, you can also question the validity of mathematical axioms, and Leo Pape would probably love to discuss that with you ;)

However, regarding probability theory, I thought that (IIRC) Cox's axioms were pretty solid, i.e. don't they (him, Jaynes and others, see Jaynes 2003) show that probability theory is entailed properly (within some parameter set) from desirable principles, and if those principles are not used, we can get into trouble?

When looking at the theory of Solomonoff induction (or the RL extension AIXI), I found such clearly stated linkage is somehow missing. Shane’s negative results show that we cannot say much mathematically about the performance of a complex predictor. However, in the original paper, it is suggested that the creation of AGI would be more or less an experimental science.

(Shane, if you're reading, is that the reason why you went to neuroscience?)
Personally, I don't think there is a need to consider AGI only as an experimental science.
There are still many questions to answer. Maybe performance is not the right criterion, or maybe it is but we need a different framework were we have different properties.
It is also clear that computable agents will perform much poorer than AIXI in the real world (if that makes sense to consider AIXI in the real world).
Legg's results are a start, but I believe this is far from the end of the story. We need more results, in different situations, etc.

However, I also found this point problematic. One of the reason is that almost all experimental study requires the comparison between two agents A and B. And if A and B are complex algorithmic objects, and if we adopt Shane’s definition that a predictor is universal if it ‘eventually’ correctly predict a sequence, then there is practically no way to draw any conclusion by looking at the performance of the agents on finite data. As a result, to make an experimental study useful, we have to answer precisely the following question:

What is the fundamental hypothesis we must accept to say that ‘Predictor A is better than B on the given set of finite strings => A is likely to be better than B on all sequences’?

Why couldn't we compare agents theoretically?
Also, infinite sequences are of interest, even if we end up saying, for example, that some agent cannot learn any infinite sequence for some reason.

So an agent can  perform very poorly on some sequences/strings, yet perform well on some others.
A computable agent cannot perform well against some kind of adversarial environments? Well, maybe this kind of environments are too specific (maybe we can measure this specificity with the mutual information of the agent and the environment: if they are too much alike, the agent might perform poorly), but the agent can still perform very well in highly complex environments that are not adversarial.

In other word, we must state very clearly under what hypothesis should we belief that the experimental result will generalize, before we start to do any experiment. I don’t have an answer to this problem, yet I think the answer is necessary if we want to make the universal theory practically relevant.

I very much agree that we need to state clearly what hypothesis we should believe in.
Peter Sunehag is trying to axiomatize rational RL: http://www.hutter1.net/official/bib.htm#aixiaxiom
That's a nice start.

What would be marvelous is a set of hypothesis that entail the AGI theory (potentially based on Solomonoff Induction), along with proofs that outside these boundaries, some good properties cannot hold.
One such results is Hutter's "only agents based on bayes mixtures can be Pareto optimal":
Hutter's book 2005, and, I think, http://www.hutter1.net/ai/optisp.htm

Laurent

### Rohrer, Brandon R

Aug 23, 2011, 5:15:44 PM8/23/11
Hi Laurent,

I'm enjoying the discussion here. I want to draw attention to a statement you made in your last post:

It is also clear that computable agents will perform much poorer than AIXI in the real world (if that makes sense to consider AIXI in the real world).

I'll try to contradict it, and you can tell me if I'm wrong.

My understanding is that AIXI is provably optimal (given that we can agree on the Turing machine used to express plans, accept the specific formulation of the Solomonoff predictor as optimal, and agree on a horizon) across all possible RL problems. But the real world, by which I assume you mean the physical world, will not present a uniform sampling of all possible RL problems to an agent. In fact, it is quite possible that the real world will only present an infinitessimal subset of all possible RL problems to the agent. In this case, an agent may be able to far outperform AIXI on physical world RL tasks, yet still may be considered "general" in the sense that it can do anything that we ourselves are likely to do or need done.

Are there some holes in this that I'm not seeing?

Brandon

### Abram Demski

Aug 23, 2011, 8:45:55 PM8/23/11
Rohrer,

An answer (partly joking, but true): that's just a matter of reference machine choice. We can choose a reference machine which generates RL problems in a way that is distributed very much like our world, and talk about how well *that* formulation of AIXI would do in comparison to a finite agent.

--Abram

--
To post to this group, send email to magic...@googlegroups.com
To unsubscribe from this group, send email to
For more options, visit this group at

### Abram Demski

Aug 23, 2011, 8:50:44 PM8/23/11
Hi,

I think one answer to this is: elegant theories usually win. That's the fundamental connection between Solomonoff induction and real-world experience, and it is actually quite related to the "fundamental interpretive hypothesis" you mention for probability theory. Whereas that hypothesis is that low-probability events will rarely be encountered, this one is that elegant things will often be encountered.

--Abram

### Laurent

Aug 24, 2011, 4:27:13 AM8/24/11
Hi Brandon,

> It is also clear that computable agents will perform much poorer than AIXI in the real world (if that makes sense to consider AIXI in the real world).

I'll try to contradict it, and you can tell me if I'm wrong.

My understanding is that AIXI is provably optimal (given that we can agree on the Turing machine used to express plans, accept the specific formulation of the Solomonoff predictor as optimal, and agree on a horizon) across all possible RL problems. But the real world, by which I assume you mean the physical world, will not present a uniform sampling of all possible RL problems to an agent. In fact, it is quite possible that the real world will only present an infinitessimal subset of all possible RL problems to the agent. In this case, an agent may be able to far outperform AIXI on physical world RL tasks, yet still may be considered "general" in the sense that it can do anything that we ourselves are likely to do or need done.

So, yes, in a sense, we might do better than AIXI in our real world, that's right.
If you consider one single task, like adding 32 bits numbers, then certainly we can do better than AIXI, since we might even be able to prove that we have the best algorithm. But that is limited to very simple tasks, and the world is much more complex.

And:

- Today, I tend to think that our world exposes us often to (some) simple "sub-environments" (all goes as expected, everything is predictable), and seldom to more complex sub-environments. This distribution reflects Occam's razor, except than instead of considering a diversity of environments, we consider one single environment composed of multiple sub-environments. Apparently, you seem to refer to that too, if I'm not mistaken.

- Our real world looks Turing-complete (save for some huuuge memory bound), since we can create computers. Thus virtually all environments can be expressed in this framework, expect the veeery complex ones maybe, but note that the prior of these environments is small.
So it's probably not infinitesimal.
Your argument might be better in a simpler environment than the real world.

- AIXI does not need the world to be a composition of environments. It works great in any environment. For Solomonoff Induction, a computable agent can only save a finite amount of prediction mistakes in any environment.
(This is not completely true for AIXI, see http://www.springerlink.com/content/p2780778k054411x/ and http://www.hutter1.net/official/bib.htm#asyoptag but that problem may arise in computable agents as well, as it is a problem of exploration, not of exploitation)
I.e., even if you choose the environment, and the best agent for that environment, Solomonoff Induction will learn to predict like that agent in less errors than the size of the smallest program that computes the environment. (For the real world, "all we need" for the smallest program are the laws of physics and the initial state at the big-bang, for example. Much less is required in fact, as many things do not directly depend on distant stars).

- AIXI discards all environments that are not consistent with its experience. So it (probably) suffices for AIXI to throw one single die to "infer" many things about classical mechanics, and maybe quantum dynamics and Newton's laws of gravitation. With a few more throws it could generalize the laws very quickly. It does not need to observe the sand and the stars to infer that the world might be composed of particles (or that it leads to a good model). Learning is amazingly fast.
The knowledge it gains about the world grows so quickly that even if you add some additional initial knowledge to a computable agent, AIXI should catch up in no time.

- last but not least, I was considering *real-time* real-world agents, those which can only do a very limited number of computations per interaction step, whereas AIXI can do an infinite number of computations in the same time. For such real agents, learning can only be extremely slow, since this requires time.

So I still think that AIXI would be vastly more intelligent in the real world than the real agents we will be able to create.

Laurent

### Sun Yi

Aug 24, 2011, 5:11:26 AM8/24/11

Hello,

My name is SUN Yi, Yi is the first name: )

From my point of view:

- There is no problem about the linkage between the ideal Solomonoff induction and the reality, except it is not computable : )

If we make the assumption that the environment is computable (which is a religious thing since we cannot mathematically reason about the computability of the world in which we live, accept it or not...), then we can make the precise prediction with bounded regret.

- My problem is with the computable agent. Shane’s result shows that even if we make the assumption that the world is simple in the Kolmogorov sense, we still cannot guarantee the performance of the agent. Moreover, beyond certain point we cannot even mathematically reason about the computable agent.

- As for the comparison of two agents, I have the intuition that it is not decidable (no proof yet). Anyway, I wrote the following Lemma, basically restate Shane’s result. However, it shows that deciding if a sequence is predicted by a given predictor is very hard.

Regards,

Sun Yi

Sent: Tuesday, August 23, 2011 21:51

--

image001.jpg

### Rohrer, Brandon R

Aug 24, 2011, 5:24:43 AM8/24/11
Laurent,

____________________
- Our real world looks Turing-complete (save for some huuuge memory bound), since we can create computers. Thus virtually all environments can be expressed in this framework, expect the veeery complex ones maybe, but note that the prior of these environments is small.
So it's probably not infinitesimal.
Your argument might be better in a simpler environment than the real world

>>True. If you meant to include the computers we create when you cited the "real world" then my comment isn't  relevant.

- AIXI discards all environments that are not consistent with its experience. So it (probably) suffices for AIXI to throw one single die to "infer" many things about classical mechanics, and maybe quantum dynamics and Newton's laws of gravitation. With a few more throws it could generalize the laws very quickly. It does not need to observe the sand and the stars to infer that the world might be composed of particles (or that it leads to a good model). Learning is amazingly fast.
The knowledge it gains about the world grows so quickly that even if you add some additional initial knowledge to a computable agent, AIXI should catch up in no time.

>>You seem to have an intuition about how an AIXI would perform in certain situations. I don't have this, or I should say, my intuition is different. But since my intuition based on no empirical evidence, I don't give it much weight. My impression was that AIXI never discards an environment, but just considers them less likely when they are contradicted. The notion that a single observation could be used to definitively rule out certain possibilities seems strange to me. Why would that be so? Does Solomonoff Induction make no allowance for inaccuracies in observations? If so, that seems like another major barrier to implementing a variant of AIXI.

- last but not least, I was considering *real-time* real-world agents, those which can only do a very limited number of computations per interaction step, whereas AIXI can do an infinite number of computations in the same time. For such real agents, learning can only be extremely slow, since this requires time.

>>Hardly a fair comparison! :)

So I still think that AIXI would be vastly more intelligent in the real world than the real agents we will be able to create.

>>It would be great to be able to make such a comparison in practice.

Brandon

### Abram Demski

Aug 24, 2011, 2:30:31 PM8/24/11
Brandon,

One point below.

On Wed, Aug 24, 2011 at 2:24 AM, Rohrer, Brandon R wrote:
Laurent,

____________________
- Our real world looks Turing-complete (save for some huuuge memory bound), since we can create computers. Thus virtually all environments can be expressed in this framework, expect the veeery complex ones maybe, but note that the prior of these environments is small.
So it's probably not infinitesimal.
Your argument might be better in a simpler environment than the real world

>>True. If you meant to include the computers we create when you cited the "real world" then my comment isn't  relevant.

- AIXI discards all environments that are not consistent with its experience. So it (probably) suffices for AIXI to throw one single die to "infer" many things about classical mechanics, and maybe quantum dynamics and Newton's laws of gravitation. With a few more throws it could generalize the laws very quickly. It does not need to observe the sand and the stars to infer that the world might be composed of particles (or that it leads to a good model). Learning is amazingly fast.
The knowledge it gains about the world grows so quickly that even if you add some additional initial knowledge to a computable agent, AIXI should catch up in no time.

>>You seem to have an intuition about how an AIXI would perform in certain situations. I don't have this, or I should say, my intuition is different. But since my intuition based on no empirical evidence, I don't give it much weight. My impression was that AIXI never discards an environment, but just considers them less likely when they are contradicted. The notion that a single observation could be used to definitively rule out certain possibilities seems strange to me. Why would that be so? Does Solomonoff Induction make no allowance for inaccuracies in observations? If so, that seems like another major barrier to implementing a variant of AIXI.

My guess is that this confusion comes from the existence of two different formulations of solomonoff induction.

The simpler formulation is a mixture model of all computable deterministic models. In this case, an individual model makes hard predictions about the future, and we can simply discard it if it turns out to be wrong. Bad models will initially be eliminated very very quickly.:

The mixture model can also be formulated as a combination of all computable probability distributions. In this case, models make soft predictions, so we are adjusting their relative probabilities rather than discarding them.

(I should provide a reference for the equivalence, but I'm not sure where it is proven...)

- last but not least, I was considering *real-time* real-world agents, those which can only do a very limited number of computations per interaction step, whereas AIXI can do an infinite number of computations in the same time. For such real agents, learning can only be extremely slow, since this requires time.

>>Hardly a fair comparison! :)

So I still think that AIXI would be vastly more intelligent in the real world than the real agents we will be able to create.

>>It would be great to be able to make such a comparison in practice.

Brandon

--
To post to this group, send email to magic...@googlegroups.com
To unsubscribe from this group, send email to
For more options, visit this group at

### Laurent

Aug 24, 2011, 2:49:16 PM8/24/11

On Wed, Aug 24, 2011 at 20:30, Abram Demski wrote:
Brandon,

One point below.

On Wed, Aug 24, 2011 at 2:24 AM, Rohrer, Brandon R wrote:
Laurent,

____________________
- Our real world looks Turing-complete (save for some huuuge memory bound), since we can create computers. Thus virtually all environments can be expressed in this framework, expect the veeery complex ones maybe, but note that the prior of these environments is small.
So it's probably not infinitesimal.
Your argument might be better in a simpler environment than the real world

>>True. If you meant to include the computers we create when you cited the "real world" then my comment isn't  relevant.

- AIXI discards all environments that are not consistent with its experience. So it (probably) suffices for AIXI to throw one single die to "infer" many things about classical mechanics, and maybe quantum dynamics and Newton's laws of gravitation. With a few more throws it could generalize the laws very quickly. It does not need to observe the sand and the stars to infer that the world might be composed of particles (or that it leads to a good model). Learning is amazingly fast.
The knowledge it gains about the world grows so quickly that even if you add some additional initial knowledge to a computable agent, AIXI should catch up in no time.

>>You seem to have an intuition about how an AIXI would perform in certain situations. I don't have this, or I should say, my intuition is different. But since my intuition based on no empirical evidence, I don't give it much weight. My impression was that AIXI never discards an environment, but just considers them less likely when they are contradicted. The notion that a single observation could be used to definitively rule out certain possibilities seems strange to me. Why would that be so? Does Solomonoff Induction make no allowance for inaccuracies in observations? If so, that seems like another major barrier to implementing a variant of AIXI.

My guess is that this confusion comes from the existence of two different formulations of solomonoff induction.

The simpler formulation is a mixture model of all computable deterministic models. In this case, an individual model makes hard predictions about the future, and we can simply discard it if it turns out to be wrong. Bad models will initially be eliminated very very quickly.:

The mixture model can also be formulated as a combination of all computable probability distributions. In this case, models make soft predictions, so we are adjusting their relative probabilities rather than discarding them.

(I should provide a reference for the equivalence, but I'm not sure where it is proven...)

You can find that in Hutter 2005 at least, probably in Li&Vitanyi too.

Brandon, when I say "discard an environment", I'm talking about an environment as a program p, so that on input y_{1:t} (the history of the agent's actions from time 1 to t), p(y_{1:t}) produces the output x_{1:t} (observations of the agent).
If the output produced by this program is different from the true output as received by the agent, then this particular environment/program is no longer consistent and gets discarded.
However, it is easy to construct a new program p' based on p that remains consistent with the history. In general, though, p' is more complex than p.
This means that as the agent interacts with the environment, there is *always* an infinite variety of environments that are consistent with the history, and there are always environments that predict whatever you want (i.e. any future is *possible*). However, Occam's razor makes some futures less and less likely, making the program representing the true environment more and more likely.

Laurent