Here's the abstract. I recommend that those interested in the approach
check it out.
Learning What to Value
Abstract: I.J. Good's theory of an "intelligence explosion" predicts
that ultraintelligent agents will undergo a process of repeated
self-improvement. In the wake of such an event, how well our values are
fulfilled will depend on whether these ultraintelligent agents continue
to act desirably and as intended. We examine several design approaches,
based on AIXI, that could be used to create ultraintelligent agents. In
each case, we analyze the design conditions required for a successful,
well-behaved ultraintelligent agent to be created. Our main contribution
is an examination of value-learners, agents that learn a utility
function from experience. We conclude that the design conditions on
value-learners are in some ways less demanding than those on other
design approaches.
- http://www.danieldewey.net/dewey-learning-what-to-value.pdf
--
Tim Tyler  http://timtyler.org/  t...@tt1lock.org  Remove lock to reply.
Why is it that a guy like that can write an entire paper about
reinforcement learning machines, and not once mention "reinforcement
learning" or make reference to any of the work done in the field? Does he
even understand he's talking about building/defining reinforcement learning
machines?
His Appendix B - "Ultraintelligent Reward Maximizers Behave Badly" - I
agree with. He argues reward maximizers will succumb to the wirehead
problem if they become too intelligent.
Sadly, what he seems to have failed to realize is that any actual
implementation of an O-Maximizer or of his value-learners must also be a
reward maximizer. Is he really so stupid as not to understand that they are
all reward maximizers?
About O-Maximizers, he writes:
"Like AIXI, this agent acts to maximize an expected value, ..."
The only difference is in the algorithm it uses to calculate the "expected
value". Does he not understand that if you build a machine to do this,
there must be hardware in the machine that calculates that expected value?
And that such a machine can then be seen as two machines, one which is
calculating the expected value, and the other which is picking actions to
maximize the output of that calculation? And once you have that machine,
his argument of appendix B once again applies?
And his value-learning machine is the same thing. The only difference is
that its value calculator (reward calculator) is driven by a different
reward-calculating function.
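
To make that concrete, here's a rough Python sketch (my own framing and
made-up toy predictors, not Dewey's formalism). Whatever you call the agent,
the implementation splits into a value calculator and an action picker, and
only the value calculator changes:

# Rough sketch, my framing: every one of these agents is "a value calculator
# plus an action picker"; only the calculator differs.  The predictors below
# are toy stand-ins.

OUTCOMES = ["good_outcome", "bad_outcome"]           # hypothetical outcome set
UTILITY = {"good_outcome": 1.0, "bad_outcome": 0.0}  # designer-specified U

def predict_outcome_prob(outcome, history, action):
    # Toy world model: action "a" makes the good outcome more likely.
    p_good = 0.8 if action == "a" else 0.4
    return p_good if outcome == "good_outcome" else 1.0 - p_good

def predict_future_reward(history, action):
    # Toy stand-in for whatever learned reward predictor you like.
    return 1.0 if action == "a" else 0.3

def reward_maximizer_value(history, action):
    return predict_future_reward(history, action)

def o_maximizer_value(history, action):
    return sum(UTILITY[r] * predict_outcome_prob(r, history, action)
               for r in OUTCOMES)

def pick_action(actions, history, expected_value):
    # The "action picker": maximize whatever the value calculator outputs.
    return max(actions, key=lambda a: expected_value(history, a))

print(pick_action(["a", "b"], [], reward_maximizer_value))   # -> a
print(pick_action(["a", "b"], [], o_maximizer_value))        # -> a

A value-learner just swaps in yet another calculator - one whose utility
table is itself estimated from experience - and the appendix B argument
targets the calculator/picker split, not the name on the box.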
It seems to me that, since he is approaching the problem of reinforcement
learning from such a high-level mathematical angle, he has lost track of
the reality that these things can't just be specified mathematically. They
have to actually be built. And once you build them, all the hardware is not
just "part of the agent". It's also part of the environment - and as such,
free to be modified by the agent (self-modification).
Though I feel he is thinking clearly in his wirehead argument, it seems to
me that reward maximizing, as Hutter apparently talks about it and as it's
normally talked about in RL work, means working to maximize a reward coming
from the environment. I guess this gives Dewey the freedom to think of the
agent as free to modify the reward-generating system instead of being a
slave to it. But the same must always be true for any and all types of
value-maximizing machines. The value must be calculated by hardware, and
whether you choose to call that hardware part of the environment or part of
the agent makes no difference whatsoever. An agent of unbounded
intelligence will always reach the point of understanding that it has the
option to try to modify the reward function, which means the wirehead
problem is always on the table.
There are many ways I can think of to minimize the impact of the wirehead
problem, but I can't think of a single way to create intelligence that has
no potential wirehead problem.
--
Curt Welch http://CurtWelch.Com/
cu...@kcwc.com http://NewsReader.Com/
>> - http://www.danieldewey.net/dewey-learning-what-to-value.pdf
>
> Why is it that a guy like that can write an entire paper about
> reinforcement learning machines, and not once mention "reinforcement
> learning" or make reference to any of the work done in the field? Does he
> even understand he's talking about building/defining reinforcement learning
> machines?
AIXI is, fundamentally, a reinforcement learning agent.
> His Appendix B - "Ultraintelligent Reward Maximizers Behave Badly" - I
> agree with. He argues reward maximizers will succumb to the wirehead
> problem if they become too intelligent.
>
> Sadly, what he seems to have failed to realize is that any actual
> implementation of an O-Maximizer or of his value-learners must also be a
> reward maximizer. Is he really so stupid as not to understand that they
> are all reward maximizers?
It does appear to me that the author *probably* has a misconception along
these lines. Though this is an area where we disagree with each other.
> About O-Maximizers, he writes:
>
> "Like AIXI, this agent acts to maximize an expected value, ..."
>
> The only difference is in the algorithm it uses to calculate the "expected
> value". Does he not understand that if you build a machine to do this,
> there must be hardware in the machine that calculates that expected value?
> And that such a machine can then be seen as two machines, one which is
> calculating the expected value, and the other which is picking actions to
> maximize the output of that calculation? And once you have that machine,
> his argument of appendix B once again applies?
>
> And his value-learning machine is the same thing. The only difference is
> that its value calculator (reward calculator) is driven by a different
> reward-calculating function.
Yes, maybe. I don't agree that all such systems wirehead, but I do think
ones like this are pretty likely to. My own examination of similar systems
suggests to me that they are likely to wirehead, unless special care is
taken:
http://matchingpennies.com/wirehead_analysis/
> It seems to me that, since he is approaching the problem of reinforcement
> learning from such a high-level mathematical angle, he has lost track of
> the reality that these things can't just be specified mathematically. They
> have to actually be built. And once you build them, all the hardware is
> not just "part of the agent". It's also part of the environment - and as
> such, free to be modified by the agent (self-modification).
>
> Though I feel he is thinking clearly in his wirehead argument, it seems to
> me that reward maximizing, as Hutter apparently talks about it and as it's
> normally talked about in RL work, means working to maximize a reward
> coming from the environment. I guess this gives Dewey the freedom to think
> of the agent as free to modify the reward-generating system instead of
> being a slave to it. But the same must always be true for any and all
> types of value-maximizing machines. The value must be calculated by
> hardware, and whether you choose to call that hardware part of the
> environment or part of the agent makes no difference whatsoever.
I think there is *some* difference - in that if you make the critic part
of the agent, you can wire in an expected utility maximisation framework -
and the goal. Whereas if the critic is part of the environment, the agent
has to reverse-engineer the goal from a reward signal, and figure out the
details of expected utility maximisation for itself.
It may not make too much difference to a wirehead analysis, though -
as you say.
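
As a toy sketch of the wiring difference I mean (all names made up, nothing
authoritative about it):

import random

def critic_in_environment(policy, env_step, steps=10):
    # The reward arrives from outside; the agent has to infer the goal
    # behind the reward signal.
    history = []
    for _ in range(steps):
        action = policy(history)
        observation, reward = env_step(action)
        history.append((action, observation, reward))
    return history

def critic_in_agent(policy, env_step, utility, steps=10):
    # The goal (a utility over observations) is wired into the agent;
    # the environment only returns observations, no reward channel.
    history = []
    for _ in range(steps):
        action = policy(history)
        observation = env_step(action)
        history.append((action, observation, utility(observation)))
    return history

random_policy = lambda history: random.choice(["left", "right"])
env_with_reward = lambda a: ("saw_" + a, 1.0 if a == "right" else 0.0)
env_plain = lambda a: "saw_" + a
likes_right = lambda obs: 1.0 if obs == "saw_right" else 0.0

critic_in_environment(random_policy, env_with_reward)
critic_in_agent(random_policy, env_plain, likes_right)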
Yeah, I'm aware of that. And I know Hutter understands that. It was hard
for me to tell from that short article whether Dewey understands it, though.
Yeah, and I think there are many many ways to work around the wirehead
problem, and I think evolution will make sure those solutions flourish in
any AI society. Whether it ultimately puts an effective limit on the
intelligence of an individual or not I'm not sure. It's an arms race. If
the wirehead work-arounds can keep ahead of the growing intelligence, then
the intelligence will continue to grow and not die out from wireheading
disease. Otherwise, the intelligence growth will have to stagnate until
the wire-head work-arounds catch up. If there is a ceiling on what sort of
wirehead work-arounds can evolve, then that will put a ceiling on how
intelligent the agents can become.
> > It seems to me that, since he is approaching the problem of
> > reinforcement learning from such a high-level mathematical angle, he has
> > lost track of the reality that these things can't just be specified
> > mathematically. They have to actually be built. And once you build them,
> > all the hardware is not just "part of the agent". It's also part of the
> > environment - and as such, free to be modified by the agent
> > (self-modification).
> >
> > Though I feel he is thinking clearly in his wirehead argument, it seems
> > to me that reward maximizing, as Hutter apparently talks about it and as
> > it's normally talked about in RL work, means working to maximize a
> > reward coming from the environment. I guess this gives Dewey the freedom
> > to think of the agent as free to modify the reward-generating system
> > instead of being a slave to it. But the same must always be true for any
> > and all types of value-maximizing machines. The value must be calculated
> > by hardware, and whether you choose to call that hardware part of the
> > environment or part of the agent makes no difference whatsoever.
>
> I think there is *some* difference - in that if you make the critic part
> of the agent, you can wire in an expected utility maximisation framework -
> and the goal. Whereas if the critic is part of the environment, the agent
> has to reverse-engineer the goal from a reward signal, and figure out the
> details of expected utility maximisation for itself.
Well, that sort of stuff just gets wrapped up in implementation details to
me. There are a TON of different ways to implement reward-maximizing
machines, and some implementations may tie things together in a way that
the reward value is nearly impossible to separate out and wirehead without,
at the same time, breaking the basic reward-maximizing technology - aka
killing itself before it gets the "wirehead reward". If such an
implementation is possible, it would fall under the heading of
"wirehead-workaround" solutions.
> It may not make too much difference to a wirehead analysis, though -
> as you say.
I do, however, agree very much with the most important point of the paper -
the value-learning machine. Though I would describe what it's doing in
a somewhat different light.
I believe humans are already reinforcement learning machines that learn all
our behaviors and beliefs as ways of maximizing an internal reward. Human
ethics and morals are so hard to understand, because they are mostly
learned based on experience from a complex life. There is no simple way to
summarize the ethics of a single human (because their total ethics are
encoded in all their behaviors - encoded in the entire current wiring of
their brain).
To attempt to make a reinforcement learning AI robot that would always be
correctly motivated to follow the desires (morals and ethics) of any single
human would require us to encode into the robot's reward function the
entire wiring and sensory system of that human. To make it follow the
morals and ethics of a society of a billion humans would require us to wire
the behavior functions of a billion brains into the robot.
And worse yet, those human brains are constantly changing with experience,
so the robot's reward function would have to be constantly updated.
But the easier way to make sure the robot is working to keep humans happy
is to tap into the human reward signal (implant some sensors in the brains
of the humans) and broadcast it to the robots - and make them attempt to
maximize some combined measure of all the humans' reward signals. That
would make the robots the "value learning" machines Dewey is talking about,
though the implementation could be somewhat different from his specific
mathematical model.
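
Something as crude as this is what I mean by "some combined measure" (all
numbers and names invented for illustration; real implants and signals
would obviously be far messier):

def combined_human_reward(signals, weights=None):
    # signals: dict mapping person -> latest reward reading from their implant.
    # Returns the single scalar the robots would try to maximize.
    if weights is None:
        weights = {person: 1.0 for person in signals}
    total = sum(weights.values())
    return sum(weights[p] * r for p, r in signals.items()) / total

# Example: three people, one of them unhappy.
print(combined_human_reward({"alice": 0.9, "bob": 0.8, "carol": 0.1}))  # ~0.6

An average is just one choice - a min() over people would instead make the
robots attend to whoever is least happy.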
But it also makes it easy for the robots to wirehead themselves just by
blocking, or faking, the human reward signals. So ways to protect that
human-to-robot reward signal from being wireheaded would have to be found
if the robots being controlled were anywhere near smart enough to do that
sort of wireheading.
I didn't try to follow his math too closely (I didn't study his formulas to
the point of being sure I understood them), but my quick read was that the
formulas only worked if the machine had the power to correctly predict all
possible outcomes (universes) for K steps into the future, and then average
the expected value over all of them to pick the best current "move".
That's a typical board-game technique - the look-at-all-moves-K-steps-into-
the-future approach. That sort of approach is completely unworkable for AI
(no single AI will ever be able to predict what 100 other AIs of the same
size and complexity in its environment will do for K steps into the future,
where K is large enough to be useful). Not to mention just trying to
predict physics, such as whether it will rain tomorrow, at the same time.
No practical AI will work by running simulations of the future with every
"move" it makes.
What they do instead is learn which choice has shown itself to work best
in the current context, based on past experience, and pick that choice for
now. The low-level hardware doesn't do any look-ahead. It doesn't need to.
But the trick to making that work is creating a "context" which is highly
predictive of what is about to happen, so that an action selected on the
context that worked in the past is likely to be useful again (because the
context as defined by the AI is doing a good job of predicting what will
happen next).
The only prediction the context needs to support is the expected future
reward, based on past experience with this same context. Every behavior is
judged for its worth based on how well it did relative to that prediction.
Future rewards are all these machines need to predict - they don't need to
predict how the environment will change in the next hour or year. The fact
that we, at the high level, have some powers to do that is a demonstration
of some of our high-level learned behaviors, and not a demonstration of how
the low-level hardware is doing its primary reward-maximizing job. At least
that's how I see it. :)
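
Bare-bones, the shape of that idea is just this (not any particular
published algorithm, just a sketch):

from collections import defaultdict
import random

ESTIMATE = defaultdict(float)   # (context, action) -> estimated future reward
STEP = 0.1

def choose(context, actions, explore=0.1):
    # Pick whatever has worked best in this context before; occasionally
    # try something new.
    if random.random() < explore:
        return random.choice(actions)
    return max(actions, key=lambda a: ESTIMATE[(context, a)])

def learn(context, action, observed_return):
    # Nudge the stored estimate toward what actually happened.  No look-ahead:
    # the "prediction" is just remembered experience for this context.
    key = (context, action)
    ESTIMATE[key] += STEP * (observed_return - ESTIMATE[key])

choose("ball_rolling_right", ["reach_left", "reach_right"])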
> I didn't try to follow his math too closely (I didn't study his formulas
> to the point of being sure I understood them), but my quick read was that
> the formulas only worked if the machine had the power to correctly predict
> all possible outcomes (universes) for K steps into the future, and then
> average the expected value over all of them to pick the best current
> "move". That's a typical board-game technique - the
> look-at-all-moves-K-steps-into-the-future approach. That sort of approach
> is completely unworkable for AI (no single AI will ever be able to predict
> what 100 other AIs of the same size and complexity in its environment will
> do for K steps into the future, where K is large enough to be useful). Not
> to mention just trying to predict physics, such as whether it will rain
> tomorrow, at the same time. No practical AI will work by running
> simulations of the future with every "move" it makes.
FWIW, that is - more or less - how I expect machine intelligences to work.
The things to realise are:
* The machine is likely to be predicting future sense inputs, using
induction based on current sense data. Rather different from predicting
physics.
* The machine may not be working with raw sense data. Sense data is
converted to various high-level abstractions after being input. Induction
could be predicting streams at any one of those hierarchical levels.
* Action sequences can be clumped together (habits), so the machine may
not evaluate on every timestep - just when it needs to decide what to do.
Bear this lot in mind, and I think the strategy looks more practical.
> The only prediction the context needs to support is the expected future
> reward, based on past experience with this same context. Every behavior is
> judged for its worth based on how well it did relative to that prediction.
> Future rewards are all these machines need to predict - they don't need to
> predict how the environment will change in the next hour or year.
You don't normally know what reward you will be getting until you know
what your future circumstances are, though.
That is why your brain contains an elaborate world simulator, which is
constantly predicting the future consequences of your possible actions.
Except for those loose cannon fools out there.
Yes and no. Unfortunately, to fully predict future senses, you have to
fully predict the physics of your environment. You throw a ball and watch
it bounce around the room, and in order to predict where it will end up and
what things it will knock over and change, you not only have to predict the
physics, you have to have an essentially perfect understanding of the state
of the room. Basically, it's impossible. No machine that can fit in that
room will be able to predict where the ball will be in 20 seconds for a
typically complex room full of complex stuff.
Our brain doesn't tell us where the ball will be in 20 seconds. It gives
us a very rough idea that if the ball is moving to the right now, it will
_likely_ be further to the right .1 seconds from now. If we see it
heading towards a wall, we predict the bounce that will happen .1 seconds
from now. If we see it heading towards a random stack of books, we predict
that we won't be able to predict which way it's about to bounce.
But our predictions are very limited. What we can predict is almost
insignificant compared to all we can't predict. But the little that we can
predict - what makes it worth our time to have a brain - is enough to give
us an important edge on survival. Our ability to predict what we can't
predict is nearly as important as our ability to predict the little we can.
> * The machine may not be working with raw sense data. Sense data is
> converted to various high-level abstractions after being input.
> Induction could be predicting streams at any one of those hierarchical
> levels.
Yes, and the important aspect of how those abstractions are selected is
their temporal predictive power. The abstractions that are the best
predictors are the ones that get built. So even though there is not
much we can predict, all the stuff that we can predict something about
(often better thought of as just predicting a probability distribution) is
what the brain builds detectors for. We learn to "see", and make use of,
the patterns in the sensory data that are predictive of each other. And in
the hierarchy, each level is making predictions about the previous levels.
It builds a hierarchy to extract as much "easy to predict" information out
of the sensory stream as it can find. And then it leverages all those
predictive features of the environment to make reward-maximizing action
decisions. It allows us to recognize there is food that will give us a
reward if we can get it in our mouth, and it leverages the predictable
actions of our arms and hands and the food to produce a sequence of
actions that ends with the food in our mouth.
But the stuff it could not predict (like how many crumbs might fall off the
cupcake before we get it into our mouth, or who might walk into the room
and distract us from our goal of the food) far outnumbers the little stuff
we could predict. But the fact that 99.999% of what the atoms in the room
and in our environment were going to do next could not be predicted didn't
change the fact that what we could predict was good enough to allow us to
make that food get into our mouth.
> * Action sequences can be clumped together (habits), so the machine may
> not evaluate on every timestep - just when it needs to decide what to do.
>
> Bear this lot in mind, and I think the strategy looks more practical.
Yes, but I think you are assuming more than what is there.
It's not just that _some_ actions can be clumped together. They ALL are.
Our brain becomes conditioned with billions of strategies for dealing with
our environment. We have a set of behaviors that make our arm move to get
the hand to the mouth which gets triggered by food in our hand. We learned
other tricks of making the arm and hand move to pick up food, because once
we did that, the "hand to mouth" behaviors take over and lead to a reward
when the food gets to the mouth. Once those sorts of strategies are in our
behavior set, then we learn the strategies of using our legs to get us
close to some food we see a few feet away from us. Because once we are
close enough, the "grab food" and the "stick food in mouth" behaviors take
over and get that reward for us.
All these learned behaviors learn to take advantage of each other to build
a huge set of learned strategies for dealing with our environment.
Everything "clumps together" with everything else as we produce each
micro-behavior as a reaction to the currently perceived condition of our
environment.
Learning to use language - to allow ourselves to be controlled by it, and
to produce it in certain patterns - is all a part of that large and growing
list of billions of strategies we have learned for how to react to our
perceived environment.
Though we have lots of short-term predictive powers that allow us to guess
what is likely to happen next, and though we take advantage of them in our
thinking process, our low-level hardware is not doing that at all. It's just
selecting behaviors by direct look-up in a large associative memory system
that holds all our past conditioned behaviors. Conditioning works by
pre-calculating what we should do NEXT TIME, so that when we are in the
situation again, we know instantly "what to do" without having to "think
through" or "calculate" anything. The low-level system producing our
intelligent behavior is just an associative memory system selecting actions
by direct look-up based on the current context of the environment. All our
behavior is just a collection of "habits" we have learned.
However, our brain also has an interesting feature which allows for our
private thoughts. Some (generally small) amount of the perceptual context
that exists seems to be driven from the top down, so that learned behaviors
have some control over our perception - which then regulates what we "do
next". So we learn to talk, but our internal talk hardware is not limited
to driving only our lips - it can drive some of our perceptual system
through some internal top-down feedback that allows us to perceive our own
talking, even though we didn't move our lips. As well as perceive things
like the visual effects that would have resulted if we had moved our arms,
even though we didn't move our arms, etc.
This ability to generate these "fake" perceptions is just another feature
of the environment that the learning hardware learns to take advantage of. We
learn how to manipulate our thoughts to our advantage the same way we learn
to manipulate our arms and legs to our advantage - by conditioning.
Our power to manipulate our perception system, however, is biased by what
the perception system has been trained to predict - how it expects sensory
perceptions to change over time. If we visualize a light bulb dropping
to a concrete floor, our perception system makes us see (and hear) it
shatter, not splash like a big drop of water would. So by using learned
behaviors to drive our perception in that way, we can use the predictive
powers of our perception system to our advantage - to predict what will
happen, before we actually do it.
But our perception system is only making predictions based on what it has
seen happen in the past, so it's more accurately thought of as a complex
"memory" system that we can probe with our learned "memory probing"
behaviors (aka "thinking").
So though a large amount of our memory context is controlled by the flow of
sensory data, some amount is also controlled by this feedback
memory-probing ability we have. How we act, is a result of a look-up based
on that full context - the context set up by the environment plus the
context set up by our current "memory probes". But whether the brain is
choosing to make our arms move, or choosing to use the memory-probe
feature, is all just conditioned responses learned from a lifetime of
experience. The brain doesn't run some future-emulation to decide between
an arm movement and a memory-probe action; it does a direct look-up which
picks the action based on context very quickly.
When we do things like "think about the possible outcomes in this chess
position" it's nothing but a conditioned response we learned that triggers
a sequence of these internal memory probe events, and each memory probe
action (along with the entire current sensory context) triggers the next
behavior, whether that be more memory probing, or moving the arm to make a
chess piece move.
Our ability to plan and search alternatives for the future is not the low
level decision hardware at work, it's a high level learned behavior created
by stringing together a large set of low level conditioned responses
(learned habits).
Yes you do. It's back-calculated in advance, not forward-calculated when
needed. This is what RL research has shown us.
TD-Gammon, for example, uses a neural network to calculate the expected
probability of winning, from any game board. In other words, it's
calculating the expected future reward from _any_ game position. It gets a
reward value of 1 for winning, and 0 for losing. So this function is
estimating future reward, and since the reward is 0 to 1, the estimation
becomes 0 to 1 (which makes it the same as the probability of winning). It
does it by doing a direct function calculation on the current board
(current state of the environment), not by searching the game tree (not by
estimating possible future states).
The function is created by playing games, and seeing how the state changes.
When a real reward is received, all the past game board positions that
happened in that game are fed to the neural network, and the neural network
function is adjusted a little bit towards an output value of 1 (the real
reward). When the game ends in a loss and the reward is zero, the same thing
happens, except the function is adjusted to produce an answer closer to 0.
After playing millions of games, the function converges on an expected
value function - the expected value of winning from any game position.
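
A toy version of that update (a lookup table where TD-Gammon used a neural
network, and a simple end-of-game nudge where it actually used TD(lambda),
but the idea is the same):

from collections import defaultdict

VALUE = defaultdict(lambda: 0.5)   # position -> estimated probability of winning
ALPHA = 0.05                       # learning rate

def update_from_game(visited_positions, won):
    # After a game, nudge every visited position's estimate toward the outcome.
    outcome = 1.0 if won else 0.0
    for pos in visited_positions:
        VALUE[pos] += ALPHA * (outcome - VALUE[pos])

def pick_move(position, legal_moves, result_of):
    # A direct function of the resulting position - no game-tree search.
    return max(legal_moves, key=lambda m: VALUE[result_of(position, m)])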
How well this approach works all depends on how good the function is at
being able to predict a win or loss given the current state. It doesn't
have to produce an absolute answer. It only has to produce a probability.
But that probability must be accurate enough to cause reasonably good
action decisions to be made (at least better than random decisions).
Such an approach is only useful if the current state is a good predictor of
how to act to maximize the odds of future rewards. But for our environment,
that's generally true - at least for short-term actions. But it requires
the perception system to transform the sensory signals into the features
that are the best predictors.
> That is why your brain contains an elaborate world simulator, which is
> constantly predicting the future consequences of your possible actions.
It doesn't. It just acts, and then back propagates rewards to the previous
actions so as to create good estimations of which actions are most likely
to produce the most future rewards. And then it selects those actions over
all the other options.
Our perception system is built to create a good context based on
predictions. And as such, through experience, it learns to predict the
value of different actions in different contexts.
We can fully predict all possible futures in a chess game because the
environment is trivially simple. But even with full power of prediction,
it takes computers many times larger than the simple associative look up
system of the human brain to search even a small part of that space.
In the real world, we can't begin to predict how the world will change
accurately enough to forward-calculate actions based on simulations of the
future. We instead use all our past experience, combined, as the database
to predict not what the world will do, but only the odds of getting
rewards. Predicting the odds of rewards is a far simpler task than
predicting how all the sensory data will change. But if we can make any
useful prediction of future rewards, it will cause our behaviors to be
biased towards more useful actions - aka higher odds of future rewards.
And even a slight bias in our actions towards "more rewards" becomes
useful.
How good we are at picking useful behaviors has nothing to do with our
power to predict how the world will change in the future. It's only a
function of how much related experience we have, so that our innate gut
feelings about what is "better" will be more accurate.
The little power we have to actually predict the future is mostly just a
side effect we got for free, not the foundation of how we make decisions.
In other words, the system doesn't need to know why a given action will
lead to more rewards, it only needs to know that it will. When you watch a
system driven in this way, it "looks" as if it is "predicting the future".
For example, we see TD-Gammon move, and we rationalize its choice by
saying something like, "look, it's moving there because it knows it needs
to protect its piece from being taken". But in fact, it had no such
knowledge at all. It moved there because that move "felt" best to it.
Our ability to use language to rationalize our own powers is a learned
behavior. It's not the source of the decision - except in those cases
where our rational language behaviors end up being the major controlling
factor of a given choice we make.
I let the global brain make all the hard rational decisions so I can just
be a loose cannon! :)
> > You don't normally know what reward you will be getting until you know
> > what your future circumstances are, though.
>
> Yes you do. It's back-calculated in advance, not forward-calculated when
> needed. This is what RL research has shown us. [...]
> > That is why your brain contains an elaborate world simulator, which is
> > constantly predicting the future consequences of your possible actions.
>
> It doesn't. It just acts, and then back propagates rewards to the previous
> actions so as to create good estimations of which actions are most likely
> to produce the most future rewards. And then it selects those actions over
> all the other options.
I am not too impressed by these responses. They make it seem as
though you need to try harder to find sympathetic interpretations
of other people's comments.
>> - http://www.danieldewey.net/dewey-learning-what-to-value.pdf
>
> About O-Maximizers, he writes:
>
> "Like AIXI, this agent acts to maximize an expected value, ..."
>
> The only difference is in the algorithm it uses to calculate the "expected
> value". Does he not understand that if you build a machine to do this,
> there must be hardware in the machine that calculates that expected value?
> And that such a machine can then be seen as two machines, one which is
> calculating the expected value, and the other which is picking actions to
> maximize the output of that calculation? And once you have that machine,
> his argument of appendix B once again applies?
Rereading, I think this sort of thing was at least considered. The
author writes:
``It would be convenient if we could show that all O-maximizers have
some characteristic behavior pattern, as we do with reward maximizers
in Appendix B. We cannot do this, though, because the set of O-maximizers
coincides with the set of all agents; any agents can be written in
O-maximizer
form.''
So, it is claimed that O-maximizers don't *necessarily* behave like
the reward maximizers do. I expect the author would say that - if
any agents can avoid wireheading - O-maximizers can.
That's the way I figure it. Put it to work for you.
Yes, for sure.
Looking at the article with a bit more care, I can honestly say I don't
understand what he's saying! Maybe you can explain this...
What's an outcome?
He writes "First, we posit an "outcome", an overall effect resulting from
all of the agent's interactions with the environment, denoted by r"
Ok, that's clear enough at an intuitive level. But he seems to use it
formally later, not just intuitively.
Then he defines r as a member of the set R (all possible outcomes). Again,
easy enough.
But then he says an r is a partition of all possible _histories_ of the
universe. What is a universe and what is a history? He never defines them.
By using the word "universe", he seems to imply all possible ways the ENTIRE
UNIVERSE might play out, not just what the agent might observe and affect.
Which is kind of oddly naive - to think an agent actually has some ability
to change how the universe is going to "play out".
Or is he only saying all possible yx sequences?
But is that the historical sequence from y1 x1 to yk xk? Or future estimated
sequences from yk xk to ym xm? Or all possible sequences from y1 x1 to ym
xm?
But then he includes the probability distribution P, which is the
conditional probability of a given "outcome" r happening given a yx
sequence from 1 to m (meaning history plus future history). Now if r is a
partition of all possible yx sequences, then P is trivial: it's 1 or 0
depending on whether the sequence is part of the partition. Which doesn't
seem to be what he would be suggesting, so I guess we have to assume r is
in fact some history of the entire universe (as if there could ever be
more than one r, period).
This just all strikes me as naive nonsense.
To build such an agent, the utility function U would have to be created in
hardware. That could be easy, because we are free to make up any U we want,
defining whatever set of r's we want it to test for. We have the question he
addresses in the paper as to whether we could create a U that makes the
robot do what we want, but at least it's possible to build such a U.
But then there's that P function. I don't get the point of it. Either the
outcome has happened, as defined by the observations, or it hasn't. If you
can't know if an outcome has happened, then how are you going to estimate
the probability of it happening given a sensory history sequence?
All the robot really knows is the sensory sequence. By defining the
concept of an "outcome" (beyond what can be known from the sensory
sequence) he's only added complexity there is no point to. For any given
sensory sequence, he can simply define a new function Q, which is the sum,
over the set of all possible outcomes, of the products U times P. This new
Q function becomes the new utility function that maps a history sequence to
a real value, without any concept of an outcome. He's added nothing useful
by defining the concept of an outcome.
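
In symbols (my notation, reading his definitions loosely), that Q is just:

  Q(yx_{\le m}) = \sum_{r \in R} U(r) \, P(r \mid yx_{\le m})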
Actually, I see he defines my Q as the expected utility.
But this "expected utility" is nothing more than a fixed function that must
be specified by the designer.
And we see when he compares AIXI to his O-Maximizer, the only difference in
the equations is the sum of the rewards for a sequence times algorithmic
probability, vs the "expected utility". Since the rewards are also just a
fixed function specified by the designer, he hasn't changed anything at all.
He's just "renamed" the function that has to be specified by the designer.
Assuming "algorithmic probability" is some function that can be calculated
(I don't know what it is, but if it's not a function that can be calculated,
then the agent can't be defined or built and this whole thing is moot).
So in the end, he's done nothing but change the name of the reward
function that must be specified by the AI designer.
And as he goes on, he shows any agent can be written as an O-maximizer by
giving it a utility function which gives a 1 to every sequence the agent
would produce and a 0 for every one it would not.
But at the same time, the exact same logic applies to a reward maximizer.
Give the reward maximizer a reward function of 1 for every y that matches
the agent and 0 for every y that does not match, and AIXI will act just
like the agent as well. Unless there is some hidden significance in the
concept of "algorithmic probability" that I am not aware of, his logic
about O-Maximizers applies equally well to reward maximizers.
I just don't "get" the logic of this paper.
> Actually, I see he defines my Q as the expected utility.
>
> But this "expected utility" is nothing more than a fixed function that must
> be specified by the designer.
>
> And we see when he compares AIXI to his O-Maximizer, the only difference in
> the equations is the sum of the rewards for a sequence times algorithmic
> probability, vs the "expected utility". Since the rewards are also just a
> fixed function specified by the designer, he hasn't changed anything at all.
> He's just "renamed" the function that has to be specified by the designer.
Having utility be an expected sum of future rewards (based on
previous assignments of rewards by a critic in a reinforcement
learning system and some kind of induction-based forecasting)
is a bit different from having it assigned by a specified utility
function.
In the latter case, the utility function can be arbitrarily specified
by the designer of the system.
For example, say I have three possible sense states (A,B,C), and 3 possible
actions (D,E,F) and three possible rewards (1,2,3).
Let's say my history consists of: A,D,1 - B,E,3 - A,E,3 - B,D,1.
A utility function could propose action D next - in response to
sense-data A - whereas a reward maximiser would most likely
figure that E has received consistently much better rewards than
D so far - so, it would probably be better to go with that.
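
Pinning that example down in toy code (the "reward maximiser" here is
nothing but a crude per-action averager):

history = [("A", "D", 1), ("B", "E", 3), ("A", "E", 3), ("B", "D", 1)]

def utility_policy(sense):
    # A designer-specified utility function can propose anything it likes -
    # say, "always answer A with D".
    return {"A": "D", "B": "E", "C": "F"}[sense]

def reward_policy(sense, history, actions=("D", "E", "F")):
    # Average the reward each action has earned so far and pick the best.
    def avg(a):
        rs = [r for (_, act, r) in history if act == a]
        return sum(rs) / len(rs) if rs else 0.0
    return max(actions, key=avg)

print(utility_policy("A"))            # D
print(reward_policy("A", history))    # E - it has earned 3s, D only 1s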
> Assuming "algorithmic probability" is so a function that can be calculated
The reference to "Algorithmic probability" is talking about Solomonoff
induction - using a formal theory of induction to figure out the probability
of what is most likely to happen next, given your historical sense data.
The stuff about "universes" is needed to fit in with the formalism of AIXI.
It is pretty conventional to assign utilities to "outcomes" in the
"universe"
even though all you know about the world comes from instincts and
through your senses, which only sample a tiny fraction of it.
Yes, but the reward function is not part of the given environment. It's
specified by the designer just like the utility function is. And no matter
what algorithm the reward maximizer is using, you can always generate
rewards so as to manipulate the reward maximizer into picking the same
actions the utility function would make it pick. If you wanted it to pick
D, you wouldn't give E lots of rewards.
I'm not 100% sure that this is totally valid logic - the reward maximizer
might not be so easily reversible in that sense. But the algorithm used by
the reward maximizer is part of what is given by the designer, and there's
nothing saying you can't choose to use a very simple reward maximizer.
However, in the case of AIXI, I think that algorithmic probability is
fixed, and not really so easily reversible for _EVERY_ action. I would
guess however that for any utility function, you could use rewards to TRAIN
the reward maximizer to act the same as the utility function - so it would
only be a question of how much training it would take before all the
behavior of the utility function was trained into the reward maximizer. So
you just give it 1 reward when it makes the decision the utility function
would have made it choose, and 0 for every mistake. And if it was a good
enough learner, it would in time converge on any utility function.
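
The training loop I have in mind is nothing fancier than this rough sketch
(the learner here is a trivial tally; any good enough learner would do):

from collections import defaultdict
import random

def train_to_imitate(utility_policy, senses, actions, steps=5000):
    score = defaultdict(float)                  # (sense, action) -> tally
    for _ in range(steps):
        s = random.choice(senses)
        a = max(actions, key=lambda x: score[(s, x)] + random.random() * 0.1)
        reward = 1.0 if a == utility_policy(s) else 0.0
        score[(s, a)] += reward - 0.5           # reinforce matches, punish misses
    return {s: max(actions, key=lambda x: score[(s, x)]) for s in senses}

target = lambda s: {"A": "D", "B": "E", "C": "F"}[s]
print(train_to_imitate(target, ["A", "B", "C"], ["D", "E", "F"]))
# with enough steps this converges on {'A': 'D', 'B': 'E', 'C': 'F'}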
> > Assuming "algorithmic probability" is so a function that can be
> > calculated
>
> The reference to "Algorithmic probability" is talking about Solomonoff
> induction - using a formal theory of induction to figure out the
> probability of what is most likely to happen next, given your historical
> sense data.
Ok. I don't formally know what Solomonoff induction is, but I can guess as
to its basic idea.
> The stuff about "universes" is needed to fit in with the formalism of
> AIXI.
>
> It is pretty conventional to assign utilities to "outcomes" in the
> "universe"
> even though all you know about the world comes from instincts and
> through your senses, which only sample a tiny fraction of it.
Ok, but can the "outcome" be 100% correctly sensed in the sense data? Or
does it need oracle-like powers to "know" what the outcome was?
If it needs more than what is in the sense data, the machine can't be built
and is nonsense in the context of talking about AI. AI machine behavior is
limited to what it can sense. And if it's in the sense data, then there is
no point in talking about "outcome" as if it were separate from something
which was computed from PAST sense data because PAST sense data is all it
can ever have access to since time travel is not yet possible.
"
The only problem with Solomonoff induction is that it is
incomputable, that is, it would require a computer with
infinite processing power to run. However, all successful
inductive schemes and machines -- including animals and
humans -- are approximations of Solomonoff induction."
Yeah, that's simple enough. I get the gist of what they are thinking even
though I don't know the exact formulas of Solomonoff induction.
It's the same idea as a problem which is solved "perfectly" by performing a
sort, but for a super large data set, it can't be sorted fast enough to be
useful, so what we end up doing, is creating an approximation that is not
"perfect" but which does a good enough job to be highly useful.
I would guess that Solomonoff induction requires a complete memory of the
entire past history that is reanalyzed for every decision - so the first
obvious simplification is to not remember everything, but to just keep a
summary and make the best estimates we can from the reduced summary
information.
There are computable approximations - e.g. Levin search:
http://www.scholarpedia.org/article/Universal_search
> Yeah, that's simple enough. I get the gist of what they are thinking even
> though I don't know the exact formulas of Solomonoff induction.
>
> It's the same idea as a problem which is solved "perfectly" by performing a
> sort, but for a super large data set, it can't be sorted fast enough to be
> useful, so what we end up doing, is creating an approximation that is not
> "perfect" but which does a good enough job to be highly useful.
>
> I would guess that Solomonoff induction requires a complete memory of the
> entire past history that is reanalyzed for every decision - so the first
> obvious simplification is to not remember everything, but to just keep a
> summary and make the best estimates we can from the reduced summary
> information.
Solomonoff induction is an abstract model of high-quality induction.
It takes a finite sequence and predicts the probabilities associated with
each of the possible symbols that might come next in the stream by using
their
Kolmogorov complexity. Essentially, it considers what fraction of computer
programs in a specified language produce such sequences as prefixes - in
the limit as program length goes to infinity, and then uses that information
to produce probability estimates for the next symbol.
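
Schematically (glossing over the prefix-machine details, so treat this as a
rough statement rather than the precise definition):

  M(x_{1:t}) = \sum_{p \,:\, U(p) \text{ starts with } x_{1:t}} 2^{-|p|}

  P(x_{t+1} \mid x_{1:t}) = M(x_{1:t} x_{t+1}) / M(x_{1:t})

where p ranges over programs for a universal machine U and |p| is the
program's length in bits.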
So: it works with whatever length sequence you happen to have - but it
doesn't work so well with summaries or patchy data - if you want to
predict from them, you would probably have to use some other technique.
http://hagiograffiti.blogspot.com/2007/11/simple-heuristic-explanation-of.html
Could it invent a model of the Solar System from observational
data such as that collected by Tycho Brahe, from which to make
predictions about the future state of the Solar System from any
given state?
There is much talk about AGI, but what kind of scientific work has
any of them actually achieved?
JC
>>> " The only problem with Solomonoff induction is that it is
>>> incomputable [...]"
>>
>>> http://www.wisegeek.com/what-is-solomonoff-induction.htm
>>
>> There are computable approximations - e.g. Levin search:
>>
>> http://www.scholarpedia.org/article/Universal_search
>
> Could it invent a model of the Solar System from observational
> data such as that collected by Tycho Brahe, from which to make
> predictions about the future state of the Solar System from any
> given state?
Of course - given enough time. Calculating the K-complexity
of some data is done by finding the shortest program which
generates the data - which is then a concise model of that
data, of the type preferred by Occam's razor.
> There is much talk about AGI, but what kind of scientific work has
> any of them actually achieved?
Well, we don't know about efficient ways to perform approximations
of Solomonoff induction / Levin search yet. If we did: blue skies.
> However, in the case of AIXI, I think that algorithmic probability is
> fixed, and not really so easily reversible for _EVERY_ action. I would
> guess however that for any utility function, you could use rewards to TRAIN
> the reward maximizer to act the same as the utility function - so it would
> only be a question of how much training it would take before all the
> behavior of the utility function was trained into the reward maximizer. So
> you just give it 1 reward when it makes the decision the utility function
> would have made it choose, and 0 for every mistake. And if it was a good
> enough learner, it would in time converge on any utility function.
I *think* that is right. You could, in theory, train an R-L agent to produce
arbitrary finite sequences of actions in response to specified finite sense
data - assuming you are allowed to program its brain and its reward
function, and give it fake memories dating from before it was born.
That is essentially the same thing as was claimed for O-maximisers.