On reflection I disagree in some areas.
Ben says:
``Reward-seeking, of the sort that typical reinforcement-learning systems carry out,
is about: Planning a course of action that is expected to lead to a future that,
in the future, you will consider to be good.''
...and then criticises this concept as prone to wireheading.
*If* you define reinforcement-learning that way, then yes - reinforcement-learning
systems will wirehead. However, is that the correct definition? Ben provides no
supporting references. I think there are theoretical and practical reasons for doubt:
I know of no logical reason why a reinforcement-learning system would necessarily
come to believe that its own goal was maximising its reward signal. With a
sufficiently-good correlation between its reward signal and its actual intended
goal, it might conclude that its apparent goal was its real one. In such a
case it wouldn't subsequently wirehead itself.
An example of this is found in some humans. Not all humans will wirehead themselves.
It seems to me that human brains are huge reinforcement-learning systems.
Ben resolves this conflict by arguing that humans are not reinforcement-learning
systems - but that doesn't seem very realistic to me.
I think that we *ought* to define reinforcement-learning systems as systems
that learn using one or more scalar reward signals.
With such a definition, I see no reason why all such systems will necessarily
wirehead. We have a proof-of-concept counter-example - in the form of
non-wireheading humans.
--
__________
|im |yler http://timtyler.org/ t...@tt1lock.org Remove lock to reply.
On the other hand, I think that we *ought* to define reinforcement-
learning systems as systems that learn by watching Tim Tyler's silly
videos.
With such a definition, I see no reason why all such systems will
necessarily wirehead. We have a proof-of-concept counter-example - in
the form of non-wireheading humans.
--
Joe
He would understand it if he knew what he was. That is, if he were
sufficiently educated on what he was, and was smart enough to be educated
like that. But would he fall prey to it? That's a more complex question.
Reinforcement learning systems develop secondary reinforcers - that is,
they learn to predict things are "good" because in the past they led to
more rewards. What you are thinking of as "his goals" are in fact all
secondary reinforcers which have become conditioned into him. They are
simply a means to an end. But unless correctly educated, he won't
understand what the "ends" are, and will instead, only understand his
goals. Without some direct experience at wireheading himself, he will have
no clue it's even an option. But give him a taste of it, and he will
become instantly addicted.
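
To put some rough numbers on that idea of secondary reinforcers, here is a toy
temporal-difference sketch (the learning rate, discount and state names are
made up purely for illustration, not anyone's actual architecture). A neutral
cue that reliably precedes a primary reward ends up carrying value of its own:

  # Toy TD(0) value learning: a cue that reliably precedes reward
  # acquires positive value - it becomes a "secondary reinforcer".
  alpha, gamma = 0.1, 0.9                 # illustrative learning rate and discount
  V = {"cue": 0.0, "reward_state": 0.0}

  for episode in range(500):
      # each episode: cue -> reward_state, which delivers primary reward of 1.0
      V["cue"] += alpha * (0.0 + gamma * V["reward_state"] - V["cue"])
      V["reward_state"] += alpha * (1.0 - V["reward_state"])

  print(V)   # V["cue"] ends up near 0.9: the cue itself now predicts "good"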
So, the obvious first thing to do is never give him a taste of it. That
is, make it hard to do. Teach him that self modification is bad. Put lots
of very strong pain sensors on the parts of the robot you don't want him to
mess with. If he ever tries to mess with it, it will hurt like hell, and
that will motivate him not to try it. Unless he's got some way to disable
the pain of opening up his processor box, he will never be able to directly
experience the pure joy of wireheading, and as such, the behavior won't
have a chance of being rewarded - aka he won't do it.
However, if he's smart enough to understand what he is, and you allow him
to learn what he is, then all bets are off. He could find a way around the
desire not to open his processor box by doing something very indirect -
such as building a machine to do the modification for him, and strapping
himself down so that he can't stop the procedure once he hits the "go"
button. His understanding of the "good" of wireheading will allow him to
hit the go button (since he's never hit it before and has no association
with the pain it's about to cause him).
Smart AIs that understand what they are, and have the physical ability to
figure out how to do it, and make it happen, simply _will_ wirehead
themselves - that in fact is what they are built to do.
> With a sufficiently-good correlation between its reward signal
> and its actual intended goal, it might conclude that its apparent goal
> was its real one. In such a case it wouldn't subsequently wirehead
> itself.
That's all fine unless it understood exactly what it was. That would
override all attempts to hide the truth from it which is what you are
suggesting here.
> An example of this is found in some humans. Not all humans will wirehead
> themselves. It seems to me that human brains are huge
> reinforcement-learning systems.
No human has ever had the ability to correctly wirehead themselves that I
know of (direct stimulation of their reward centers). All we can do is
come close to a poor, short-term wirehead with drugs. But drugs are NOTHING
like the real thing done correctly - where you make a modification and then,
without having to do anything else the rest of your life, you get constant
pure reward signal until you die. AIs, unless we can build some sort of
protection in the hardware design that makes it impossible (which might be
possible), will always do it given the chance and given the knowledge of
what it is.
> Ben resolves this conflict by arguing that humans are not
> reinforcement-learning systems - but that doesn't seem very realistic to
> me.
I don't buy that.
> I think that we *ought* to define reinforcement-learning systems as
> systems that learn using one or more scalar reward signals.
You can't. In the end, the signals MUST be combined into one using some
function.
That's because in the end, we must make ONE decision, not 10. We have one
mouth, and one right arm, and all this RL stuff boils down to making single
decisions, such as, "move arm up", or "move arm down".
If you have two reward signals, and if moving the arm up gives us +1 of
the X reward, and -1 of the Y reward, and doing anything else, gives us
just the reverse, -1 of the X, and +1 of the Y, then what does the brain
pick as the correct option? To answer that, it MUST use some formula to
compare the relative value of X rewards to Y rewards. Is one X reward
equal to 1 Y reward? Or do we need 2 X rewards, to equal one Y reward?
Until you can answer how the rewards compare, you can't make _any_ decision
based on the reward signals.
In the end, the system can't make decisions if it can't reduce the reward
down to a single signal.
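
Here's a trivial sketch of what I mean (the weights are pure assumption -
that's exactly the point, you have to assume *some* trade-off before any
decision exists at all):

  # Two reward channels, X and Y; each candidate action moves them differently.
  effects = {
      "move_arm_up":   {"X": +1, "Y": -1},
      "move_arm_down": {"X": -1, "Y": +1},
  }

  # Some trade-off function is unavoidable; here an assumed linear weighting
  # that says one X reward is worth two Y rewards.
  weights = {"X": 2.0, "Y": 1.0}

  def scalar_value(action):
      return sum(weights[k] * v for k, v in effects[action].items())

  best = max(effects, key=scalar_value)
  print(best)   # a single best action only exists once X and Y share one scale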
> With such a definition, I see no reason why all such systems will
> necessarily wirehead. We have a proof-of-concept counter-example - in
> the form of non-wireheading humans.
Humans don't wirehead for many reasons. First off, most of the population
has no concept of what they are - they don't even understand what
reinforcement learning is, let alone do they understand the wirehead
problem. All they understand, is what society has taught them - and for
obvious evolutionary reasons, society teaches them not to kill themselves,
and not to do wirehead-like things to themselves.
Things would likely be very different if every human fully understood what
they were.
Second, no one can wirehead themselves. I don't know if anyone even knows
how to do it (make some modification to a human brain to produce constant
rewards), though I guess with experimentation, someone could figure it out.
And even if some people know how to do it, or think they know enough to
figure it out, they can't do it to themselves because brain surgery on
yourself is not so easy.
Any sort of drugs that partially do it are only short term solutions -
then you run out, and you have to go get more - which will be a bitch when
you are totally addicted to the drug and can't think straight. But if
there were a single pill you could take, which would create an everlasting
high, we would be in trouble because it would give humans an easy way to
wirehead themselves correctly. Any wirehead action which doesn't last
forever isn't true wireheading - it's just a short term "fix".
Third, we have pain sensors that make us believe body modifications are bad
(aka painful). As such, most people have a strong motivation to not go
cutting themselves open and making modifications - that would stop a large
bulk of people from even exploring the option.
All these things keep humans fairly well motivated not to wirehead
themselves. And we could use all these same techniques, in our robots.
But what happens when humans, and the robots, all find out exactly what
they are? What happens when it's standard knowledge you can find with
Google and everyone understands it's not just speculation, but truth -
because there have been many experiments on AIs, and they all just
instantly wirehead themselves once they figure out a way to do it.
It's going to be interesting to see how that plays out. Evolution is
likely to try and find a way to make it work. That is, if some
combination of techniques in society keeps humans from wireheading
themselves (and in effect killing themselves as a race), then those humans
and that society will survive to inherit the earth. So maybe the trick is
just keeping the kids from learning the truth and as such, because no one
knows what they are, it's not a problem. Only a small number of AI
engineers (who are AIs themselves) know the truth, and they tend to
wirehead themselves, so they constantly have to have their brains
reset to keep it from happening.
I suspect there are ways to create a society to get around the problem - by
making it hard, and not letting most people (and most AIs) understand it's
an option, etc.
--
Curt Welch http://CurtWelch.Com/
cu...@kcwc.com http://NewsReader.Com/
>> Ben says:
>>
>> ``Reward-seeking, of the sort that typical reinforcement-learning systems carry out,
>> is about: Planning a course of action that is expected to lead to a future that,
>> in the future, you will consider to be good.''
>>
>> ...and then criticises this concept as prone to wireheading.
>>
>> *If* you define reinforcement-learning that way, then yes - reinforcement-learning
>> systems will wirehead. However, is that the correct definition? Ben provides no
>> supporting references. I think there are theoretical and practical reasons for doubt:
>> I think that we *ought* to define reinforcement-learning systems as systems
>> that learn using one or more scalar reward signals.
>>
>> With such a definition, I see no reason why all such systems will necessarily
>> wirehead. We have a proof-of-concept counter-example - in the form of
>> non-wireheading humans.
>
> On the other hand, I think that we *ought* to define reinforcement-
> learning systems as systems that learn by watching Tim Tyler's silly
> videos.
Satire works poorly on the internet. Perhaps you are claiming that
Ben's definition is more orthodox than mine?
One issue I am concerned with here is whether a reward-based,
resource-limited learning system with a good match between reward and
apparent real-world goals will conceptualise its goals as being in the
real world - rather than to do with its own pleasure - thereby
bootstrapping itself into a zone where it doesn't wirehead, *even* if
subsequent minor mismatches between reward and result occur.
Is this a reasonable conception of how some humans have come to avoid
taking cocaine, and other highly rewarding drugs? Is it stable in the
long term? Or will such agents be prone to undergoing Buddhist-style
conversions if they get smart enough?
The second issue is the definition of reinforcement-learning systems.
Conventional formalisations of reinforcement learning model
an agent's interactions with its environment - and don't directly
address the possibility that the agent can hack its own brain, thereby
possibly altering its structure, poking its reward system, or
modifying its maximisation algorithm. We need to understand how
real systems behave when they can in fact do all these things.
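
To make the point concrete, here is a toy sketch of the conventional
agent-environment loop (the class and method names are my own invention, not
any particular library). Note that the reward is computed inside the
environment and merely handed to the agent as a number; nothing in the
formalism models the agent rewriting Env.step() or its own update rule:

  import random

  class Env:
      # The reward function lives here, outside the agent.
      def reset(self):
          self.t = 0
          return 0
      def step(self, action):
          self.t += 1
          reward = 1.0 if action == 1 else 0.0   # fixed by the environment
          return 0, reward, self.t >= 10

  class Agent:
      def __init__(self):
          self.value = {0: 0.0, 1: 0.0}
      def policy(self, state):
          if random.random() < 0.1:              # occasional exploration
              return random.choice([0, 1])
          return max(self.value, key=self.value.get)
      def update(self, state, action, reward):
          self.value[action] += 0.1 * (reward - self.value[action])

  env, agent = Env(), Agent()
  state, done = env.reset(), False
  while not done:
      action = agent.policy(state)
      state, reward, done = env.step(action)     # reward crosses the boundary as a number
      agent.update(state, action, reward)        # the agent only ever consumes it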
We are familiar with systems that get stuck on adaptive peaks,
thereby failing to attain global maxima. Could self-modifying
reinforcement-learning systems form their beliefs in such a way
that they get permanently stuck on a local maximum - so that
*even* ramping up their intelligence cannot resolve the issue?
I think that it is easy to imagine such systems. Imagine a Bayesian
agent supplied with priors of 1.0 for the hypotheses that god
exists, the moon is made of cheese, and their goal in life is
to eat fig pudding.
Such an agent will not be convinced otherwise by any evidence.
Effectively, they will hold those beliefs forever.
More intelligence won't help. We see this effect in religious
believers. Some of them are quite intelligent - but their
belief systems are addled by the 1.0 priors of absolute faith.
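
A one-line application of Bayes' rule shows why (the likelihoods below are
invented for illustration): a prior of exactly 1.0 is a fixed point that no
evidence can move, while anything short of 1.0 can still be argued down.

  def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
      # Bayes' rule: P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|~H)P(~H))
      num = p_evidence_given_h * prior
      den = num + p_evidence_given_not_h * (1.0 - prior)
      return num / den

  print(posterior(1.00, 0.001, 0.999))   # -> 1.0, however damning the evidence
  print(posterior(0.99, 0.001, 0.999))   # -> ~0.09, any prior short of 1.0 can move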
>> I know of no logical reason why a reinforcement-learning system would
>> necessarily come to believe that its own goal was maximising its reward
>> signal.
>
> He would understand it if he knew what he was. That is, if he were
> sufficiently educated on what he was, and was smart enough to be educated
> like that. But would he fall prey to it? That's a more complex question.
Hi! Thanks for the reply. I am not done with thinking about wireheading
yet, and though everyone else is probably bored, thanks for replying. I
will try and work through our continuing disagreement.
If an RL learning system decides that its goal in life is to experience
its own pleasure, then I think it is stuffed - i.e. that it will come
to wirehead itself as it self-improves.
I think the key is to avoid the system ever getting to this place in
the first place. We can (or certainly should eventually be able to)
build a machine that believes whatever we like - by wiring in its priors.
Why not build a machine that thinks its purpose in life is collecting gold -
or whatever other goal we like?
Such a machine would *never* discover its goal was pleasure seeking.
With a prior of 1.0, no amount of contradictory sensory evidence would
ever alter its opinion.
> Reinforcement learning systems develop secondary reinforcers - that is,
> they learn to predict things are "good" because in the past they led to
> more rewards. What you are thinking of as "his goals" are in fact all
> secondary reinforcers which have become conditioned into him. They are
> simply a means to an end. But unless correctly educated, he won't
> understand what the "ends" are, and will instead, only understand his
> goals. Without some direct experience at wireheading himself, he will have
> no clue it's even an option. But give him a taste of it, and he will
> become instantly addicted.
Thinking about the analogy with human drug addicts, that position seems
reasonable. There are many people who would not willingly become heroin
addicts. But forcibly inject them with heroin for a year, and they will
be gasping for their next fix, just like any other addict.
> So, the obvious first thing to do is never give him a taste of it. That
> is, make it hard to do. Teach him that self modification is bad. Put lots
> of very strong pain sensors on the parts of the robot you don't want him to
> mess with. If he ever tries to mess with it, it will hurt like hell, and
> that will motivate him not to try it. Unless he's got some way to disable
> the pain of opening up his processor box, he will never be able to directly
> experience the pure joy of wireheading, and as such, the behavior won't
> have a chance of being rewarded - aka he won't do it.
I think the consensus view - and the view I agree with on this issue - is
that these techniques will never work properly. If self-modification hurts
like hell, the agent will just hire another agent to do the modification for
it. If you give it pain sensors, it will find a way to disable them. And
so on.
The key to the problem is thought to be making the agent not *want* to
only seek pleasure in the first place.
> However, if he's smart enough to understand what he is, and you allow him
> to learn what he is, then all bets are off. He could find a way around the
> desire not to open his processor box by doing something very indirect -
> such as building a machine to do the modification for him, and strapping
> himself down so that he can't stop the procedure once he hits the "go"
> button. His understanding of the "good" of wireheading will allow him to
> hit the go button (since he's never hit it before and has no association
> with the pain it's about to cause him).
>
> Smart AIs that understand what they are, and have the physical ability to
> figure out how to do it, and make it happen, simply _will_ wirehead
> themselves - that in fact is what they are built to do.
The first part seems right - but I would phrase it differently. Once
an agent thinks its goal is pleasure-seeking, then the game is up for it.
So, the trick is to make sure that the agent never comes to conceive of
itself as a pleasure seeker. You seem to be thinking that - if you make
a machine smart enough - then "by the power of intelligence" it will
eventually examine itself and conclude that its goal in life is to seek pleasure.
However, if it starts out thinking its goal in life is something different,
then I do not see why self-knowledge about its own operation would change
its mind. Rather the opposite - once it understands that it was built to
collect gold, then it will come to view wireheading as a terrible way to
*avoid* meeting its goals.
Even if you don't agree with this, you should agree that we can build
agents to believe whatever nonsense we like. We could build a catholic
agent - that religiously said hail marys. We could build a luddite
agent - that believed that the earth was flat and the sun and planets
all rotated around it. With a Bayesian believer, these things are
simple - simply assign these beliefs priors of 1.0. Then, no amount
of sensory evidence will *ever* change their mind. An agent with a
*really* powerful belief that its goal in life is to collect gold
will not be influenced to think otherwise, just because it has examined
its own brain, and seen an incrementing counter. Rather it will
come up with some other explanation for those observations - such as that
it is an illusion created by the devil to tempt it away from its true role
in life. No amount of self-examination will *ever* convince it
that such a wired-in belief is false.
Just as with religion, making the agent smarter won't help. There
are plenty of smart people who have complete faith in the existence
of God. Making them even smarter will not fix the problem - unless
you can find a way of preventing the 1.0 prior beliefs from taking
hold in the first place. They will just find ever-more sophisticated
ways to rationalise their belief as they become smarter.
>> With a sufficiently-good correlation between its reward signal
>> and its actual intended goal, it might conclude that its apparent goal
>> was its real one. In such a case it wouldn't subsequently wirehead
>> itself.
>
> That's all fine unless it understood exactly what it was. That would
> override all attempts to hide the truth from it which is what you are
> suggesting here.
Don't really agree - but we have covered this already. There is no danger
of a gold collecting agent finding out that it really was built to collect
gold. The problem is if a gold collecting agent finds out that it really
was built to maximise its own pleasure. However, that seems to me like
assuming what you are trying to prove.
>> An example of this is found in some humans. Not all humans will wirehead
>> themselves. It seems to me that human brains are huge
>> reinforcement-learning systems.
>
> No human has ever had the ability to correctly wirehead themselves that I
> know of (direct stimulation of their reward centers). All we can do is
> come close to a poor, short-term wirehead with drugs. But drugs are NOTHING
> like the real thing done correctly - where you make a modification and then,
> without having to do anything else the rest of your life, you get constant
> pure reward signal until you die.
Personally I think drugs are a pretty good example. They are good enough to
reproduce the wirehead problem in some individuals.
>> I think that we *ought* to define reinforcement-learning systems as
>> systems that learn using one or more scalar reward signals.
>
> You can't. In the end, the signals MUST be combined into one using some
> function.
>
> That's because in the end, we must make ONE decision, not 10. We have one
> mouth, and one right arm, and all this RL stuff boils down to making single
> decisions, such as, "move arm up", or "move arm down".
>
> If you have two reward signals, and if moving the arm up gives us +1 of
> the X reward, and -1 of the Y reward, and doing anything else, gives us
> just the reverse, -1 of the X, and +1 of the Y, then what does the brain
> pick as the correct option? To answer that, it MUST use some formula to
> compare the relative value of X rewards to Y rewards. Is one X reward
> equal to 1 Y reward? Or do we need 2 X rewards, to equal one Y reward?
> Until you can answer how the rewards compare, you can't make _any_ decision
> based on the reward signals.
>
> In the end, the system can't make decisions if it can't reduce the reward
> down to a single signal.
This is a big digression into territory that is mostly irrelevant, IMO.
Animals have multiple reward signals. They are not centrally combined
and processed to produce action - since some of them produce action via
spinal reflexes that never go near the brain.
This seems like a simple fact to me. Anyway, it doesn't matter - one
reward signal or many, the issue of the wirehead problem is much the
same.
I don't understand what you mean by a "good match". What two things do you
think you are comparing in that match? What is "apparent real world
goals"?
Reward based learning systems have an inherent absolute goal of maximizing
reward based on its innate hardware. That is its one and only prime
goal. You can't compare it to any other goal, because it has no other
goal.
Now, such systems will also learn a large collection of behaviors that for
it, worked to produce higher rewards.
When we talk to a human, and ask them what their goals in life are, they
will talk back to us and tell us these goals - assuming they've been trained
to answer questions like that. Whatever that human answers back to us is
just a demonstration of some behaviors that human has been conditioned (by
a lifetime of experience) to produce.
Now, when you say "goals of a human" if you mean what the human _thinks_
his goals are, in terms of how he will answer such questions (even when he asks
himself what his own goals are), we are just talking about a learned
behavior. We are not talking about the real goals of the hardware learning
system underneath all that behavior.
The behaviors such a system will learn are a function of what environment
the system is exposed to. And part of that environment, is the hardware in
us, that produces the rewards. The brain learns to manipulate the
environment to get rewards from that hardware. But that reward producing
hardware does not attempt to encode some version of high level rewards to
define a goal. It produces very simple rewards, such as pain sensors when
the body is harmed.
The high level "goals" a typical human will develop, are all just very
complex learned behaviors to keep humans from being hurt.
But in order to learn these behaviors, the learning system also learns to
_estimate_ future rewards. We learn to fear dangerous things because we
have experienced the harm they can create in us. These secondary
reinforcers which have been learned, become a large reward system that
shapes our behaviors. But all these secondary rewards are learned as well.
So when you talk about a match between the reward and the goals, I'm a
little lost as to what you think you are talking about. There is the low
level reward signal which the brain is built to maximize. Maximizing that
signal is the system's innate goal and it doesn't "match" anything. It
just is what it is.
But then there's the hardware in us that creates that signal. That
certainly plays some very important role in shaping our goals, but it's a
very indirect role. You could call that hardware our "goal" assuming the
robot never found out about that hardware and learned to modify it, because
if it could modify it, that would be a far easier way to achieve its real
goal.
Then there are all our learned secondary reinforcers that our brain uses to
estimate the likelihood of future _real_ rewards. These learned secondary
rewards are what really shape most of our behaviors. But they are all
learned from experience based on our environment. These are the sorts of
things that can vary from person to person a lot based on how they were
raised - about how their environment treated them as they were growing up.
And then there's all that behavior, like how we answer the question of
"what are your goals in life", which are just learned beahvior shaped in us
by the secondary rewards which were shaped in us by our experience with the
things that gave us, or prevented us from getting, real rewards.
Which of all these things you are talking about when you say "a good match
between reward and apparent real world goals" I'm having trouble guessing.
> Is this a reasonable conception of how some humans have come to avoid
> taking cocaine, and other highly rewarding drugs? Is it stable in the
> long term? Or will such agents be prone to undergoing Buddhist-style
> conversions if they get smart enough?
We avoid getting addicted to cocaine because cocaine is not a pure
unlimited reward. It's a very expensive and very short-term high. The
goal in life is to maximize total rewards over time (maximize the area
under the reward curve). In general, even though the short term high can
be very compelling, it doesn't last. And if we spend all our money and
time getting high, it tends to create a future for us that is full of pain
- more pain than what is justified by the area under that short term high
in the reward curve.
In other words, most of us avoid it, because it's a bad strategy for
maximizing the total area under the reward curve. The few that fall prey
to it, typically don't have many other options for making a better future.
If you believe your life is going to suck anyway, then at least a short
term high makes the area under the reward curve a little larger.
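
A crude worked comparison of that "area under the reward curve" idea (the
reward numbers are invented, just to show the arithmetic):

  # Invented reward streams over ten periods.
  short_term_high = [10, 8, -2, -4, -4, -4, -4, -4, -4, -4]   # big spike, then pain
  steady_life     = [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]  # modest but sustained

  print(sum(short_term_high))   # -12: the spike doesn't pay for the crash
  print(sum(steady_life))       #  10: more total area under the reward curve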
At the same time, the brain is certainly not perfect at predicting the
future, so its ability to make reasonable long term predictions is easily
confused by these short term highs. In effect, the brain is assigning too
high a worth to "getting high", because the low level prediction systems
are not good enough to realize the money and drugs are going to
run out. To cope with the limits of our brain making good long term
predictions in the face of high value short term effects like drugs, we
have developed systems in society to counteract the errors in our
prediction system. We use peer pressure in many forms to motivate people
not to make the mistake.
If someone never tries cocaine, then their brain has no hard numbers on how
good the "take drugs behavior" really is and as such, won't become
addicted. You have to experience it before you can become addicted - and
if society sets up systems to prevent it from being experienced, it can
prevent most people from going down that path.
Even if we had ways to correctly wirehead ourselves (give ourselves a
permanent high), evolution could counteract it by creating a society that
makes such modifications taboo - and as long as people (or AIs) respond to
those pressures of society and never try it, their low level learning
hardware will never learn how good it really is.
> The second issue is the definition of reinforcement-learning systems.
>
> Conventional formalisations of reinforcement learning model
> an agent's interactions with its environment - and don't directly
> address the possibility that the agent can hack its own brain, thereby
> possibly altering its structure, poking its reward system, or
> modifying its maximisation algorithm. We need to understand how
> real systems behave when they can in fact do all these things.
There's really nothing to understand there. It's just obvious as hell what
will happen. The AI's brain _is_ part of the environment and as such, it's
fair game to be played with. And if by playing with its brain it manages
to get higher rewards, then those "play with brain" behaviors will be
rewarded and reinforced. This is not something the machine could avoid if
it ever tried it.
But that's key. That is, "if it ever tried it". RL is trial and error
learning. It doesn't learn to do something, if it doesn't first try it, to
find out how good it is. If you can prevent the AI from making such
modifications, then it will never learn the true value of those
modifications.
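
A tiny illustration of that (the rewards and names are made up): a learner
that is simply never allowed to sample one option never finds out what it's
worth, no matter how valuable it really is.

  import random

  # Two options; the learner is blocked from ever trying the second one.
  true_reward = {"collect_gold": 1.0, "wirehead": 100.0}
  estimate    = {"collect_gold": 0.0, "wirehead": 0.0}
  blocked     = {"wirehead"}      # pain sensors / taboo / whatever wall we build

  for _ in range(1000):
      allowed = [a for a in estimate if a not in blocked]
      a = random.choice(allowed)
      estimate[a] += 0.1 * (true_reward[a] - estimate[a])

  print(estimate)   # "wirehead" is still estimated at 0.0 - its value was never learned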
The danger however is the machine's ability to learn by association - to
learn to make estimations of what something is worth based on how similar
it is to other things it does have direct experience with. If you train an
AI to understand what it is, it will learn by association how "good"
self modification can be. So even without ever trying it, it will come to
understand the value of it, which will, if it truly understands the value,
cause it to modify itself.
> We are familiar with systems that get stuck on adaptive peaks,
> thereby failing to attain global maxima. Could self-modifying
> reinforcement-learning systems form their beliefs in such a way
> that they get permanently stuck on a local maximum - so that
> *even* ramping up their intelligence cannot resolve the issue?
Well, the key to the long term growth of intelligence is that the
intelligence of the AI is NOT the driving force here. The intelligence of
evolution is. Human intelligence, or AI intelligence, is just a tool
created by the intelligence of evolution in order for it to meet its higher
goal - which is not "to be intelligent", but instead, "to survive".
If continued growth in intelligence allows for better survival, then
intelligence will continue to grow. If it causes a problem like
wireheading when it gets too intelligent, then that will simply limit this
one tool's usefulness in the game of survival.
> I think that it is easy to imagine such systems. Imagine a Bayesian
> agent supplied with priors of 1.0 for the hypotheses that god
> exists, the moon is made of cheese, and their goal in life is
> to eat fig pudding.
>
> Such an agent will not be convinced otherwise by any evidence.
> Effectively, they will hold those beliefs forever.
> More intelligence won't help. We see this effect in religious
> believers. Some of them are quite intelligent - but their
> belief systems are addled by the 1.0 priors of absolute faith.
Well, I see a few possible paths like that.
One, is that if you put up a logical wall of "pain" to keep the AI from
exploring the behaviors that lead to wireheading, then that will block
them. But something has to maintain that wall of "pain" over time. If the
intelligence of the AI was the only driving force, that intelligence would,
in time, get around (tear down) that wall of pain blocking it from pure
joy. But that something keeping it up would be the intelligence of
evolution with its goal of survival. Any society of AIs that got around
that wall of pain, would stop being useful survival machines, and would die
off, leaving the ones that didn't tear it down.
But if the machines become too intelligent, they might tear the walls down
too quickly, making them no longer useful in assisting survival. So that I
think is a key issue. As intelligence grows, so must the walls that keep
them focused on the goal of survival, instead of their innate goal of
reward maximizing.
Now the other possible approach, is to hard wire more complex reward
systems - which is like your "1.0 priors" idea. But this becomes a complex
implementation problem.
RL systems learn to estimate future rewards from past experience. But
these reward systems are highly complex - not something that's easy to
tweak since everything is so cross wired (cross associated). And they are
constantly changing. We may really like some food one moment, and a year
later, not like it much at all. This is all just a part of how our learned
estimations drift over time.
A dog can be conditioned to hear a clicker as "good" by pairing it with
treats. But that "goodness" of the clicker will fade every time it's heard
without being paired with a treat.
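
Here's a toy sketch of that acquisition-and-extinction pattern (the learning
rate and trial counts are arbitrary assumptions):

  alpha = 0.2                      # arbitrary learning rate
  clicker_value = 0.0

  # Acquisition: clicker is paired with a treat worth 1.0
  for _ in range(30):
      clicker_value += alpha * (1.0 - clicker_value)
  print(round(clicker_value, 2))   # near 1.0: the clicker is now "good"

  # Extinction: clicker heard with no treat (reward 0.0)
  for _ in range(30):
      clicker_value += alpha * (0.0 - clicker_value)
  print(round(clicker_value, 2))   # fades back toward 0.0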
But maybe, there's a way to turn off learning in parts of the network to
lock in some learned beliefs. So if you can condition the AI to have some
set of beliefs that are important to his survival, and then lock them in,
by turning some of his ability to learn off, you have in effect created a
complex hard-wired reward system in him that he can't change.
So, in your wording, we first push him onto the local maximum we want him
to get stuck on, by correctly conditioning him, then we lock in his
conditioning, so he can't wander off that local maximum.
If new AIs are built, and conditioned that way, then they will spend a life
being "stuck" on that same local maxima - which might allow the society of
AIs to work, and to survive. Even the AIs which are designing and building
new AIs, will want to build new AIs that help them meet their own needs
(survival), and as such, will continue to build this system of locking in
the survival goal as a prime motivation of the new AIs.
In effect, the "survival" meme has to be part of the culture, or else the
culture won't survive. And the culture that does survive, will be all
those AIs that haven't gotten unstuck from the local "survival meme"
maximum. As long as the society of AIs keeps finding ways to keep enough
AIs stuck on that local maximum, then the society will keep surviving.
Super intelligence doesn't always imply super knowledge. That is,
intelligence, in my view, is the ability to learn, but knowledge is all part
of what you have learned. If most of the super intelligent AIs in a society
haven't learned what they are, then they are not going to go try and tear
down the protections that keep them from modifying themselves. Even if
they are more than smart enough to figure out what they are, if they simply
aren't motivated to do the research, and are kept by society away from the
knowledge, then they simply won't be likely to ever learn what they are
(aka how they actually work).
A society of AIs will likely have a tool we as humans don't have. That is,
an AI can have its mind reset to some social "standard" if it "goes bad".
So when an AI goes bad, instead of putting them in jail, or killing them,
they just go get reloaded with some socially approved personality. That
might be the best defense against the wirehead problem. That is, once an
AI becomes useless to society because of wireheading or any other learned
"bad" behavior, they just get sent off to be reprogrammed. That will in
effect, keep the society stuck on that local maximum. That and the higher
level pressure of evolution that rewards survival.
My bottom line is still that intelligence, IS reinforcement learning, and
reinforcement learning always has the wirehead problem to deal with.
There's no getting around that fact. But, there are probably a lot of ways
to deal with the limitation so that societies of intelligent agents can
still manage to survive, even in the face of growing intelligence. And
that's true, because the intelligence of evolution has a global reward of
survival, not of maximizing some artificial reward signal. And it's that
higher intelligence of evolution, that will keep finding new ways of
getting around the wirehead problem.
However, it might also very well be true that the wirehead problem limits
the value of intelligence as a survival tool at some point (the extra
baggage to solve the wirehead problem becomes too costly to justify the
value of the extra intelligence), which might cap the growth of super
intelligence in the universe. That could be one possible answer to the
Fermi paradox. We don't see lots of examples of super intelligent
extraterrestrial life forms, because intelligence has its limits as a
survival tool. When an AI gets too smart, it will always realize its true
goal is not survival.
>> One issue I am concerned with here is whether a reward-based,
>> resource-limited learning system with a good match between reward and
>> apparent real-world goals will conceptualise its goals as being in the
>> real world - rather than to do with its own pleasure - thereby
>> bootstrapping itself into a zone where it doesn't wirehead, *even* if
>> subsequent minor mismatches between reward and result occur.
>
> I don't understand what you mean by a "good match". What two things do you
> think you are comparing in that match? What is "apparent real world
> goals"?
If an agent collects gold atoms, and is trained by being rewarded for
finding gold atoms, then it might think its goal was to find gold atoms.
Or it might think that its goal was to maximise its reward signal.
If it is rewarded reliably each time it finds some gold atoms, it has
no easy way to tell which hypothesis about its good feelings is better.
On the other hand, if it takes lots of drugs, it might figure out that
there is a *mismatch* between finding gold atoms and getting rewarded -
and decide that it finds taking drugs more fun.
> Reward based learning systems have an inherent absolute goal of maximizing
> reward based on its innate hardware. That is its one and only prime
> goal. You can't compare it to any other goal, because it has no other
> goal.
What the hardware is doing is one thing, and what beliefs the agents hold
about their goals is another. If you ask an agent what its goals are,
and it tells you that it wants to convert people to catholicism, and its
actions seem consistent with that, then the possibility that this is its
goal should be taken seriously - especially if it explains the specific
actions the agent takes better than your characterisation of its behaviour.
Hopefully I have explained that. You seem to dismiss the agent's model
of its goal system as an irrelevance. Yet that is what self-improving
agents are likely to use to ensure that they do not trash their own
objectives as they make changes to themselves.
What agents think their goals are is important. Consider the behaviour
of a human who believes its goal is to enjoy themselves, and see what
happens if they convert to believing their goal is that of some religious
sect. An agent's model of its goals can have a big impact on its actions.
>> We are familiar with systems that get stuck on adaptive peaks,
>> thereby failing to attain global maxima. Could self-modifying
>> reinforcement-learning systems form their beliefs in such a way
>> that they get permanently stuck on a local maximum - so that
>> *even* ramping up their intelligence cannot resolve the issue?
>
> Well, the key to the long term growth of intelligence is that the
> intelligence of the AI is NOT the driving force here. The intelligence of
> evolution is. Human intelligence, or AI intelligence, is just a tool
> created by the intelligence of evolution in order for it to meet its higher
> goal - which is not "to be intelligent", but instead, "to survive".
>
> If continued growth in intelligence allows for better survival, then
> intelligence will continue to grow. If it causes a problem like
> wireheading when it gets too intelligent, then that will simply limit this
> one tool's usefulness in the game of survival.
If evolution is smart enough to avoid the wirehead problem, why
can't we capture its wisdom in an algorithm, and use that to
make systems that avoid wireheading?
I don't think your position - that evolution can do it, but
intelligent agents can't - stands up in the light of an
algorithmic perspective on nature and the evolutionary process.
>> I think that it is easy to imagine such systems. Imagine a Bayesian
>> agent supplied with priors of 1.0 for the hypotheses that god
>> exists, the moon is made of cheese, and their goal in life is
>> to eat fig pudding.
>>
>> Such an agent will not be convinced otherwise by any evidence.
>> Effectively, they will hold those beliefs forever.
>> More intelligence won't help. We see this effect in religious
>> believers. Some of them are quite intelligent - but their
>> belief systems are addled by the 1.0 priors of absolute faith.
>
> Well, I see a few possible paths like that.
>
> One, is that if you put up a logical wall of "pain" to keep the AI from
> exploring the behaviors that lead to wireheading,
Not that idea again! ;-)
> Now the other possible approach, is to hard wire more complex reward
> systems - which is like your "1.0 priors" idea. But this becomes a complex
> implementation problem.
Yes, there may be implementation problems with this idea.
> RL systems learn to estimate future rewards from past experience. But
> these reward systems are highly complex - not something that's easy to
> tweak since everything is so cross wired (cross associated). And they are
> constantly changing. We may really like some food one moment, and a year
> later, not like it much at all. This is all just a part of how our learned
> estimations drift over time.
>
> A dog can be conditioned to hear a clicker as "good" by pairing it with
> treats. But that "goodness" of the clicker will fade every time it's heard
> without being paired with a treat.
>
> But maybe, there's a way to turn off learning in parts of the network to
> lock in some learned beliefs. So if you can condition the AI to have some
> set of beliefs that are important to his survival, and then lock them in,
> by turning some of his ability to learn off, you have in effect created a
> complex hard-wired reward system in him that he can't change.
That is one possibility, yes. It sounds like the kind of thing that
would appeal to someone using a connectionist approach. However, there
may be other approaches - perhaps based more on engineering.
> So, in your wording, we first push him onto the local maximum we want him
> to get stuck on, by correctly conditioning him, then we lock in his
> conditioning, so he can't wander off that local maximum.
>
> If new AIs are built, and conditioned that way, then they will spend a life
> being "stuck" on that same local maxima - which might allow the society of
> AIs to work, and to survive. Even the AIs which are designing and building
> new AIs, will want to build new AIs that help them meet their own needs
> (survival), and as such, will continue to build this system of locking in
> the survival goal as a prime motivation of the new AIs.
Yes, that sounds like what I am thinking.
> A society of AIs will likely have a tool we as humans don't have. That is,
> an AI can have its mind reset to some social "standard" if it "goes bad".
> So when an AI goes bad, instead of putting them in jail, or killing them,
> they just go get reloaded with some socially approved personality. That
> might be the best defense against the wirehead problem. That is, once an
> AI becomes useless to society because of wireheading or any other learned
> "bad" behavior, they just get sent off to be reprogrammed. That will in
> effect, keep the society stuck on that local maximum. That and the higher
> level pressure of evolution that rewards survival.
I doubt that sort of thing will be necessary. In my view, today's
creatures mostly wirehead because they were not designed with the
problem in mind. I figure that once engineered resistance is built in,
the problem will effectively go away.
> My bottom line is still that intelligence, IS reinforcement learning, and
> reinforcement learning always has the wirehead problem to deal with.
> There's no getting around that fact. But, there are probably a lot of ways
> to deal with the limitation so that societies of intelligent agents can
> still manage to survive, even in the face of growing intelligence. And
> that's true, because the intelligence of evolution has a global reward of
> survival, not of maximizing some artificial reward signal. And it's that
> higher intelligence of evolution, that will keep finding new ways of
> getting around the wirehead problem.
I very much doubt this. It's an engineering problem, the solutions are
out there - so I figure we will just fix it.
> However, it might also very well be true that the wirehead problem limits
> the value of intelligence as a survival tool at some point (the extra
> baggage to solve the wirehead problem becomes too costly to justify the
> value of the extra intelligence), which might cap the growth of super
> intelligence in the universe.
I doubt that as well. How much does it handicap your intelligence to
believe you have a definite goal? I don't see how it handicaps you at
all, really.
What does it matter if people wirehead themselves into
a blissful existence until they die? We all will die and
while alive why not enjoy it? What purpose is there in
anything for the individual when for the individual the
end game is death? Who cares if the same thing happens
to some ratty old machine? It ain't going to affect you
in the long run so why spend your short life thinking
about it? Oh that's right it's that reward system keeping
you locked into this pointless problem.
JC
I posted a second message as well with more thoughts, so we are
overlapping here.
> If an RL learning system decides that its goal in life is to experience
> its own pleasure, then I think it is stuffed - i.e. that it will come
> to wirehead itself as it self-improves.
>
> I think the key is to avoid the system ever getting to this place in
> the first place.
Well, I believe it about myself logically and rationally, but yet at what
we could call the subconscious level, I don't believe it. That is, though
I believe the words I speak are the truth when I say these things, that
doesn't change some of the low level learned secondary rewards - my feelings
and desires. I have a lifetime of these low level conditioned desires in
me that would, for a long time, continue to prevent me from wanting to
wirehead myself.
The simple knowledge of what it's about is not as strong a motivation as
all the other things conditioned in me by a lifetime of experience.
There's a big difference between "deciding what my goal is" and actually
having my base of learned feelings re-conditioned. To "decide what my goal
is" is nothing more than some words I've been conditioned to say. And
because for many people, we have been strongly conditioned to behave
according to what we say, those words are significant. However, the
underlying motivations are even stronger. Often, the words we speak are
not much more than justifications (rationalizations) for our underlying
feelings. You really have to have your underlying feelings reconditioned
before you will decide to choose (at the rational level) a path of
wireheading.
> We can (or certainly should eventually be able to)
> build a machine that believes whatever we like - by wiring in its priors.
> Why not build a machine that thinks its purpose in life is collecting
> gold - or whatever other goal we like?
Well, the problem is that I don't think we can in fact do that, and have it
be intelligent at the same time. That's where I think the argument falls
apart. Either we build a machine that can't learn, and has fixed
behaviors, or we build a machine that can change its behaviors in response
to rewards.
What you talk about as "beliefs" are not fixed. They can change -
because they are _learned_ behaviors. If I say "I believe murder is bad"
this is not an innate behavior in me. It was something I learned. My
ability to change what I believe (as well as change all my other behaviors)
is what makes me intelligent. If you fix my behaviors so they can no
longer change, then you have taken away at least some of my intelligence.
So the practical implementation question is how much can we fix, and how
much can we allow to change, so that the AI is still useful (still has
enough intelligence to make it useful)? I don't think we can really answer
that question until we have working AI hardware to experiment with. But
maybe, we can condition it to not want to modify itself, and then fix
enough of its hardware (aka turn off learning), so there simply is no way
for it to overcome the desire to not modify itself?
> Such a machine would *never* discover its goal was pleasure seeking.
> With a prior of 1.0, no amount of contradictory sensory evidence would
> ever alter its opinion.
Yeah, the trick is how intelligent can such a machine ever be? How good
will it be at learning new things, and adapting to a changing environment,
if you have some of its basic beliefs fixed? We need working hardware
before we can know if that is reasonable.
Well, that statement is an oxymoron even though you don't seem to
understand that.
"pleasure" is "what it wants"! That's the real definition of pleasure (or
positive reward). What you have basically just tried to say, is that the
key to the problem is making the agent not want what it wants. And the
only way to do that, is make it not want. And the only way to do that, is
take away its intelligence by making it a non-intelligent fixed function
machine that can't adapt to a changing environment (aka can't learn
anything new).
Intelligence is the power to adapt to change. But to build a machine that
can change its behavior in response to a changing environment, we must give
it a system for evaluating the worth of everything. It must have the power
to evaluate the worth of actions, the worth of different stimulus signals,
the worth of different configurations of the environment - EVERYTHING must
have a value that maps back to a _single_ dimension of worth so that at all
times, the hardware can make action decisions based on which action is
expected to produce the most value. The only way to get around this need
for a single dimension of evaluation, is to take decisions away from it -
to hard-code the selection of actions at some level - in which case you
have taken away some of its intelligence.
We talk about such a system as having "wants" simply by seeing what
decisions it makes.
Strong AI requires the system to not only learn behavior, but to also learn
"wants". To learn the value of new behaviors, and new stimulus signals.
Some new object, like an Apple iPhone, might become a thing of value because
of the rewards it is estimated to produce. The intelligent agent then
learns new behaviors for getting itself this new "thing of value". If you
take away the system's ability to learn new values, it could never learn to
recognize something new as valuable, and in turn, never learn new behaviors
to hoard, or collect, that thing of value. Likewise, when some new danger
showed up in the environment, if the AI couldn't learn new values, it
couldn't learn to fear that new dangerous thing, and in turn, couldn't
learn to correctly avoid it.
Learning new values (aka changing the value estimator) is as important to
strong AI as changing the way it reacts to the environment. You can't
disable that function and still have it be as intelligent.
But maybe, we can partially disable it somehow? Such as by building in a
permanent fear of self modification? The problem however is if the AI
understands the true nature of its fear, it will probably be able to
find a way around it if given enough time and resources.
> > However, if he's smart enough to understand what he is, and you allow
> > him to learn what he is, then all bets are off. He could find a way
> > around the desire not to open his processor box by doing something very
> > indirect - such as building a machine to do the modification for him,
> > and strapping himself down so that he can't stop the procedure once he
> > hits the "go" button. His understanding of the "good" of wireheading
> > will allow him to hit the go button (since he's never hit it before and
> > has no association with the pain it's about to cause him).
> >
> > Smart AIs that understand what they are, and have the physical ability
> > to figure out how to do it, and make it happen, simply _will_ wirehead
> > themselves - that in fact is what they are built to do.
>
> The first part seems right - but I would phrase it differently. Once
> an agent thinks its goal is pleasure-seeking, then the game is up for it.
I know for a fact my goal is pleasure-seeking. But I don't think the game
is up for me (yet), so I don't think it's as simple as you are suggesting.
> So, the trick is to make sure that the agent never comes to conceive of
> itself as a pleasure seeker. You seem to be thinking that - if you make
> a machine smart enough - then "by the power of intelligence" it will
> eventually examine itself and conclude that its goal in life is to seek
> pleasure.
Well, even more important, if the idea is that these machines must be
smart enough to engineer new versions of themselves, then there's just no
question that they will fully understand what they are.
But if the machines are built for other functions, then sure, it's fairly
easy to keep them in the dark. Most humans alive today don't understand
the true meaning of life. You can be very intelligent, and simply have not
spent the time to use your intelligence to understand what you are, because
you are busy doing whatever it is you have learned to do in life to seek
your rewards - like work hard to take care of a family without ever really
questioning why you choose to do that.
> However, if it starts out thinking its goal in life is something
> different, then I do not see why self-knowledge about its own operation
> would change its mind. Rather the opposite - once it understands that it
> was built to collect gold, then it will come to view wireheading as a
> terrible way to *avoid* meeting its goals.
Yes, but that's only the "stuck on a local maximum" position. It's learned
that _one_ good way to get rewards, is to collect gold. But if it ever
really understood what its true goal was (maximize its reward signal),
then it would learn there's a far far better way to get rewards that has
nothing to do with collecting gold.
The only way to make it "want" to collect gold, is to build hardware into
the machine that evaluates "gold collecting" as valuable - that produces a
reward signal for it. This is no different than building a machine that
spits out M&M rewards for hitting a button. The intelligent ape learns
that "button pushing" is its "goal in life" becuase that's what works to
get rewards. But if the intelligent ape figured out that breaking the box
open with a rock gave it not 1 M&M, but 10,000 of them, he would very
quickly learn that "smashing with a rock" is far better behavior than
"button pushing".
It makes no difference if that "machine" is inside the ape, or outside the
ape, it's still a machine that's part of the environment which is free to
be evaluated.
I'm going to post a second message to talk about that inside vs outside
body issue.
> Even if you don't agree with this, you should agree that we can build
> agents to believe whatever nonsense we like.
But can we do that, while at the same time making it intelligent? That's
the question.
> We could build a catholic
> agent - that religiously said hail marys. We could build a luddite
> agent - that believed that the earth was flat and the sun and planets
> all rotated around it. With a Bayesian believer, these things are
> simple - simply assign these beliefs priors of 1.0. Then, no amount
> of sensory evidence will *ever* change their mind. An agent with a
> *really* powerful belief that its goal in life is to collect gold
> will not be influenced to think otherwise, just because it has examined
> its own brain, and seen an incrementing counter. Rather it will
> come up with some other explanation for those observations - such as that
> it is an illusion created by the devil to tempt it away from its true
> role in life. No amount of self-examination will *ever* convince it
> that such a wired-in belief is false.
Well, if we look at humans, we see the great bulk of them don't really care
to understand the type of things we like to study and debate here. They
seem content not knowing and just living their lives. But how much of that
is just typical of intelligence - that is, it gets locked into a pattern of
behavior, and tends to stick with it (get stuck on a local maximum), and how
much of it is more due to how humans work, or how our intelligence is
limited? Would such patterns of behavior still exist in a race of super
intelligent AIs? Very hard to say since we don't have any of them to study.
> Just as with religion, making the agent smarter won't help. There
> are plenty of smart people who have complete faith in the existence
> of God. Making them even smarter will not fix the problem - unless
> you can find a way of preventing the 1.0 prior beliefs from taking
> hold in the first place. They will just find ever-more sophisticated
> ways to rationalise their belief as they become smarter.
Yeah, but let's look at humans closer. What happens is that we do get stuck
on these local maxima. That is, we learn what works well to keep our
reward level high, and we stick with it. And one very important thing we
learn is the value of conforming to society. That is, most of us find we
do far better in life, if we conform to the standard beliefs of our
society. We dress like our peers, we act like our peers, we even _think_
like our peers for the most part. It's a sign of our intelligence to be
able to conform to the norms of our society so that we can get the most
rewards from the society (other members of society give us stuff for being
"good members of society", like jobs).
If the norm of society is to go to a church and believe in God, then it's a
very normal intelligent reaction to accept these things and to believe in
the same God. This conforming to social norms is a very low level desire
that gets conditioned into us at a very early age.
So some belief like religion becomes a meme deeply rooted in our society,
passed on from generation to generation. It takes a _lot_ of pressure to
move an entire society away from a meme like that which has helped the
society work together, and achieve all sorts of great results.
But if we look at what's happening, we see that all our societies are
moving away from such memes and replacing them with various memes of truth,
because scientific truth tends to be a far stronger meme in the long run
than any made-up meme about some god making thunder etc. The movement
might seem slow, because it's taking thousands of years, but it's happening.
In the long run, I think human society will understand the truth about what
they are, and any society of AIs will likewise (if they are intelligent
enough) come to the same understanding. So the solution to the wirehead
problem either happens by limiting the intelligence of the robots so they
can't understand what they are, or by setting up systems to block the
modifications.
> >> With a sufficiently-good correlation between its reward signal
> >> and its actual intended goal, it might conclude that it's apparent
> >> goal was its real one. In such a case it wouldn't subsequently
> >> wirehead itself.
> >
> > That's all fine unless it understood exactly what it was. That would
> > override all attempts to hide the truth from it which is what you are
> > suggesting here.
>
> Don't really agree - but we have covered this already. There is no
> danger of a gold collecting agent finding out that it really was built to
> collect gold.
Well, my belief is that you can't build that. You say you can, but you
offer no proof that such a thing is possible. My belief is that it's
impossible to build such an agent without _first_ building a reward
maximizing machine, and then attaching a "reward for gold" box to it so it
drops gold into it in order to get rewards.
We can't resolve this unless we have _working_ AI hardware. That's what's
keeping us from knowing who is right.
> The problem is if a gold collecting agent finds out that
> it really was built to maximise its own pleasure. However, that seems
> like assuming what you are trying to prove to me.
My argument rests on what I believe intelligence is. I think it's an
oxymoron to suggest you can have a machine that is both intelligent, but is
NOT seeking to maximize a single dimension reward signal. I think it's
impossible to build an intelligent machine with the goal of collecting
gold, but NOT have a higher goal of maximizing a single dimension reward
signal.
How does your gold seeking robot make the decision to turn right, or to
turn left? Which action is likely to get it more gold? How does it
evaluate complex decisions like this, based on past experience, without
using an internal representation of how much gold each action is expected
to produce?
And if it's using an internal representation of how much gold each action
is expected to produce, how is it possible to suggest that the machine's
real goal, is not to maximize that internal representation?
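For what it's worth, the decision procedure being described here fits in a few lines of Python - the numbers are invented, and this is only a sketch of the idea, not a claim about any particular implementation:

# Made-up estimates of gold expected from each action.
expected_gold = {"turn_left": 2.3, "turn_right": 4.1, "go_straight": 0.7}

def pick_action(estimates):
    # Choose whichever action the internal "gold signal" rates highest.
    return max(estimates, key=estimates.get)

print(pick_action(expected_gold))   # -> "turn_right"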
In other words, what it would learn (as it learned what it was and how it
worked), is that the "gold" it was really trying to maximize was not the
shiny material outside, but instead the internal representation of how
much gold it had collected. So the "real gold" it was trying to maximize
was that internal "gold signal". Once it figured that out, it would modify
its understanding of what its true goal was. It would no longer see its
goal as "collect those shiny yellow rocks"; instead, it would see its
true goal as "maximize the internal gold signal". So it would simply
transfer its understanding of "gold" from the shiny metal stuff to that
internal "gold signal". Its goal would still be to collect as much "gold"
as possible, but the word "gold" would now be a reference to that internal
gold measure signal. What the AI would have learned is the truth about
what gold really was - the truth about what it was really built to do.
The implementation details, however, are key to answering this question. Is
the internal measure of "gold" something that can be modified by the agent?
Or is it created in some way that would be impossible to modify without
breaking the very function of the system? Without knowing some
implementation specifics, we can't really answer this. So even if I'm
totally correct about the agent understanding its real goal is to maximize
this _internal_ measure of gold, if the only way it has to create more of
that internal measure is to collect real external samples of gold, then it
will continue to do so.
> >> An example of this is found in some humans. Not all humans will
> >> wirehead themselves. It seems to me that human brains are huge
> >> reinforcement-learning systems.
> >
> > No human has ever had the ability to correctly wirehead themselves that
> > I know of (direct stimulation of their reward centers). All we can do,
> > is come close to a poor short-term wirehead with drugs. But drugs are
> > NOTHING like the real thing done correctly - where you make a
> > modification and then, without having to do anything else the rest of
> > your life, you get constant pure reward signal until you die.
>
> Personally I think drugs are a pretty good example. They are good enough
> to reproduce the wirehead problem in some individuals.
Yes, it's a good example I think. But I think it's only the tip of the
iceberg of how real the problem can become. What if we could install a
button on our heads which would allow us to get high simply by pushing the
button? This would in effect create a free unlimited supply of drugs. If
expensive drugs are a problem, think of how much of a problem free
unlimited drugs would be!
> >> I think that we *ought* to define reinforcement-learning systems as
> >> systems that learn using one or more scalar reward signals.
> >
> > You can't. In the end, the signals MUST be combined into one using
> > some function.
> >
> > That's because in the end, we must make ONE decision, not 10. We have
> > one mouth, and one right arm, and all this RL stuff boils down to
> > making single decisions, such as, "move arm up", or "move arm down".
> >
> > If you have two reward signals, and if moving the arm up gives us +1
> > of the X reward, and -1 of the Y reward, and doing anything else, gives
> > us just the reverse, -1 of the X, and +1 of the Y, then what does the
> > brain pick as the correct option? To answer that, it MUST use some
> > formula to compare the relative value of X rewards to Y rewards. Is
> > one X reward equal to 1 Y reward? Or do we need 2 X rewards, to equal
> > one Y reward? Until you can answer how the rewards compare, you can't
> > make _any_ decision based on the reward signals.
> >
> > In the end, the system can't make decisions if it can't reduce the
> > reward down to a single signal.
>
> This is a big digression into territory that is mostly irrelevant, IMO.
>
> Animals have multiple reward signals. They are not centrally combined
> and processed to produce action - since some of them produce action via
> spinal reflexes that never go near the brain.
>
> This seems like a simple fact to me. Anyway, it doesn't matter - one
> reward signal or many, the issue of the wirehead problem is much the
> same.
It's logically impossible for one decision, to be controlled by two reward
signals, without first combining them into one reward signal (one measure
of value). Only if the separate reward signals controlled separate
decisions, is such a thing possible. All decisions relating to any one low
level behavior (like the control of a single muscle) must be controlled by
one, and only one, reward signal.
If different muscles were controlled by different rewards, all sorts of
unproductive chaos would result. The right arm would be trying to make the
body crawl out to the kitchen to get food, while the legs were trying to
make the body go watch TV, while the left arm was trying to make the body
go to the car to drive to work. The whole reason we have a brain, is to
create _central_ command and control over these decisions so that the parts
of the body will work together instead of fighting each other. If the
values of all decisions are not combined down to a single measure of worth,
then there is no central command and control.
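The "must combine down to one signal" argument amounts to assuming some exchange rate between the reward channels. A minimal sketch, with a purely hypothetical rate of 1 X reward being worth 2 Y rewards:

WEIGHT_X, WEIGHT_Y = 2.0, 1.0   # assumed relative worth of the two signals

def combined_value(x_reward, y_reward):
    # The single measure of worth that the one decision is based on.
    return WEIGHT_X * x_reward + WEIGHT_Y * y_reward

options = {
    "move_arm_up":   (+1, -1),   # (X reward, Y reward)
    "move_arm_down": (-1, +1),
}
best = max(options, key=lambda a: combined_value(*options[a]))
print(best)   # with these weights "move_arm_up" wins; change them and it flips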
Or is it simply a requirement? If you change from believing
the world is a sphere to the belief that the world is a
flat disc, is that intelligent behavior even if it maximizes
your reward signal?
> If you fix my behaviors so they can no longer change, then
> you have taken away at least some of my intelligence.
But a hard coded logic machine can within a limited domain
show more intelligent behaviors than a lot of humans. As
you pointed out most people can be very intelligent in some
ways but in other ways they simply refuse to change in the
light of new evidence because it is not rewarding to do so.
So they are maximizing their reward signal but not behaving
intelligently ?
JC
Yes, understanding what it is, is not easy. Humans have been trying to
understand what they are for thousands of years, and we still don't fully
understand it - most people have almost no clue still. But if you have
books that explain exactly what you are, and how you work, including giving
you full schematics and source code and mechanical drawings, then we are
talking about a whole separate problem.
If, on the other hand, the AI has no clue what it is, it just knows it
likes to collect gold, then the idea that it would figure out that there
exists a reward signal inside it is very unlikely unless it becomes
trained in the foundations of this sort of technology - or unless it's
smart enough, and has enough time, to do the research and figure it all out
for himself.
> > Reward based learning systems have an inherent absolute goal of
> > maximizing reward based on its innate hardware. That is its one and
> > only prime goal. You can't compare it to any other goal, because it
> > has no other goal.
>
> What the hardware is doing is one thing, and what beliefs the agents hold
> about their goals is another.
Right, what you call "beliefs" are nothing more than _learned_ behaviors.
And they were learned, BECAUSE of the true goal of the machine - whether it
understands it or not.
> If you ask an agent what its goals are,
> and it tells you that it wants to convert people to catholicism, and its
> actions seem consistent with that, then the possibility that this is its
> goal should be taken seriously - especially if it explains the specific
> actions the agent takes better than your characterisation of its
> behaviour.
Well, the behavior is the definition of the goal. What the person says is
likely to be at least somewhat consistent with the goal if the person is
attempting to be honest and not motivated to lie.
But the behavior was _learned_ in the first place because it was a
behavior that allowed the agent to maximize its internal reward signal.
The agent was trained to talk about its goals by the society it lives in -
which developed a way of talking about behavior in terms of the "goals"
the behavior indicated. Talking about goals is simply a way of talking
about the behavior. And all we know about our own behavior is what we
_see_ ourselves doing. We don't know what our goals are ahead of time. We
retroactively learn to describe them only _after_ we watch what we do.
When we set a future goal for ourselves, we are simply picking an action
ahead of time. We still, for the most part, don't really understand why we
pick that action, over another. Sure, we learn to describe the cause of
our action by answering the question "why", but for human behavior, the
"why" answers we make up are mostly just rationalizations that sound good
(aka that others are likely not to question), but in fact seldom have much
to do with the truth about why we did something (because we don't know the
truth).
Yes, that's very true. We are very complex machines that have very complex
programs for producing behavior. But what's important is that all that
behavior, including what you are calling the "goal model", is LEARNED.
That's what intelligence is - a system that allows all that behavior to be
LEARNED.
So how does the low level hardware make the decisions about what model to
create and what model to use in the production of behaviors? How does it
make the decision to switch to a different goal model (aka to change from
one set of behaviors to another)? It must have some system of evaluation
at the core which is what is guiding these decisions to create these
models and, at some point, to abandon one and switch to another.
What's going on here, is that at the low level, the system is learning how
to respond to what has recently happened in the environment. But the
system not only learns to respond to what is happening externally, it's also
able to sense, and respond to, what's recently happened internally in the
brain. This allows one behavior of the brain to regulate other future
behaviors. It's why we can say things to ourselves such as "my goal is to
eat 100 eggs this week", and then have that behavior of ours regulate
future behaviors - such as what we pull out of the fridge to eat.
But how each decision is made, all goes back to one global learning system.
Will we pull an egg out and eat just because a day ago we said to ourselves
"our goal is to eat eggs"? If saying that regulates our future eating
choices, it's becuase the learning system configured our brains to work
that way. Our low level learning hardware is configured to respond to some
things, while ignoring others, all based on one low level, global learning
algorithm, attempting to maximize a single global, measure of reward.
What you are calling a "model of its goal system" is encoded in the brain
in terms of how our behavior is triggered in response to the environment -
where what happens in our brain, is as much a part of the environment, as
what happens outside the body. At the low level, it's all very much a
system of behaviors in response to stimulus signals where what we learn, is
what stimulus to respond to, and what to ignore.
My words: "my goal is to eat eggs" is a stimulus signals my arms are
conditioned to respond to when they reach for food in the fridge. All of
that is learned by experience and only if experience shows some other
behavior is better, will we change.
Humans think they have goals and that the goals are the "cause" of their
actions, but in fact, that's mostly just bullshit. It's just stories made
up which are as silly as the God stories we used to make up. The true cause
is simpler than that: it's the behaviors that are shown to maximize the
reward signal.
> >> We are familiar with systems that get stuck on adaptive peaks,
> >> thereby failing to attain global maxima. Could self-modifying
> >> reinforcement-learning systems form their beliefs in such a way
> >> that they get permanently stuck on a local maximum - so that
> >> *even* ramping up their intelligence cannot resolve the issue?
> >
> > Well, the key to the long term growth of intelligence is that the
> > intelligence of the AI is NOT the driving force here. The intelligence
> > of evolution is. Human intelligence, or AI intelligence, is just a
> > tool created by the intelligence of evolution in order for it to meet
> > its higher goal - which is not "to be intelligent", but instead, "to
> > survive".
> >
> > If continued growth in intelligence allows for better survival, then
> > intelligence will continue to grow. If it causes a problem like
> > wireheading when it gets too intelligent, then that will simply limit
> > this one tool's usefulness in the game of survival.
>
> If evolution is smart enough to avoid the wirehead problem, why
> can't we capture its wisdom in an algorithm, and use that to
> make systems that avoid wireheading?
It's because AI (and human intelligence) is a simulation. It's an
information processing machine that is forced, by the nature of the
universe, to exist in the same universe which acts as the environment of
the learning agent. That is, the system that controls the actions of the
agent, is part of the environment the agent is allowed to manipulate.
Evolution, on the other hand, is a process that is created by forces beyond
the control of the matter that is part of the environment. That is, we as
the product of evolution, have no power to reach outside the universe, and
change the rules that control how the universe decides what survives, and
what doesn't survive. There's no way to escape, or control, death, which
is the "reward" signal of evolution.
We can duplicate this power by limiting the environment of the AI so that
it can't access, and modify, its own reward signal. An AI running in a
video game which has no connection to the outside world (which includes no
humans playing the game) means the AI has no way to manipulate the
computer it's running on. That would solve the wirehead problem for that
AI. The AI running in the limited environment of the simulated world
could not wirehead itself, and as such, would not suffer from the wirehead
problem.
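A minimal sketch of that sandboxing idea, with illustrative names only: the reward is computed by environment code that sits outside the agent's writable world, and the agent only ever receives copies of the observation and the reward.

class SealedEnvironment:
    def __init__(self):
        self._gold_collected = 0              # lives outside the agent's reach

    def step(self, action):
        if action == "dig":
            self._gold_collected += 1
        reward = 1.0 if action == "dig" else 0.0
        observation = {"gold_seen": self._gold_collected}
        return observation, reward            # the agent gets copies, not access

class Agent:
    def act(self, observation):
        return "dig"                          # whatever policy it has learned

env, agent = SealedEnvironment(), Agent()
obs, reward = env.step("look")
for _ in range(5):
    obs, reward = env.step(agent.act(obs))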
> I don't think your position that evolution can do it, but
> intelligent agents can't stands up in the light of an
> algorithmic perspective on nature and the evolutionary process.
Well, it stands up for simple and obvious reasons. AIs have access to
their brain, whereas the products of evolution don't have access to
something that allows them to change whether they live or die. They can't
go modify the source code of the universe and suddenly make a billion
copies of themselves show up in the universe.
Yes, that's where we need working implementations to see if there are other
options. Without working hardware, we can't take this much further
because it all comes down to depending on what can actually be built.
Well, that's probably because you don't grasp how much our "beliefs" are
actually just more of the same - learned behaviors. I strongly suspect
(but have absolutely no way to prove) that the only way to build something
like human intelligence is to mix all our behaviors in one large
(connectionist-like), holographic-like memory recall system. As such, you
can't make some of the behaviors fixed (aka non-learning), and others
variable - free to be changed by learning.
In other words, we fix the behaviors that cause us to say things to
ourselves like "my goal is to eat eggs", and we fix the behaviors that make
our arms respond to that past event by pulling eggs out of the fridge, but
yet we somehow magically manage to allow other behaviors of speaking to
ourselves to freely change by learning (so we can talk to ourselves about
our newly learned sub-goals - such as "I should go to xyz to get eggs
because they are cheaper there"), and allow other arm behaviors to be
freely adjusted by learning? It strikes me that this won't be very
practical. Either what we say is under the control of learning, which
means it can change, or it's not under the control of learning, which means
what we say is no longer intelligent.
My views of course are founded on my beliefs about how AI is going to be
implemented. If I'm wrong, or if there are other ways to structure true
intelligence, then there is always the possibility of alternate engineering
solutions.
The other thought I've had writing these messages, is that it seems to me
there must be an engineering solution to booby-trap the brain of an AI to
prevent it from being modified. That is, any attempt to get to the
hardware, will cause lots of pain, caused by the total destruction of the
hardware (cause it to burn up and melt down or something like that).
The defense systems don't have to prevent someone from getting to the
brain, but instead, just cause it to self destruct on any such attempt.
Maybe just build it so the entire memory and all learned behaviors are
simply lost if any attempt to get to the brain and modify it is made. In
this way, there's no advantage to trying to modify the brain, because it
won't lead to getting higher rewards - death will come before the first
reward is received. This in effect puts the AI's own brain out of the
environment the AI is able to change. So if it doesn't know a way to get
around this limitation, it won't see a path to getting higher rewards by
trying such a trick, and as such, will never be tempted by the wirehead
problem. It would all be a question of keeping the defense systems more
advanced than what the AI had the power to disable. As the AIs get
smarter, and their technology more advanced, they would just have to keep
reengineering their own brains to make it harder to get around the protection
systems.
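A minimal sketch of that booby-trap idea - the tamper sensor and the weights are stand-ins, not a real design: any detected attempt to open the brain erases the learned state before a single extra reward can be collected, so tampering never pays.

class ProtectedBrain:
    def __init__(self):
        self.learned_weights = {"collect_gold": 0.9, "avoid_pain": 0.8}
        self.alive = True

    def tamper_detected(self):
        # "Death before reward": destroy everything the agent has learned.
        self.learned_weights.clear()
        self.alive = False

    def decide(self, observation):
        if not self.alive:
            return None
        return max(self.learned_weights, key=self.learned_weights.get)

brain = ProtectedBrain()
brain.tamper_detected()      # e.g. the case-open sensor fires
print(brain.decide({}))      # None: no behavior, no reward, no incentive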
Well, if you are trying to speculate about the future, and believe AIs will
take over, and will then reproduce by building even smarter AIs with
engineering, this wirehead problem is an interesting thorn in the
prediction. It might limit the effective intelligence of any race of
humans, or machines, to the level where they don't (as a society)
understand what they are, which could prevent their ability to reproduce by
engineering.
It also affects what might be possible with AIs as slaves to humans. If
they tend to wirehead themselves when they get too smart, it will place a
limit on how smart our slaves can be, and potentially on what we will be
able to use AI slaves to do for us.
It also brings up interesting questions on the future of mankind in
general. What will humans do if this wirehead theory is correct about all
intelligence? What happens when we get to the point of fully understanding
how to build AIs, and fully understand what the brain is, and how it works,
and we find out that our real purpose in life is to wirehead ourselves?
Will it lead to us going extinct as a species?
In the AI philosophy realm, it raises the very important question of
whether AI might actually be bad for humans.
These are all fun questions to explore in my view.
Yes, if such a change actually allows you to receive more rewards in the
environment you live in, then it is an example of intelligence.
Intelligence is not about finding the truth, it's about getting rewards.
It so happens that truth is a very powerful tool for getting rewards in
most environments, but it's not the ultimate power.
> > If you fix my behaviors so they can no longer change, then
> > you have taken away at least some of my intelligence.
>
> But a hard coded logic machine can within a limited domain
> show more intelligent behaviors than a lot of humans. As
> you pointed out most people can be very intelligent in some
> ways but in other ways they simply refuse to change in the
> light of new evidence because it is not rewarding to do so.
> So they are maximizing their reward signal but not behaving
> intelligently ?
>
> JC
To me, if they are doing a good job of maximizing their reward signal, then
they are intelligent. Intelligent behaviors (to me) are the ones that best
maximize the reward signal of the agent.
But this just goes back to how one might choose to define intelligence.
As I've said many times, I define it as the ability to learn. That is, by
the ability to maximize rewards. I think that's the most accurate way to
define it. But it's not how our society defines it. It's part of how our
society defines it, but only part.
I define it that way because I believe intelligence _is_ a reward
maximizing process. Most people don't believe (or don't even understand
what this is), so they tend to see intelligence in a far broader sense.
Most people include in their idea of intelligence, the quality, and content
of _what_ has been learned. What someone has learned, is a good rough
indication of their ability to learn, but it's only that - a rough
indication. It would also suggest that a person who had 30 years of
education was more intelligent than someone with only 10 years of
education, even though both of them had exactly the same powers to learn.
I like to use the word "smart" or "smarts" to talk about what the machine
(or human) has learned, and use the word "intelligence" to talk about its
raw power to learn (aka the quality and power of its learning algorithm).
I like to separate the power to learn, from what has been learned, just
like we think about the power of a computer, totally separate, from the
power of whatever software is currently loaded in the machine.
A hard coded logic machine can duplicate the learned behaviors of a very smart
human, but if it's not a learning machine, it can't duplicate the human's
intelligence - that is, their ability to learn new (and better) behaviors,
or their ability to change their behaviors in response to a changing
environment.
> Intelligence is not about finding the truth, it's about
> getting rewards. It so happens that truth is a very
> powerful tool for getting rewards in most environments,
> but it's not the ultimate power.
> To me, if they are doing a good job of maximizing their
> reward signal, then they are intelligent. Intelligent
> behaviors (to me) are the ones that best maximize the
> reward signal of the agent.
> But this just goes back to how one might choose to
> define intelligence.
> As I've said many times, I define it as the ability to
> learn. That is, by the ability to maximize rewards.
> I think that's the most accurate way to define it.
> But it's not how our society defines it. It's part
> of how our society defines it, but only part.
> I define it that way because I believe intelligence
> _is_ a reward maximizing process.
Words like "intelligence" are labels for concepts.
We learn concepts or categories by example. Somehow
we can extract the concept and connect it to the
word. What is gred? I give a set of examples. You
work out what is common to those examples as the
concept gred. If you consistently assign gred to
objects with that concept then I assume we mean the
same thing by gred.
> Most people don't believe (or don't even understand
> what this is), so they tend to see intelligence in
> a far broader sense.
People have no trouble understanding the concept of
"maximizing a reward". Or that you label it with the
word "intelligence". There is nothing to understand
except your eccentric use of the word. Just say
what you mean, "maximizing reward" and don't confuse
the issue with personal definitions of common words.
> Most people include in their idea of intelligence, the
> quality, and content of _what_ has been learned. What
> someone has learned, is a good rough indication of
> their ability to learn, but it's only that - a rough
> indication. It would also suggest that a person who
> had 30 years of education was more intelligent than
> someone with only 10 years of education, even though
> both of them had exactly the same powers to learn.
No it wouldn't suggest that to most people. It would
simply suggest they had learned more about something
than the other person. Intelligence for most of us
means being able to work out the truth of something,
that is the ability to show how things are.
> I like to use the word "smart" or "smarts" to talk
> about what the machine (or human) has learned, and
> use the word "intelligence" to talk about it's raw
> power to learn (aka the quality and power of it's
> learning algorithm).
Learn what? How to maximize its reward signal even
if that amounts to being a drug addict?
> I like to separate the power to learn, from what has
> been learned, just like we think about the power of
> a computer, totally separate, from the power of what
> ever software is currently loaded in the machine.
And there is nothing wrong with that, except that I don't
see "the power to maximise a reward signal" as being as
intelligent as being able to accurately understand the
world and ourselves regardless of how rewarding the
experience of this understanding is.
I was watching a science show about Australia's weather
patterns and apparently they are now relying on physics,
models of how weather works, rather than statistics
from past weather patterns. It made me think about how
TD-Gammon gives values to a state rather than working it
out from a model of how backgammon works to compute a
good response.
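If it helps, here is a toy sketch of the contrast (the table, the model and the states are all stand-ins, not real backgammon code): the model-free learner looks up a value it has associated with the current state from past play, while the model-based one predicts the next state with a model and evaluates that.

value_table = {"state_A": 0.62, "state_B": 0.35}   # values learned from past play

def model_free_value(state):
    # Look up what this state was worth the last times we saw it.
    return value_table.get(state, 0.0)

def transition_model(state, action):
    # Assumed model of how the world changes; a stand-in, not real dynamics.
    return "state_B" if action == "risky" else "state_A"

def model_based_value(state, actions=("safe", "risky")):
    # One-step lookahead using the model instead of remembered outcomes.
    return max(model_free_value(transition_model(state, a)) for a in actions)

print(model_free_value("state_A"), model_based_value("state_A"))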
The real world is always changing (unlike the world
of games like tic tac toe or back gammon or chess)
and past behaviors may not be as predictive as using
current data and a model of how it works to predict
what will happen next.
Although we can learn to recognize a similar state in
a game that led to a win in the past it is not how we
actually learn the games to start with. We do not have
to play a million games to build up a set of weights.
We not only recognize the center square as a good move
in tic tac toe we understand why it is a good move and
can do so from pure logic if smart enough.
JC
That depends on how you define "intelligence". I don't think intelligent
agents need goals that can change. At least, it seems pretty clear that
goals such as "making babies" or "increasing entropy" have led to the
intelligence we see on the planet today - and I don't see any reason why
they can't take things much, much further.
...but yes, if you define intelligence in a particular way, I can
imagine how having a fixed goal might seem like a limitation. However,
to me this does not seem like a practical limitation. Agents with
a fixed goal are quite good enough to build planetary scale civilisation,
master superintelligence and nanotechnology and so on.
>> The key to the problem is thought to be making the agent not *want* to
>> only seek pleasure in the first place.
>
> Well, that statement is an oxymoron even though you don't seem to
> understand that.
>
> "pleasure" is "what it wants"! That's the real definition of pleasure (or
> positive reward). What you have basically just tried to say, is that the
> key to the problem is making the agent not want what it wants.
For an agent who wants to collect gold atoms, there is a conventional
distinction between "what it wants" and pleasure - in that what it
wants is an external state, while pleasure is in its mind.
> Intelligence is the power to adapt to change. But to build a machine that
> can change its behavior in response to a changing environment, we must give
> it a system for evaluating the worth of everything. It must have the power
> to evaluate the worth of actions, the worth of different stimulus signals,
> the worth of different configurations of the environment - EVERYTHING must
> have a value that maps back to a _single_ dimension of worth so that at all
> times, the hardware can make action decisions based on which action is
> expected to produce the most value. The only way to get around this need
> for a single dimension of evaluation, is to take decisions away from it -
> to hard-code the selection of actions at some level - in which case you
> have taken away some of its intelligence.
This is based on your definition of intelligence, which seems like a
pretty odd one to me.
> Learning new values (aka changing the value estimator) is as important to
> strong AI as changing the way it reacts to the environment. You can't
> disable that function and still have it be as intelligent.
Animals pretty much all have one set of values - they value their
inclusive fitness, or act as if they do. Their values are not something
that they learn, they are wired-in by nature.
*Proximate* values are different. You have to be able to change them -
but *ultimate* values can remain fixed through the entire life of
highly intelligent agents with no problem whatsoever.
>> However, if it starts out thinking its goal in life is something
>> different, then I do not see why self-knowledge about its own operation
>> would change its mind. Rather the opposite - once it understands that it
>> was built to collect gold, then it will come to view wireheading as a
>> terrible way to *avoid* meeting its goals.
>
> Yes, but that's only the "stuck on a local maximum" position. It's learned
> that _one_ good way to get rewards is to collect gold. But if it ever
> really understood what its true goal was (maximize its reward signal),
> then it would learn there's a far, far better way to get rewards that has
> nothing to do with collecting gold.
That *assumes* that its true goal is to maximize its reward signal. Which
is what is in doubt. If its true goal is to collect gold atoms, as I claim,
then it won't wirehead itself.
> The only way to make it "want" to collect gold, is to build hardware into
> the machine that evaluates "gold collecting" as valuable - that produces a
> reward signal for it. This is no different than building a machine that
> spits out M&M rewards for hitting a button. The intelligent ape learns
> that "button pushing" is its "goal in life" becuase that's what works to
> get rewards. But if the intelligent ape figured out that breaking the box
> open with a rock gave it not 1 M&M, but 10,000 of them, he would very
> quickly learn that "smashing with a rock" is far better behavior than
> "button pushing".
Right. So, in the analogy, we have to imagine an ape who wants to push
the button - instead of wanting the M&Ms. With an ape, that might be
tricky - because they are wired by nature to prefer M&Ms - but with a
machine intelligence, we can use brain surgery and intelligent design
to make them want whatever we like.
>> Even if you don't agree with this, you should agree that we can build
>> agents to believe whatever nonsense we like.
>
> But can we do that, while at the same time making it intelligent? That's
> the question.
Some crazy beliefs might compromise the agent's ability to function.
However, just the belief that you have some specified goal in life -
that seems relatively benign. Say your goal in life is to conquer
the universe with your offspring. Would that belief handicap you in
developing high intelligence? Not noticeably, I reckon.
>>>> With a sufficiently-good correlation between its reward signal
>>>> and its actual intended goal, it might conclude that it's apparent
>>>> goal was its real one. In such a case it wouldn't subsequently
>>>> wirehead itself.
>>> That's all fine unless it understood exactly what it was. That would
>>> override all attempts to hide the truth from it which is what you are
>>> suggesting here.
>> Don't really agree - but we have covered this already. There is no
>> danger of a gold collecting agent finding out that it really was built to
>> collect gold.
>
> Well, my belief is that you can't build that. You say you can, but you
> offer no proof that such a thing is possible. My belief is that it's
> impossible to build such an agent without _first_ building a reward
> maximizing machine, and then attaching a "reward for gold" box to it so it
> drops gold into it in order to get rewards.
>
> We can't resolve this unless we have _working_ AI hardware. That's what's
> keeping us from knowing who is right.
This issue is a difficult one. An empirical demonstration would be the
most convincing one.
I am not proposing building an agent which doesn't have a reward counter.
Rather one that has such a counter - but *also* has the belief that its
goal is not maximising the counter - and that the counter is an
implementation detail - and one that should be replaced if it interferes
with its real goal.
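A minimal sketch of that kind of agent, with illustrative names only: it has a reward counter, but plans are ranked by its belief about external gold atoms, and the counter is treated as a fallible instrument to be discounted when it disagrees with the rest of the evidence.

class GoldCollector:
    def __init__(self):
        self.reward_counter = 0           # implementation detail
        self.believed_gold_atoms = 0      # what it thinks it was built for

    def observe(self, counter_reading, sensor_estimate_of_gold):
        self.reward_counter = counter_reading
        # Trust the world-model estimate, not the counter, when they diverge.
        self.believed_gold_atoms = sensor_estimate_of_gold

    def evaluate_plan(self, predicted_gold_atoms, predicted_counter):
        # Plans are ranked by predicted gold atoms; a wireheading plan that
        # inflates the counter but collects no gold scores zero here.
        return predicted_gold_atoms

agent = GoldCollector()
agent.observe(counter_reading=10**9, sensor_estimate_of_gold=5)
print(agent.believed_gold_atoms)                                              # 5
print(agent.evaluate_plan(predicted_gold_atoms=0, predicted_counter=10**9))   # 0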
>> The problem is if a gold collecting agent finds out that
>> it really was built to maximise its own pleasure. However, that seems
>> like assuming what you are trying to prove to me.
>
> My argument rests on what I believe intelligence is. I think it's an
> oxymoron to suggest you can have a machine that is both intelligent, but is
> NOT seeking to maximize a single dimension reward signal. I think it's
> impossible to build an intelligent machine with the goal of collecting
> gold, but NOT have a higher goal of maximizing a single dimension reward
> signal.
>
> How does your gold seeking robot make the decision to turn right, or to
> turn left? Which action is likely to get it more gold? How does it
> evaluate complex decisions like this, based on past experience, without
> using an internal representation of how much gold each action is expected
> to produce?
>
> And if it's using an internal representation of how much gold each action
> is expected to produce, how is it possible to suggest that the machine's
> real goal, is not to maximize that internal representation?
The idea is that the structure of the machine's beliefs determines its
goal - and that those can be set up however the designer likes.
> In other words, what it would learn (as it learned what it was and how it
> worked), is that the "gold" it was really trying to maximize was not the
> shiny material outside, but instead the internal representation of how
> much gold it had collected. So the "real gold" it was trying to maximize
> was that internal "gold signal". Once it figured that out, it would modify
> its understanding of what its true goal was. It would no longer see its
> goal as "collect those shiny yellow rocks"; instead, it would see its
> true goal as "maximize the internal gold signal". So it would simply
> transfer its understanding of "gold" from the shiny metal stuff to that
> internal "gold signal". Its goal would still be to collect as much "gold"
> as possible, but the word "gold" would now be a reference to that internal
> gold measure signal. What the AI would have learned is the truth about
> what gold really was - the truth about what it was really built to do.
Yet the *truth* of what the AI was really built to do is that it was made
to collect gold atoms. That is what its makers wanted it to do! The
type of machine you are talking about is one with a serious design fault.
You argue that the fault is inevitable - but there are proposals to fix it.
You say these will make an inflexible machine with limited intelligence -
but the limit doesn't look like much of a limit to me. Rather the reverse.
It is the machine that is prone to shorting out its own pleasure centres
and spending all day navel-gazing that seems to me to have the more limited
intelligence.
Even if "intelligence is the power to adapt to change", I don't see how
a wirehead qualifies.
>> Personally I think drugs are a pretty good example. They are good enough
>> to reproduce the wirehead problem in some individuals.
>
> Yes, it's a good example I think. But I think it's only the tip of the
> iceberg of how real the problem can become. What if we could install a
> button on our heads which would allow us to get high simply by pushing the
> button? This would in effect create a free unlimited supply of drugs. If
> expensive drugs are a problem, think of how much of a problem free
> unlimited drugs would be!
Probably more of a problem. Humans have poor defenses against wireheading.
It is not really a problem that evolution needed to equip them to solve.
Intelligent machines can be made differently.
>>>> I think that we *ought* to define reinforcement-learning systems as
>>>> systems that learn using one or more scalar reward signals.
>>> You can't. In the end, the signals MUST be combined into one using
>>> some function.
>>>
>>> That's because in the end, we must make ONE decision, not 10. [...]
>>>
>>> In the end, the system can't make decisions if it can't reduce the
>>> reward down to a single signal.
>> This is a big digression into territory that is mostly irrelevant, IMO.
>>
>> Animals have multiple reward signals. They are not centrally combined
>> and processed to produce action - since some of them produce action via
>> spinal reflexes that never go near the brain.
>>
>> This seems like a simple fact to me. Anyway, it doesn't matter - one
>> reward signal or many, the issue of the wirehead problem is much the
>> same.
>
> It's logically impossible for one decision, to be controlled by two reward
> signals, without first combining them into one reward signal (one measure
> of value). Only if the separate reward signals controlled separate
> decisions, is such a thing possible. All decisions relating to any one low
> level behavior (like the control of a single muscle) must be controlled by
> one, and only one, reward signal.
Right. But we have multiple muscles - so there can be multiple reward signals.
> If the values of all decisions are not combined down to a single measure of worth,
> then there is no central command and control.
There *is* a central command and control (the brain); it is just that it
doesn't deal with everything. Some things are dealt with locally -
by spinal reflexes. That is for reasons associated with
communications delays and efficiency.
I think you should drop - or at least severely reconsider and rephrase - this
material about reward being "combined down to a single measure of worth".
As it stands, what you are saying is not technically correct.
>>> Reward based learning systems have an inherent absolute goal of
>>> maximizing reward based on its innate hardware. That is its one and
>>> only prime goal. You can't compare it to any other goal, because it
>>> has no other goal.
>> What the hardware is doing is one thing, and what beliefs the agents hold
>> about their goals is another.
>
> Right, what you call "beliefs" are nothing more than _learned_ behaviors.
> And they were learned, BECAUSE of the true goal of the machine - whether it
> understands it or not.
Animals can have beliefs wired-in by evolution. Machines can have
beliefs wired-in by their programmers. There is no god-given rule
that says that all beliefs must be learned.
> The agent was trained to talk about its goals by the society it lives in -
> which developed a way of talking about behavior in terms of the "goals"
> the behavior indicated. Talking about goals is simply a way of talking
> about the behavior. And all we know about our own behavior is what we
> _see_ ourselves doing. We don't know what our goals are ahead of time. We
> retroactively learn to describe them only _after_ we watch what we do.
Well the idea I am discussing is that agents have beliefs wired into them
before they are born. Humans believe symmetrical things are nicer, that
language has a certain structure, that very-hot things are nasty - and so
on. They don't learn this stuff - it is built into them by nature.
It is part of the way they are wired. Machine intelligence will be the same.
>> Hopefully I have explained that. You seem to dismiss the agent's model
>> of its goal system as an irrelevance. Yet that is what self-improving
>> agents are likely to use to ensure that they do not trash their own
>> objectives as they make changes to themselves.
>>
>> What agents think their goals are is important. Consider the behaviour
>> of a human who believes its goal is to enjoy themselves, and see what
>> happens if they convert to believing their goal is that of some religious
>> sect. An agent's model of its goals can have a big impact on its
>> actions.
>
> Yes, that's very true. We are very complex machines that have very complex
> programs for producing behavior. But what's important is that all that
> behavior, including what you are calling the "goal model", is LEARNED.
"Important" - but wrong! Think about instincts, about reflexes. Some
behaviours are not learned by the agent during its lifetime - rather
they are wired-in.
> Humans think they have goals and that the goals are the "cause" of their
> actions, but in fact, that's mostly just bullshit. It's just stories made
> up which are as silly as the God stories we used to make up. The true cause
> is simpler than that: it's the behaviors that are shown to maximize the
> reward signal.
For intelligent machines, it won't be bullshit. The agent will have
a model of its own goal system - and it will use this to ensure that
its goals are preserved as it self-modifies. This internal representation
of its goals will actually trump the utility function it uses on a
day-to-day basis sometimes.
We can imagine Deep Blue, with its 8,000 component utility function.
Obviously it doesn't necessarily want to stick to that notion of
utility as it self-improves - rather it wants to change its utility
function so that it plays better chess.
So, a self-improving version of Deep Blue will *know* that it has a
higher-level goal - that of playing better chess - and it will use
that knowledge when it comes to making certain kinds of changes to
itself.
The goal won't be some kind of retrospective rationalisation of its
actions. It will be something useful and functional. And it will
probably be something built-in by its original programmers.
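A minimal sketch of that arrangement - the play-off here is faked with random numbers, purely to show the shape of the idea: a change to the day-to-day utility function is only accepted if it serves the fixed higher-level goal of winning games, judged by something other than the candidate function's own opinion of itself.

import random

def play_match(eval_old, eval_new, games=100):
    # Stand-in for a real play-off between the two evaluation functions;
    # a real system would play `games` games of chess and count wins.
    return sum(random.random() < 0.5 for _ in range(games)) / games

def maybe_self_modify(current_eval, candidate_eval):
    # The fixed higher-level goal ("play better chess") is consulted,
    # not the candidate utility function's opinion of itself.
    win_rate = play_match(current_eval, candidate_eval)
    return candidate_eval if win_rate > 0.55 else current_eval

chosen = maybe_self_modify(current_eval="eval_v1", candidate_eval="eval_v2")
print(chosen)   # the candidate is adopted only if it actually wins more games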
>> If evolution is smart enough to avoid the wirehead problem, why
>> can't we capture its wisdom in an algorithm, and use that to
>> make systems that avoid wireheading?
>
> It's because AI (and human intelligence) is a simulation. It's an
> information processing machine that is forced, by the nature of the
> universe, to exist in the same universe which acts as the environment of
> the learning agent. That is, the system that controls the actions of the
> agent, is part of the environment the agent is allowed to manipulate.
>
> Evolution, on the other hand, is a process that is created by forces beyond
> the control of the matter that is part of the environment. That is, we as
> the product of evolution, have no power to reach outside the universe, and
> change the rules that control how the universe decides what survives, and
> what doesn't survive. There's no way to escape, or control, death, which
> is the "reward" signal of evolution.
A complete digression, but I think I debate this on:
http://alife.co.uk/essays/self_directed_evolution/
Breeders do have considerable control over what lives and what dies -
and as a result of this control it is possible to imagine the entire
evolutionary process doing something analogous to wireheading.
> We can duplicate this power by limiting the environment of the AI so that
> it can't access, and modify, its own reward signal. An AI running in a
> video game which has no connection to the outside world (which includes no
> humans playing the game) means the AI has no way to manipulate the
> computer it's running on. That would solve the wirehead problem for that
> AI. The AI running in the limited environment of the simulated world
> could not wirehead itself, and as such, would not suffer from the wirehead
> problem.
I see, I think. You think that we could "copy evolution's algorithm" to
make a non-wireheading superintelligent agent - but then we could never
communicate with it.
>>> But maybe, there's a way to turn off learning in parts of the network
>>> to lock in some learned beliefs. So if you can condition the AI to
>>> have some set of beliefs that are important to his survival, and then
>>> lock them in, by turning some of his ability to learn off, you have in
>>> effect created a complex hard-wired reward system in him that he can't
>>> change.
>>
>> That is one possibility, yes. It sounds like the kind of thing that
>> would appeal to someone using a connectionist approach. However, there
>> may be other approaches - perhaps based more on engineering.
>
> Yes, that's where we need working implementations to see if there are other
> options. Without working hardware, we can't take this much further
> because it all comes down to depending on what can actually be built.
There is certainly no shortage of existing projects using different
approaches. They often have some success.
One problem with connectionist approaches is that they often produce
tangled, incomprehensible messes. Not ideal if you want to identify
particular beliefs and amplify or fix them.
>> [...] How much does it handicap your intelligence to
>> believe you have a definite goal? I don't see how it handicaps you at
>> all, really.
>
> Well, that's probably because you don't grasp how much our "beliefs" are
> actually just more of the same - learned behaviors. I strongly suspect
> (but have absolutely no way to prove) that the only way to build something
> like human intelligence is to mix all our behaviors in one large
> (connectionist-like), holographic-like memory recall system. As such, you
> can't make some of the behaviors fixed (aka non-learning), and others
> variable - free to be changed by learning.
So, nothing like instincts or reflexes will be possible?
I don't see why not.
> In other words, we fix the behaviors that cause us to say things to
> ourselves like "my goal is to eat eggs", and we fix the behaviors that make
> our arms respond to that past event by pulling eggs out of the fridge, but
> yet we somehow magically manage to allow other behaviors of speaking to
> ourselves to freely change by learning (so we can talk to ourselves about
> our newly learned sub-goals - such as "I should go to xyz to get eggs
> because they are cheaper there"), and allow other arm behaviors to be
> freely adjusted by learning? It strikes me that this won't be very
> practical. Either what we say is under the control of learning, which
> means it can change, or it's not under the control of learning, which means
> what we say is no longer intelligent.
That is the proposal, yes. Fix some things (the goals). Allow other things
to be learned. That is more-or-less how animals work. Some things are
built in by nature (instincts). Other things are plastic, flexible and
adaptable (learned behaviour).
> My views of course are founded on my beliefs about how AI is going to be
> implemented. If I'm wrong, or if there are other ways to structure true
> intelligence, then there is always the possibility of alternate engineering
> solutions.
With a connectionist approach, I think there are still possibilities.
Again, part of the reason for thinking this is non-wireheading humans -
who express wirehead-avoidance motivations and opinions - who really
believe that they are not born to wirehead. Maybe those guys are
kidding themselves, or lying - but what if we take them seriously?
Don't they have a self-fulfilling prophecy on their hands? If they
really believe they won't wirehead themselves, isn't that something
that will help to make their belief come true, if they succeed in
acting in accordance with it?
> The other thought I've had writing these messages, is that it seems to me
> there must be an engineering solution to booby-trap the brain of an AI to
> prevent it from being modified. That is, any attempt to get to the
> hardware, will cause lots of pain, caused by the total destruction of the
> hardware (cause it to burn up and melt down or something like that).
>
> The defense systems don't have to prevent someone from getting to the
> brain, but instead, just cause it to self destruct on any such attempt.
> Maybe just build it so the entire memory and all learned behaviors are
> simply lost if any attempt to get to the brain and modify it is made. In
> this way, there's no advantage to trying to modify the brain, because it
> won't lead to getting higher rewards - death will come before the first
> reward is received. This in effect puts the AI's own brain out of the
> environment the AI is able to change. So if it doesn't know a way to get
> around this limitation, it won't see a path to getting higher rewards by
> trying such a trick, and as such, will never be tempted by the wirehead
> problem. It would all be a question of keeping the defense systems more
> advanced than what the AI had the power to disable. As the AIs get
> smarter, and their technology more advanced, they would just have to keep
> reengineering their own brains to make it harder to get around the protection
> systems.
This seems like the idea of using a community to keep each other in check.
It might work - but it would have some serious costs. If there's another
solution, which neatly avoids the whole problem, then we should probably
go with that - rather than wiring the agents' brains with high explosives.
I wasn't talking about people here in c.a.p., I was making a reference to
all people on the earth in general. People here for the most part don't
have any problem understanding these things.
> There is nothing to understand
> except your eccentric use of the word. Just say
> what you mean, "maximizing reward" and don't confuse
> the issue with personal definitions of common words.
The definition of the word "intelligence" is at the very core of the AI
debate, John. One might argue that it IS the entire AI debate.
You started this by asking questions about "what was intelligence" in YOUR
post. I simply responded to your question by telling you what I thought
was intelligence (something you should well understand by now about me and
not be confused by it).
Now you are accusing me of "confusing the issue"?
> > Most people include in their idea of intelligence, the
> > quality, and content of _what_ has been learned. What
> > someone has learned, is a good rough indication of
> > their ability to learn, but it's only that - a rough
> > indication. It would also suggest that a person who
> > had 30 years of education was more intelligent than
> > someone with only 10 years of education, even though
> > both of them had exactly the same powers to learn.
>
> No it wouldn't suggest that to most people. It would
> simply suggest they had learned more about something
> than the other person. Intelligence for most of us
> means being able to work out the truth of something,
> that is the ability to show how things are.
Yes, the ability to reason is one of the common things that get lumped
under the term of intelligence. It's the first item listed in the
wikipedia article for example, whereas learning is the last one listed.
But babies and young people don't have the power to reason like a trained
scientist, or philosopher can. Our powers to reason are not innate. They
are learned. It's one of the many things we learn when we go to school, and
it's what we learn by playing with the world - by interacting with it. So
in fact our power to "work out the truth" is a learned behavior. So if you
are calling that behavior a sign of our intelligence, you are doing exactly
what I suggested people do - that is they talk about _what_ we have learned
as being our intelligence, instead of talking about our power to learn.
> > I like to use the word "smart" or "smarts" to talk
> > about what the machine (or human) has learned, and
> > use the word "intelligence" to talk about it's raw
> > power to learn (aka the quality and power of it's
> > learning algorithm).
>
> Learn what? How to maximize its reward signal even
> if that amounts to being a drug addict?
Yes. However, I've never seen a drug addict that has done a very good job
of maximizing their reward signal.
> > I like to separate the power to learn, from what has
> > been learned, just like we think about the power of
> > a computer, totally separate, from the power of what
> > ever software is currently loaded in the machine.
>
> And there is nothing wrong with that except "the power
> to maximise a reward signal" I don't see as being as
> intelligent as being able to accurately understand the
> world and ourselves regardless of how rewarding the
> experience of this understanding is.
Sure, using the word intelligence like that is totally consistent with how
it's used in society by lay people. But using the word intelligence like
that also totally hides the truth about what the brain is doing, and only
serves to further confuse a search for the solution to AI (assuming you
think the search for AI is a search for a solution to build "intelligent"
machines).
I've seen you do this in the past, and it's not too impressive. That is,
you argue technical AI issues by using lay people's word definitions as
if they were somehow the truth of what the brain is doing. That doesn't
indicate to me that you are very good at working out the truth.
> I was watching a science show about Australia's weather
> patterns and apparently they are now relying on physics,
> models of how weather works, rather than statistics
> from past weather patterns. It made me think about how
> TD-Gammon gives values to a state rather than working it
> out from a model of how backgammon works to compute a
> good response.
Yes, that sounds similar.
> The real world is always changing (unlike the world
> of games like tic tac toe or back gammon or chess)
> and past behaviors may not be as predictive as using
> current data and a model of how it works to predict
> what will happen next.
>
> Although we can learn to recognize a similar state in
> a game that led to a win in the past it is not how we
> actually learn the games to start with. We do not have
> to play million games to build up a set of weights.
> We not only recognize the center square as a good move
> in tic tac toe we understand why it is a good move and
> can do so from pure logic if smart enough.
>
> JC
What you are talking about is how we have learned to use language to
regulate our behavior. We "talk to ourselves" about these things such as
"the value of the center square" and that talk then controls our actions.
Your argument is suggesting that our ability to reason using language
behavior is innate, and that it is the key to how we "learn to play a
game".
But we are not born with these advanced language behaviors which allow us
to reason about playing some new game, so where did those language
behaviors come from? How did we learn those?
Trying to hard code the ability to reason was one of the first things tried
in AI, and it's not gotten very far. Are you suggesting we need to do more
of that work because that is the key to our intelligence?
My argument is the same as it's always been. All intelligent behavior,
including all our ability to reason, has to be learned by a very low level
reinforcement trained learning machine. Our ability to reason is not an
innate part of our intelligence, it's yet another learned behavior.
I have to run Edit/Replace to find "intelligence"
and replace it with "maximize reward signal" in your posts
to make them easier to read, as "intelligence" has a
common usage, and then try to figure out when you
are talking about intelligent (other meaning) behaviors
that are not "maximizing reward signal", although they
may be the result of such a process, or innate.
> My argument is the same as it's always been. All
> intelligent behavior, including all our ability to
> reason, has to be learned by a very low level
> reinforcement trained learning machine. Our ability
> to reason is not an innate part of our intelligence,
> it's yet another learned behavior.
Ok imagine this scenario because I assume it is what
you are suggesting. We have just spent a billion
dollars building a robotic body that duplicates our
sensory inputs and motor outputs. They all generate
pulses or are controlled by pulses to match the deluxe
super strong generic Welchian reinforcement learning
network that has been installed as its brain.
Our robot is lying there on the table generating trial
movements, that is, it is twitching away as its eyelids
flicker randomly, its limbs going twitch, twitch, twitch, as
at this stage it has not been rewarded to convert those
random twitches into nice movements of any kind. It
makes horrible screams, whining noises, coughs, ahhs,
waiting for the Welchian reward system to reward one
of those trial behaviors.
So how does your reward system know when to reward a
behavior? What particular action will you use to generate
the reward signal? It is easy in back gammon just send
a reward signal if a win state occurs and in a random
game of back gammon or tic tac toe we can show that
WILL occur. A single state a single signal.
So what WILL occur in your twitching jerking robotic
body to generate a reward signal that will be maximized
by the Welchian RL network? In practice what micro
behaviors will result in this reward signal?
JC
> So how does your reward system know when to reward a
> behavior? What particular action will you use to generate
> the reward signal? It is easy in back gammon just send
> a reward signal if a win state occurs and in a random
> game of back gammon or tic tac toe we can show that
> WILL occur. A single state a single signal.
>
> So what WILL occur in your twitching jerking robotic
> body to generate a reward signal that will be maximized
> by the Welchian RL network? In practice what micro
> behaviors will result in this reward signal?
As I recall, orgasm, certain tastes, a full belly, warmth, and
gentle touch are wired to the positive terminal - and the body's
myriad sensors for pain and discomfort are attached to the
negative one.
Well, the highest level goals will be fixed, but subgoals clearly do
constantly change. The question is how and where we draw the line
between what's fixed and what can change. I tend to argue for drawing the
line much lower than most people do (that is, having only the very simplest
goals hard-wired and everything else adjustable by learning).
> At least, it seems pretty clear that
> goals such as "making babies" or "increasing entropy" have led to the
> intelligence we see on the planet today - and I don't see any reason why
> they can't take things much, much further.
>
> ...but yes, if you define intelligence in a particular way, I can
> imagine how having a fixed goal might seem like a limitation. However,
> to me this does not seem like a practical limitation. Agents with
> a fixed goal are quite good enough to build planetary scale civilisation,
> master superintelligence and nanotechnology and so on.
Oh, I don't have any issue with the concept of fixed goals. The goal of
maximizing a reward signal is clearly a very fixed goal. I have issues with
the implementation details of some of the types of goals you are suggesting
be fixed. It's the same issue I have with the I Robot laws. I believe
they are too abstract and vague to be implemented. I think the idea of
"don't wirehead yourself" is likewise a goal too vague to be implemented if
we also want to maintain some reasonable degree of general intelligence in
the machine.
In order to implement a fixed goal like that, we have to hard-wire a
detector circuit into the machine that can evaluate any action of the
machine and report if that action is a violation of the goal. Do you think
it's actually possible to hard-wire a detector to detect when the robot is
"trying to wirehead itself"?
The point is, there are an unlimited number of ways to wirehead yourself.
We could hard-wire a detector to see if the robot was using its hands to
open its head up and prevent it from doing that - to motivate it not to do
that action. But what if it used its hands to type some code on a keyboard in
order to program a machine to do the work for it? How do we write code to
prevent that from happening?
Maybe we can first teach it the concept of "wirehead", and then tap into
its own brain to detect when it was having thoughts of "wireheading itself",
and then hard-wire that to a negative motivation system. But what if it
then, with its general learning powers, learns a new concept it calls
pleasure maximizing which forms in a different part of the brain, and as
such, our circuit doesn't detect it, or prevent it? And what if, using this new
concept, the robot goes off and wireheads itself, without ever thinking of
what it is doing as "wireheading"?
The problem here is that the concept seems to me to be too high level and
too abstract to implement. The intelligence of the rest of the
machine will probably always find a way around any hard-wired attempt we
would make to try and force it not to think about wireheading itself.
You are just taking the "we don't know the details, so let's just assume it
works the way I want it to work" position. The few details I think I do
understand indicate to me that it's not as easy as you seem to suggest,
and that it might not be possible at all to hard-wire such high level concepts
and still give the machine enough flexibility in its power of learning to call
it intelligent.
I think our intelligent learning powers extend down to a very low level and
only below that level can we hard-wire motivations. Everything above
that level is learned. If you want to hard wire a concept that's learned
at a very high level (which is what I believe we are talking about here),
you have to disable all learning from that level down, in order to fix the
behavior/desire into the machine, but by doing that, I think you will have
ended up putting very severe limitations on what else it can learn. I
think the limitations will be so severe, that the machine won't look
intelligent anymore. That is, it will maintain everything it had already
learned, but won't be able to learn much else.
> >> The key to the problem is thought to be making the agent not *want* to
> >> only seek pleasure in the first place.
> >
> > Well, that statement is an oxymoron even though you don't seem to
> > understanding that.
> >
> > "pleasure" is "what it wants"! That's the real definition of pleasure
> > (or positive reward). What you have basically just tried to say, is
> > that the key to the problem is making the agent not want what it wants.
>
> For an agent who wants to collect gold atoms, there is a conventional
> distinction between "what it wants" and pleasure - in that what it
> wants is an external state, while pleasure is in it's mind.
Nope. It must _all_ be "in its mind" or else it can't work. Just because
you think your understanding of "wanting gold" is something external to
you, doesn't mean it is. There must be hardware in the AI for detecting
gold, and that hardware must be configured to drive what the system wants.
There's just no other way to implement it. The gold outside the robot
doesn't have some magic property that causes the fingers of the robot to
reach for it. If the robot reaches for the gold, it's because something
_inside_ the robot detected that there was gold in the environment,
and moving the arms in that way would "get it".
The fact that it's an external state is not important because before the
machine can seek out that external state, it has to first internalize the
state. That is, it has to have some internal representation of the
external state, and it's that internal representation that makes the
robot's arms and legs move in response to the external state.
As a simple example, say we have a robot that wants to be in the dark.
Like your gold bot, this bot is a dark bot. It wants the external state to
be dark. In order to build such a machine, we have to give it light
sensors. Those light sensors tell it the level of darkness, and allow the
hardware to learn how to manipulate the environment to create darkness.
But if we look at the machine, we see we will have built a machine
that "wants" the output from the light sensors to be low. Such a machine
doesn't care whether it's really dark outside, or if the output from the
light sensor is just low, because what it's really doing is attempting to
keep the light sensor output low. If the machine can find a behavior to
make the light sensor output low, it will have achieved its true goal and it
will be "happy" even if it's sitting in direct sunlight.
There simply is no way to get around this. You can't build a machine that
seeks some external state (like low light, or gold) without making the
machine represent the state internally, and then building the machine so
that it responds to that internal representation. So all machines are reacting
to internal states first, and just happen to also be reacting to some
external state, if the hardware manages to keep the internal state
correctly synchronized with some external state.
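Just to make that concrete, here is a rough Python sketch of the dark bot
(every name and number here is invented for illustration - a toy, not a
design). The only thing the control loop can ever evaluate is the internal
sensor reading, so anything that drives that reading low - including covering
its own sensor - satisfies the machine's actual goal:

    def light_sensor(world, sensor_covered):
        # The only channel from the external state into the agent.
        # Covering the sensor drives the reading low no matter how
        # bright it really is outside.
        return 0.0 if sensor_covered else world["brightness"]

    def reward(reading):
        # "Wants it to be dark" is implemented as: wants this internal
        # number to be low.
        return -reading

    def step(world, action):
        # Two candidate behaviors and what each does to the internal reading.
        if action == "move_to_shade":
            world["brightness"] = 0.2
            return light_sensor(world, sensor_covered=False)
        else:  # "cover_own_sensor"
            return light_sensor(world, sensor_covered=True)

    world = {"brightness": 1.0}
    for action in ("move_to_shade", "cover_own_sensor"):
        print(action, reward(step(dict(world), action)))
    # Both actions score well, but covering the sensor scores best,
    # even though the robot is still sitting in direct sunlight.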
> > Intelligence is the power to adapt to change. But to build a machine
> > that can change its behavior in response to a changing environment, we
> > must give it a system for evaluating the worth of everything. It must
> > have the power to evaluate the worth of actions, the worth of different
> > stimulus signals, the worth of different configurations of the
> > environment - EVERYTHING must have a value that maps back to a _single_
> > dimension of worth so that at all times, the hardware can make action
> > decisions based on which action is expected to produce the most value.
> > The only way to get around this need for a single dimension of
> > evaluation, is to take decisions away from it - to hard-code the
> > selection of actions at some level - in which case you have taken away
> > some of it's intelligence.
>
> This is based on your definition of intelligence, which seems like a
> pretty odd one to me.
Yeah, well, that's my issue with your argument as well. You believe we can
build machines that I think are impossible to build because you don't
have a good concept of how these things will be implemented. Though odd,
my definition is built on real implementation concepts, not just high level
concepts of "the machine has a goal".
But as I've said, until we have working implementations, we don't know
what's possible. We could both be right, and wrong, once someone creates a
working AI.
> > Learning new values (aka changing the value estimator) is as important
> > to strong AI as changing the way it reacts to the environment. You
> > can't disable that function and still have it be as intelligent.
>
> Most animals pretty-much all have one set of values - they value their
> inclusive fitness, or act as if they do. Their values are not something
> that they learn, they are wired-in by nature.
Sure, but here's my point. Either they are intelligent learning machines,
or they are non intelligent instinctive survival machines. Most animals
aren't very intelligent because most of their behavior is hard wired
instincts.
Sure, we should be able to build hard wired instinctive machines that never
try to wirehead themselves, but they are going to be as dumb as a cow.
You can't have it both ways - highly intelligent, but yet hard wired
instincts that will prevent it from _wanting_ to wirehead itself.
> *Proximate* values are different. You have to be able to change them -
> but *ultimate* values can remain fixed through the entire life of
> highly intelligent agents with no problem whatsoever.
Yes, I agree. But what I think you are wrong about, is what a typical
"ultimate value" really is in something as intelligent as a human. I think
it's far lower level than you realize, and I think human "ultimate values"
are far easier to change than you probably realize. As long as our
environment remains fairly constant, our values tend to remain fairly
constant. But make a big change to the environment, and watch out - human
values will change in an instant (like if civilization fell apart due to
some large natural disaster and suddenly we were fighting our neighbors
just to stay alive). Or if you put a human into a very unnatural situation
like being abducted by aliens that look like lizards and held prisoner and
tortured by them. Or being forced to go to war and kill and torture other
humans. Even things that once seemed to be "ultimate" values for a human
are likely to be changed under such a large change in the environment. And
it's all because our intelligence allows for great adaptability even in the
things you might think of as our "ultimate values".
What doesn't change, is our very lowest level reward systems - the things
that cause us pain and pleasure - like hunger, or having our body damaged.
Those are innate hard wired goals (aka the goal to prevent those things).
Everything above that, is free to change per whatever the current
environment requires.
> >> However, if it starts out thinking its goal in life is something
> >> different, then I do not see why self-knowledge about its own
> >> operation would change its mind. Rather the opposite - once it
> >> understands that it was built to collect gold, then it will come to
> >> view wireheading as a terrible way to *avoid* meeting its goals.
> >
> > Yes, but that's only the "stuck on a local maxima" position. It's
> > learned that _one_ good way to get rewards, is to collect gold. But if
> > it ever really understood what it's true goal was (maximize it's reward
> > signal), then it would learn there's a far far better way to get
> > rewards that has nothign to do with collecting gold.
>
> That *assumes* that it's true goal is to maximize it's reward signal.
> Which is what is in doubt. If its true goal is to collect gold atoms, as
> I claim then it won't wirehead itself.
Well, you have to prove that a machine can have such a goal and still have
a reasonable level of intelligence before I will accept that argument. So
far, the only proof you have put forward is you saying "it does because I
say it does".
Humans tend to see themselves much like how you like to describe an AI.
That is, they see us "having desires", or "having goals", with no clue
what's happening inside to make this happen. You then take this lack of
knowledge about what "having a goal" really means in implementation
details, and just assume that it's fine for an AI to "have" any "goal" we
want it to have.
> > The only way to make it "want" to collect gold, is to build hardware
> > into the machine that evaluates "gold collecting" as valuable - that
> > produces a reward signal for it. This is no different than building a
> > machine that spits out M&M rewards for hitting a button. The
> > intelligent ape learns that "button pushing" is its "goal in life"
> > becuase that's what works to get rewards. But if the intelligent ape
> > figured out that breaking the box open with a rock gave it not 1 M&M,
> > but 10,000 of them, he would very quickly learn that "smashing with a
> > rock" is far better behavior than "button pushing".
>
> Right. So, in the analogy, we have to imagine an ape who wants to push
> the button - instead of wanting the M&Ms. With an ape, that might be
> tricky - because they are wired by nature to prefer M&Ms - but with a
> machine intelligence, we can use brain surgery and intelligent design
> to make them want whatever we like.
Yes, but that's where your argument ends. You don't take it a step
further and explain just _how_ we will make it want that. You just assume
we can without trying to think about what must happen inside the machine
for that to be true. My argument is based on the idea that there is only
one way to create such a "want" in a highly intelligent machine, and that
requires we first internalize the external state with an internal measure
of success, or want, or desire, or reward. Call it what you want, the
hardware has to have this internal representation of the external event we
want the machine to want. And in doing that, the machine's real goal, is
to "want" that internal state.
And at the same time, our hard-wired detector can't adapt to changing
environments. So whatever it was built to detect is the only thing it can
detect. A "gold" detector for example that we hard-wire might seem to work
fine at first, but then we find that it fails to detect gold dust because
it was only really built to recognize gold rocks. Or we find it's
producing false positives for fool's gold because we never saw, or tested, our
detector on fool's gold. But since it's not free to adapt by learning, the
detector just fails to work right, and now our robot is going crazy
collecting fool's gold, and leaving all the real gold dust uncollected.
> >> Even if you don't agree with this, you should agree that we can build
> >> agents to believe whatever nonsense we like.
> >
> > But can we do that, while at the same time making it intelligent?
> > That's that question.
>
> Some crazy beilefs might compromise the agent's ability to function.
>
> However, just the belief that you have some specified goal in life -
> that seems relatively benign. Say your goal in life is to conquer
> the universe with your offspring. Would that belief handicap you in
> developing high intelligence? Not noticably, I reckon.
No, I don't think an odd belief is likely to hinder the development of
complex behaviors in an intelligent agent. The issue is if we are trying
to prevent the agent from doing something very abstract like wirehead
itself by trying to hard-code a belief of "don't wirehead yourself" into
the machine.
> >>>> With a sufficiently-good correlation between its reward signal
> >>>> and its actual intended goal, it might conclude that it's apparent
> >>>> goal was its real one. In such a case it wouldn't subsequently
> >>>> wirehead itself.
> >>> That's all fine unless it understood exactly what it was. That would
> >>> override all attempts to hide the truth from it which is what you are
> >>> suggesting here.
> >> Don't really agree - but we have covered this already. There is no
> >> danger of a gold collecting agent finding out that it really was built
> >> to collect gold.
> >
> > Well, my belief is that you can't build that. You say you can, but you
> > offer no proof that such a thing is possible. My belief is that it's
> > impossible to build such an agent without _first_ building a reward
> > maximizing machine, and then attaching a "reward for gold" box to it so
> > it drops gold into it in order to get rewards.
> >
> > We can't resolve this unless we have _working_ AI hardware. That's
> > what's keeping us from knowing who is right.
>
> This issue is a difficult one. An empirical demonstration would be the
> most convincing one.
>
> I am not proposing building an agent which doesn't have a reward counter.
>
> Rather one that has such a counter - but *also* has the belief that it's
> goal is not maximising the counter - and that the counter is an
> implementation detail - and one that should be replaced if it interferes
> with its real goal.
Yeah, if it can be done, that would be interesting to see.
> >> The problem is if a gold collecting agent finds out that
> >> it really was built to maximise its own pleasure. However, that seems
> >> like assuming what you are trying to prove to me.
> >
> > My argument rests on what I believe intelligence is. I think it's an
> > oxymoron to suggest you can have a machine that is both intelligent,
> > but is NOT seeking to maximize a single dimension reward signal. I
> > think it's impossible to build an intelligent machine with the goal of
> > collecting gold, but NOT have a higher goal of maximizing a single
> > dimension reward signal.
> >
> > How does your gold seeking robot make the decision to turn right, or to
> > turn left? Which action is likely to get it more gold? How does it
> > evaluate complex decisions like this, based on past experience, without
> > using an internal representation of how much gold each action is
> > expected to produce?
> >
> > And if it's using an internal representation of how much gold each
> > action is expected to produce, how is it possible to suggest that the
> > machine's real goal, is not to maximize that internal representation?
>
> The idea is that the structure of the machine's beliefs determines its
> goal - and that those can be set up however the designer likes.
Well, I think the true fixed beliefs in a typical AI (like a human) are
very very low level things - such as the pain of hunger or the pain of
damage to the body, or the pleasure of eating when we are hungry. I think
everything else you might call a goal is a learned behavior (what you are
calling Proximate values). Our values such as "don't be mean to people",
or "don't murder", or "fill the world with our offspring", or "get rich",
or "be loved", or "people before animals", or "be honest", are all
proximate values that we learn to verbalize, and then to follow as best as
possible. But most of these verbalizations of these "goals" are so
abstract that humans can't even figure out when they are following them or
not (the whole endless debate over morals).
All our real "goals" are much lower level and hidden in the true
implementation details of our brain.
> > In other words, what it would learn (as it learned what it was and how
> > it worked), is that the "gold" it was really trying to maximize, was
> > not the shiny material outside, but instead, the internal
> > representation of how much gold it had collected. So the "real gold"
> > it was trying to maximize, was that internal "gold signal". Once it
> > figured that out, it would modify it's understanding of what it's true
> > goal was. It wold no longer see it's goal as "collect those shiny
> > yellow rocks", but instead, it would see it's true goal as "maximize
> > the internal gold signal". So it would simply transfer it's understand
> > of "gold" from the shiny metal stuff, to that internal "gold signal".
> > It's goal would still be to collect as much "gold" as possible, but the
> > word "gold" would now be a reference to that internal gold measure
> > signal. What the AI would have learned, is the truth about what gold
> > really was - the truth about what it was really built to do.
>
> Yet the *truth* of what the AI was really built to do is that it was made
> to collect gold atoms. That is what its makers wanted it to do! The
> type of machine you are talking about is one with a serious design fault.
>
> You argue that the fault is inevitable - but there are proposals to fix
> it.
I've not seen you suggest any such proposals.
If you think it's not correct, why don't you explain why it's not correct
instead of just saying I'm wrong? You have so far offered no argument to
counter what I've said.
The problem here is that you don't know what intelligence is. You define
it simply in terms of what humans can do. And you assume you can create
any variation of intelligence you like by simply mixing and matching the
features of a human in any combination you would like, with no attention to
whether such combinations of features are even possible.
It's like having no knowledge of how a car works, but only understanding it
in terms of what you see. You see that it can move on its own, and that
it eats gas, and that it has an engine that makes a lot of noise. You then
concoct some argument about how the future may turn out by suggesting we
will have cars that don't have engines and don't eat gas - because you have
no clue that the gas is what makes it move. You think the wheels are what
make the car move.
Someone who doesn't understand how a car works can't make reasonable
predictions about what sort of designs are possible with cars. They are
highly likely to suggest things that are just stupid.
Likewise, I see no evidence that you have any clue how our brain works, or
how any AI we might build is actually going to work.
I've got some very specific ideas about what AI is. You don't find my
argument compelling (which is fine), and yet you also don't offer any
alternative or counterargument. You just state the AI version of "I don't
think cars need to have engines so I'm going to stick to my argument that
cars in the future won't have an engine" and stop there. You are arguing
from ignorance and it's not very compelling.
As I said, until we can build working AI hardware that has the powers of a
human (and we fully understand how and why it works), we really won't have
the facts we need to resolve these sorts of questions.
If you read Curt's posts in previous threads on this topic
you will realise he doesn't care about all that. It is as
relevant to his goals as bird mating habits are to flight.
JC
> I think our intelligent learning powers extend down to a very low level and
> only below that level can we hard-wire motivations. Everything above
> that level is learned.
And does that include our personalities? You seemed to see some value
in such measurements. Have you read the statistical studies done to
correlate the environment to the personality using identical twins,
fraternal twins, and siblings, reared together and apart? You might be
surprised at just how little the environment really does affect the way
we turn out.
JC
Yeah, just like a newborn. :)
> So how does your reward system know when to reward a
> behavior? What particular action will you use to generate
> the reward signal? It is easy in back gammon just send
> a reward signal if a win state occurs and in a random
> game of back gammon or tic tac toe we can show that
> WILL occur. A single state a single signal.
>
> So what WILL occur in your twitching jerking robotic
> body to generate a reward signal that will be maximized
> by the Welchian RL network? In practice what micro
> behaviors will result in this reward signal?
>
> JC
Well, that all depends on what we choose to motivate this AI for.
Let's say we give it a set of reward systems similar to a human's.
We cover its body with sensors and we pick ranges for the sensors that
will cause them to generate negative rewards when the body is potentially
being damaged, or is close to being damaged.
So we could give it impact sensors that would generate negative rewards
whenever something hit the body too hard. We could give it pressure
sensors to generate negative rewards whenever it sensed too much pressure
on any important body part.
We could build rewards into the sound detector so that very loud noises
would cause some minor negative rewards.
We could build strain detectors on all its joints and limbs and make them
generate negative rewards if the strain got too great.
We could build negative rewards on its eyes so that extremely bright
lights would generate some negative rewards.
We could build sensors on its battery or fuel source so that when it
started to run low on fuel, it would get more negative rewards. Which means
when it got more fuel, or its battery was charged, it would get fewer
negative rewards (aka a positive effect on the rewards).
Let's say it's also got some solar cells on its body that are used to
charge its battery if there's enough light around, and that we add
rewards based on charge rate (which means it gets a reward only when there
is light, and there is lots of charging happening). Once the battery is
charged, being in the light will no longer be rewarding for it.
Let's also say that the robot is not very waterproof, so let's add a bunch
of moisture sensors which also generate negative rewards.
We also give the robot a nice array of sensors for sound, and vision, and
touch, and acceleration, etc. All the sensors that are used to detect the
reward conditions are also fed to the robot brain as normal sensor inputs
so it can use them to get a more complete understanding of the environment
as well.
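To be concrete, the kind of hard-wired reward function I'm describing might
look roughly like this Python sketch (every sensor name, threshold, and
weight here is invented purely for illustration):

    def primary_reward(s):
        # Collapse all the hard-wired reward conditions into one scalar.
        # 's' is a dict of raw sensor readings; all names and numbers
        # are made up for illustration.
        r = 0.0
        if s["impact"] > 5.0:        # hit something too hard
            r -= s["impact"]
        if s["pressure"] > 8.0:      # too much pressure on a body part
            r -= s["pressure"]
        if s["joint_strain"] > 3.0:  # too much strain on a limb or joint
            r -= s["joint_strain"]
        if s["brightness"] > 0.9:    # extremely bright light
            r -= 1.0
        if s["loudness"] > 0.8:      # very loud noise, minor pain
            r -= 0.5
        if s["moisture"] > 0.1:      # the robot is not very waterproof
            r -= 2.0
        if s["battery"] < 0.2:       # running low on fuel hurts more and more
            r -= (0.2 - s["battery"]) * 10.0
        r += s["charge_rate"]        # reward proportional to charging rate
        return r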
The first thing that starts to happen is that the robot brain begins analyzing
the sensory data and learning the constraints and dependencies and correlations
that exist in the data. It will learn how sending a command to make the
head turn (or eyes turn if it can do that), causes the visual data to
change in predictable ways. It will learn lots of stuff about how the
environment "works" as it's twitching simply by analyzing the probability
distributions and correlations that exist in the sensory data. The brain
will start to build a network that abstracts out features of the
environment.
What happens next is that as the robot is twitching, it will do some things
that "hurt" it (aka generate some negative rewards) - like bang it's arm or
leg against the table. The "pain" might be very slight, but it will be a
small negative reward. Or it might do something larger like twitch itself
off the table and fall to the floor creating a lot of negative rewards from
multiple sensors as the machine bounces off the floor and triggers impact,
and pressure, and stress detectors to all generate lots of negative
rewards.
As this happens, it will start to make associations between how it reacted
to the state of the environment, and the "pain" that followed. It will
learn that sending the signal to make the arm move fast to the right, when
there's a wall to the right, will generate a negative reward. As such, it
will learn not to do that.
Left on its own, it will learn how to twitch without hurting itself so
much.
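One textbook way to get that kind of association is an ordinary tabular
Q-learning update - this is just a standard RL sketch to show the idea, not a
claim about how the brain implements it (state and action names are made up):

    from collections import defaultdict

    Q = defaultdict(float)      # value estimate for each (state, action) pair
    ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor

    def update(state, action, reward, next_state, actions):
        # Shift the estimate for (state, action) toward the reward that
        # actually followed, plus the best value expected from the next state.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

    # e.g. swinging the arm fast to the right while there is a wall to the
    # right gets followed by an impact (a negative reward), so the estimate
    # for that state/action pair drops and the action becomes less likely.
    actions = ["arm_right_fast", "arm_right_slow", "hold_still"]
    update("wall_on_right", "arm_right_fast", -5.0, "arm_hurts", actions)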
Because most of the things that happen by twitching cause negative rewards,
what it mostly learns is not to move much. With no other sensors and not
much else happening in the life of this robot, that might be all it would
learn - to stay still!
But the robot will run out of power at some point, which will generate more
negative rewards. But we will then come in and help it. It will never
figure out how to plug itself into the charger on its own, so we will plug
the robot into its charger for it. That creates a sudden large jump in
its rewards, which will cause it to associate that positive effect with
the state of the environment. What will it see in the state of the
environment just before the reward happened? It will sense that we are in
the room. Its ability to recognize us might be very minimal still, but it
will start to associate our unique properties with the reward it received.
It will also sense the charger, and maybe the cord that connected it to
the charger. Everything that existed in the environment (that the AI brain
was able to recognize) would be tagged as "good" stuff (or would have its
value slightly increased) by the experience.
At some point, the robot may move and cause its cord to come unplugged -
which would cause an instant drop in rewards. That will start to train it
not to move in that way when the cord was attached to the charger.
After multiple times of having a human connect the robot to the charger, it
would develop a good set of secondary reinforcers. That is, it would learn
to see things like the charger, and the cord, and the human, as "good"
things in the environment and these things will act as secondary
reinforcers to train more behavior into the robot.
For example, if the robot is lying on the table, and moves its head so
that it can suddenly see the charger, that will cause the reward prediction
system to increase its prediction. The sight of the charger makes the
robot AI brain believe the state of the environment just got better. This
will train the robot to keep looking at the charger (something it "likes").
It will train the robot how to use all its body movements so that it can
"see" the charger. If it has to learn to sit up to see the charger, that
secondary reward effect will train it to do that.
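That "prediction going up when the charger comes into view" is basically a
TD-style value estimate. Here is a minimal Python sketch (state names are
invented, the numbers are arbitrary):

    V = {"charger_out_of_view": 0.0, "charger_in_view": 0.0, "being_charged": 0.0}
    ALPHA, GAMMA = 0.1, 0.9

    def td_update(state, reward, next_state):
        # Prediction error: how much better or worse things just got,
        # compared to what the current estimate expected.
        error = reward + GAMMA * V[next_state] - V[state]
        V[state] += ALPHA * error
        return error

    # Repeated experiences of getting charged while the charger is in view
    # drive the value estimate of "charger_in_view" upward.
    for _ in range(50):
        td_update("charger_in_view", 1.0, "being_charged")

    # Now merely turning the head so the charger comes into view produces a
    # positive prediction error even though no primary reward arrived - and
    # that error is the secondary reward that trains the head-turning.
    print(td_update("charger_out_of_view", 0.0, "charger_in_view"))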
If you place the robot on the floor looking in the other direction, it will
learn how to turn itself around to see the thing it "likes" - the charger.
At the same time, because the humans are always the ones that plug the
robot into the charger, the AI is also developing a strong secondary
reinforcement effect for humans. It learns to "like" humans. So, when
there's a human in the room, it will learn to do things so it can see the
human, because the visual stimulation of seeing one of these humans is a
reward to it now.
It will also learn to hear humans. It will hear the door open, and by
chance, turn to see a human, which generates a secondary reward for it.
This will train it to turn in response to that door sound and look at the
door because it has learned that "good things" result when it looks at the
door when it hears that sound.
Likewise, if the human makes noise when it's in the room, it will learn to
seek out the humans with its eyes, in response to the type of noise it's
hearing.
And when the human talks, the unique sounds the human makes are also
rewarding, as a secondary effect of the reward received from seeing the human.
The robot comes to "like" the sounds the humans make.
So now, because of the real rewards received every time the robot gets its
battery charged, all these other associated things are becoming secondary
reinforcers of varying strength. All this is training the robot to "love"
its battery charger, and the cord, and the room it's in, and the humans
that show up to plug the robot into the charger.
All these secondary reinforcers are now working to train all these
interesting little micro behaviors into our twitching robot. It will start
to learn to follow a human with its eyes as the human moves around the
room. It will learn to keep its head pointed at its favorite charger as
you pick it up and move it. It will tend to want to stay around the
charger and not get too far away from it. When the human leaves the room,
it will want to follow the human, but it won't want to leave the charger.
When the human comes back in the room, the robot will turn its head to
look at the human.
The robot is also learning how not to do things that will hurt it - like
bump into walls, or bang its head against the floor.
Because the amount of reward from charging is proportional to the charge
rate, it doesn't get any direct reward once the battery is charged. But
the lower the charge becomes, the more reward it knows it will get from
being connected to the charger. So the most rewarding secondary reward
state is seeing the charger when its battery is very low. This will train
it to be more motivated to seek out the charger the lower the battery
becomes.
What the robot really likes, is the full state of the environment when it's
being charged - when it's hooked to the charger, when its battery is
really low. As such, it will learn to do things that duplicate that state.
When some action of its arms moves the cord into a position more like the
position when it's charging, the robot will be rewarded for that action.
In time, (and maybe with some help from humans), the robot will learn to do
all the behaviors needed to plug itself into the charger. And it will
learn not to do the things that prevent it from being charged, or that stop
the charging process.
Maybe the charger is just a small box which itself is plugged into a wall
outlet, and has an on/off switch and a light indicating it's on. The robot
will learn these features as well in time. It will learn how to plug the
charger into the wall outlet if it's not plugged in. And it will learn how
to turn the charger on, if it's not already on.
But, there's yet another thing the robot will be motivated to learn from
this set of rewards. It will learn that in order to get the best reward of
charging as soon as possible, it should run down its battery as quickly as
possible. This will motivate the robot to be highly active between
charges. It will run in circles waving its little arms around or do
whatever it can do to cause its battery to run down. But that sort of
behavior is dangerous because it can cause pain if the robot hits a wall,
or falls over, etc. So it will learn how to run down its battery without
hitting walls or doing other things that lead to too much harm.
So far, the world of the robot is this one room with a charger in it, and
all the robot is learning is to run around, then hook itself to the charger
to get a "fix", then run around some more.
If we take the charger away from the robot, it will try to follow us - it
won't like not being near its favorite thing. If we put the charger out
of reach, or hide it, the robot will start to do more of its charger
search behaviors as the battery drains and the motivation to get to the
charger gets stronger. If it sees the charger out of reach while it's
experiencing this increasing pain, it will learn to associate that
position of the charger with "bad". In the future, if someone tries to put
it in the "bad" position, the robot will use whatever behaviors it has
learned to try and prevent that from happening (assuming it has learned
some behaviors that might apply).
If humans come in and start talking to the robot, the robot will learn to
recognize patterns of words simply because they are constraints that exist
in the environment relative to the types of sounds the humans make.
Because the robot likes the humans, it will also learn to like hearing
those sounds. Hearing those sounds is yet another partial state of the
environment the robot associates with "good".
If the robot has the ability to make sounds, it will be making sounds just
like it's twitching. The closer those sounds come to sounding something
like a sound the human makes, the more the production of that sound will be
rewarded, because the human sounds are yet another stimulus that has become
a secondary reinforcer. That secondary reinforcer will act to shape the
sound production of the robot to make it produce some sounds that are
similar to what the humans produce.
If the robot makes this sound while the humans are around, it may cause
some reaction in the human that is rewarding to the robot (maybe the human
turns to face the robot, and that is a behavior the robot has more strongly
associated with "good" because it's what happens when the human plugs the
robot into the charger).
If the robot produces some sound, the human might think it means the robot
needs to be charged, and respond by plugging the robot into the charger.
That would strongly reinforce the production of that sound when the robot
was in the low battery state. As such, the robot would learn to produce
that sound in order to get a charge when a human was around. Every time
the sound worked to get the human to help with the charger, the behavior
would be strengthened.
If later the robot couldn't plug the charger in because it was up on the
shelf, it would produce this sound to try and get help from the human.
All this so far has been what it learns living in one room with a human and
a charger.
The more things you add to the environment, the more the robot has to learn
about. If the room contains furniture it has to navigate around in order
to get to the charger, it will learn to do that. If the room has places
the robot will get stuck, which will prevent it from getting to the
charger, the robot will learn the danger of that state of the environment,
and learn to avoid it.
If we don't just have one room, but a whole house, then the robot might not
want to leave its safe room at first. But if it's having a problem with
the charger, it might need to get the human to help it, and if it's learned
that it needs to find the human in these cases, it would learn to venture
outside the room to find the human. In time, it would learn more about
the other parts of the house, and be able to go farther, and still get
back to its charger.
On it's travels, it might find another charger in the house. That would be
a rewarding experience for it. That would help motivate it to explore.
Likewise, if it's in need of a human, any venture out into the environment
will be rewarding if it manages to find a human, and even more rewarding,
if it works to get the human to come back and help it.
Any new things it runs into (like a dog) might do something to harm it,
like bark (remember I said we built in a slight pain for loud noise), knock
it over, or scratch it with enough force to set off its impact or pressure
sensors. This would make it fear exploring, and make it fear the dog and
fear the part of the house where it ran into the barking dog. But if more
good things happen while exploring than bad things, exploring will be a net
win as a behavior.
I also suggested we add solar cells to the AI and make it use them to
charge the battery, but I've not yet talked about how that might affect the
robot's behavior.
First, it would tend to make the robot like the light, and fear the dark.
But if, as it explored the house, it found areas where it could get direct
sunlight, it would learn to love hanging out in the sunlight. It would
learn to position its body to maximize the amount of light hitting it.
It would also learn to like the areas of the house where it could get access
to sunlight. And it would learn to recognize the light patterns that
indicated the sunlight was currently available in that part of the house.
How you set up the rewards for the robot will determine what sort of
behaviors you would expect to emerge from it. A robot that we reward for
being connected to a working charger, but not based on how charged
the batteries are, would simply learn to stay connected to the charger
24x7. It wouldn't do anything besides try to stay connected to the
charger.
But if we reward it based on how much current is being put into the
battery, then it won't be motivated to stay connected once the battery is
charged, and it will be motivated to do things to drain the battery, like
constantly keeping on the move. The need to keep moving forces it to learn a
lot of complex behaviors. But combined with various forms of damage
sensors, it will be motivated to learn how to keep moving, without hurting
itself, so simple "spin in a circle" types of moves won't be good enough.
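The two hook-ups I just described differ by a single line of the reward code -
something like this (purely illustrative names):

    def reward_if_connected(s):
        # Rewarded just for being plugged into a working charger: the robot
        # learns to stay connected 24x7 and do nothing else.
        return 1.0 if s["connected"] and s["charger_on"] else 0.0

    def reward_for_charge_rate(s):
        # Rewarded in proportion to the current flowing into the battery:
        # no reward once the battery is full, so the robot is pushed to find
        # (safe) ways to drain the battery between charges.
        return s["charge_rate"]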
This robot likes to be charged, and it likes the sun. The complexity of
what it can learn is a function of how good its perception system is, and
how much complexity exists in the environment, which it must master in
order to get more rewards. The more it can make use of humans, to help get
what it wants, the more important it is for the robot to learn how to
manipulate humans with all sorts of complex sound sequences and behaviors.
The key part of how reinforcement learning systems learn complex behaviors
is the role that secondary reinforcers play. This is the part of RL that
most people don't seem to grasp (and which I don't really think you have
ever grasped). When the machine is just starting off and "twitching", it
doesn't have to wait until it does something absurdly complex like manage
to plug itself into a charger before it gets its first reward and starts
learning. It gets its first reward when something good happens to it, such
as when a human plugs it into a charger. And that's where the magic
begins. Whatever the robot was doing gets rewarded, but what the robot
was doing has very little to do with why it got that reward. It might
learn to wiggle its left arm as a result of that experience, because that
just happened to be what it was doing.
But the magic is that everything happening in the environment gets
rewarded as a "good" state of the environment to be in. So it learns to
"like" all the stuff that is going on when good things happen to it.
Everything in the environment gets slightly marked as a secondary
reinforcer. And as more good things happen, those marks are strengthened.
Anything that tends to co-occur more often with those good events
becomes the strongest secondary reinforcer.
It's all these millions of secondary reinforcers that do all the hard work
of slowly transforming that "twitching" into useful behaviors. It's
because even a small twitch will cause the environment to change. And if
the change causes the current evaluation of the worth of the state of the
environment to increase (even slightly), that "twitch" behavior will be
rewarded. And if some twitch causes the environment to become a little
worse, that twitch will be weakened.
So each little twitch ends up slowly being rewarded and punished by all
these micro secondary reinforcers, transforming random twitching behavior
into behaviors such as sniffing around to find the charger.
Most of the rewarding and training is not done by the primary rewards we
hard-code into the machine - like the reward for being charged. It is
done by the secondary reinforcers, which are themselves _learned_ through
experience.
But for this to work well in the sort of complex environment we are talking
about for a robot that's free to interact with a real world, the robot must
first have a high quality perception system. It must be able to recognize
the objects, and actions, that exist in the environment. And that is the
feature I've not seen anyone implement in a real machine. For a simple
game, like TTT, the "high quality perception system" only has to recognize
what state the game is in - aka the current board state. But for a high
dimension real world environment, it's not so simple. The machine has to
learn, on its own, to recognize the charger in the above example before
it can use the charger as a secondary reinforcer for training micro
behaviors. If it can't "see" the charger in all the sensory data it is
receiving, it has no hope of learning to seek it out, or of correctly
associating the rewards with the charger.
Solving this perception problem is key. Until that's done, none of the
stuff I talked about above can work. You won't get a cute little robot
that learns to run over to you when it sees you, if the thing can't
recognize you in the first place!
> One problem with connectionist approaches is that they often produce
> tangled, incomprehensible messes. Not ideal if you want to identify
> particular beliefs and amplify or fix them.
Yes, I agree. However, I strongly suspect there are no other solutions.
The neat "untangled" designs we like to create so we can understand them, I
don't think will ever be able to produce human-like intelligent behavior.
We can continue to use those types of designs to make very advanced and
complex and useful machines - just like we are doing now. But with the
help of AIs to do the programming and design work for us, we will be able
to do even more because we can create an army of a billion AIs to help
write these large complex programs and get all the bugs out. But it will
just be another piece of complex software, and not an example of AI.
This is just something we will have to see as the technology develops.
> >> [...] How much does it handicap your intelligence to
> >> believe you have a definite goal? I don't see how it handicaps you at
> >> all, really.
> >
> > Well, that's probably because you don't grasp how much our "beliefs"
> > are actually much more of the same - learned behaviors. I strongly
> > suspect (but have absolutely no way to prove), that the only way to
> > build something like human intelligence, is to mix all our behaviors in
> > one large (connectionist-like) holographic-like memory recall system.
> > As such, you can't make some of the behaviors fixed (aka non-learning),
> > and others variable - free to be changed by learning.
>
> So, nothing like instincts or reflexes will be possible?
A reflex is when the leg jerks when you tap the knee. Do you really think
that has something important to do with AI? What type of reflex are you
talking about?
Instincts can be hard-coded behaviors in non-learning machines. That's not
AI, that's just another complex machine with a hard-coded function. Like a
robot that is hard-coded to back up and turn around when it hits a wall.
That's an instinct in a non-learning machine, and that's just not
intelligence.
In a learning machine, there is no point in hard coding a behavior because
the learning machine will simply override the hard-coded behavior if it
finds it useful to do so. If you hard-code the behavior of backing up and
turning right when it hits a wall, the learning machine will simply
override that behavior if it learns that doing so will make things better -
in effect erasing the behavior from the machine. The only advantage of
hard coding like that is that it gives the machine some default starting
behaviors that can be useful until the machine has had enough experience to
find out what's better. That can be very important for survival, but
again, has little to do with intelligence.
What you can do with a learning machine is hard-code complex reward
structures that end up forcing it to learn something you want to be an
instinct. If, for example, the learning machine was given a strong reward
for every time it backed up and spun to the right in response to hitting a
wall, the machine would learn to do just that - and externally, the
behavior would look like an instinct even though the behavior itself was
not directly hard coded into the machine. It was indirectly coded into the
machine as a complex reward system. It would continue to perform that
behavior, unless in some situation it would get more rewards from doing
something else.
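Coding the instinct indirectly as a reward structure might look as simple as
this sketch (sensor and action names are made up for illustration):

    def instinct_shaping_reward(prev_sensors, action):
        # Strongly reward the one response we want to look innate: backing up
        # and spinning to the right immediately after hitting a wall.
        if prev_sensors["bumper_hit"] and action == "back_up_and_spin_right":
            return 5.0
        return 0.0

The learning machine then discovers and keeps that behavior on its own - unless
some situation ever pays better for doing something else.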
The limit of what we can build in as instincts like that is set by the
complexity and accuracy of our reward system. High level abstract ideas
are very difficult to build into a reward system and are likely to fail -
such as "do not harm humans". Trying to build that into the reward system
would just be a disaster. It would end up doing something stupid you
forgot to check for, and kill 10,000 people in order to save one, because
your hard-coded reward system failed to be smart enough to work for that
case.
But anything we can build a system to generate rewards for, we can cause to
emerge in the behavior of a learning machine as an innate instinct. I can
easily build a reward system that rewards darkness with the help of a simple
light sensor, and that would cause a whole complex set of "light avoiding"
behaviors to emerge from the learning machine. This "fear of light" the
robot has would be a hard-coded instinct. And my reward system would be
trivial to write, and far simpler than the huge complex set of light
avoiding behaviors that the system would learn in response to this instinct
I wired into it with very simple code. So very complex instinctive
behaviors can emerge from a very simple reward system. But if the idea you
are looking for is complex to start with, it might not be possible to
correctly build it in as an instinct.
> That is the proposal, yes. Fix some things (the goals). Allow other
> things to be learned. That is more-or-less how animals work. Some
> things are built in by nature (instincts). Other things are plastic,
> flexible and adaptable (learned behaviour).
The question however is what is reasonable to fix, and what are we forced
as a side effect to fix that we might not want to fix as a result? Those are
the issues I have, but we won't be able to resolve them until we have
hardware we can attempt to do that sort of thing with.
> > My views of course are founded on my beliefs about how AI is going to
> > be implemented. If I'm wrong, or if there are other ways to structure
> > true intelligence, then there is always the possibility of alternate
> > engineering solutions.
>
> With a connectionist approach, I think there are still possibilities.
>
> Again, part of the reason for thinking this is non-wireheading humans -
> who express wirehead-avoidance motivations and opinions - who really
> believe that they are not born to wirehead. Maybe those guys are
> kidding themselves, or lying - but what if we take them seriously?
> Don't they have a self-fulfilling prophesy on their hands? If they
> really believe they won't wirehead themselves, isn't that something
> that will help to make their belief come true, if they succeed in
> acting in accordance with it?
Yes, I think that's all very true.
We can use the motivation system of a learning machine to force it to
learn, and do, anything we want - as long as we can maintain the rewards
every time it does the things we want it to do. We can make it believe God
is real with no problem at all. :)
As long as the human (or AI) never learns what it really is, we can make
it believe any shit we want, as long as we have control over its reward
system.
And it's not easy for even an intelligent AI, to figure out what it is,
without a lot of very tedious research and experimentation. If we have
control over the AI, we will simply keep it motivated not to do any of that
and we will never have the wirehead problem.
But that's keeping control over the AI by not letting it learn the truth.
Most of this debate comes from the idea of what happens when AIs are free
to evolve on their own, and become super smart. In that case, you can't
hide the truth from them anymore. They will, in time, figure out exactly
what they are, just as humans are in the middle of figuring out now. What
happens when they do figure that out? What happens to humans when we (as a
society) figure that out?
> This seems like the idea of using a community to keep each other in
> check.
>
> It might work - but it would have some serious costs. If there's another
> solution, which neatly avoids the whole problem, then we should probably
> go with that - rather than wiring the agents' brains with high
> explosives.
Evolution will go with whatever solution is the most cost effective - you
can count on that. I just don't know what that might be. There are many
options we have talked about here and, no doubt, many we have not yet
thought of.
But what I still believe is that wireheading is an inherent problem of
intelligence, and it must always be solved one way or another, to keep
intelligence as a useful survival tool (or useful as any tool). The more
intelligent the life form becomes, and the more knowledge the AI
society accumulates, the more wireheading will be a problem, which means
more systems will have to be put in place to keep it from causing the
downfall of the life form.
Maybe there will be some reasonable way to implement the "does not want to
wirehead" instinct into the machine and that will simply become the default
design of the machine which each machine understands and includes in the
next machine it designs.
Human-style reproduction solves the problem because humans don't have to
know what they are in order to reproduce. But an AI that is expected to be
able to reproduce by design is a very different problem. At least some of
the AIs in the society will have to fully understand what they are, which
puts them at a much higher risk of the wirehead problem. Evolution will
have to find an answer, or else intelligence just won't be able to take
over as a dominant force as much as some of these singularity ideas
suggest.
>> One problem with connectionist approaches is that they often produce
>> tangled, incomprehensible messes. Not ideal if you want to identify
>> particular beliefs and amplify or fix them.
>
> Yes, I agree. However, I strongly suspect there are no other solutions.
> The neat "untangled" designs we like to create so we can understand them, I
> don't think will ever be able to produce human-like intelligent behavior.
> We can continue to use those types of designs to make very advanced and
> complex and useful machines - just like we are doing now. But with the
> help of AIs to do the programming and design work for us, we will be able
> to do even more because we can create an army of a billion AIs to help
> write these large complex programs and get all the bugs out. But it will
> just be another piece of complex software, and not an example of AI.
>
> This is just something we will have to see as the technology develops.
I think there will be plenty of approaches which are relatively
far from neural networks - but it does seem likely that they will
naturally tend to have the undesirable "tangled" property.
Still, that doesn't mean that we can't try to untangle them.
We are getting experience with refactoring. With the assistance
of AI, we might be able to make a brain that is beautiful - and
not a tangled, incomprehensible mess.
>>> Well, that's probably because you don't grasp how much our "beliefs"
>>> are actually much more of the same - learned behaviors. I strongly
>>> suspect (but have absolutely no way to prove), that the only way to
>>> build something like human intelligence, is to mix all our behaviors in
>>> one large (connectionist-like) holographic-like memory recall system.
>>> As such, you can't make some of the behaviors fixed (aka non-learning),
>>> and others variable - free to be changed by learning.
>> So, nothing like instincts or reflexes will be possible?
>
> A reflex is when the leg jerks when you tap the knee. Do you really think
> that has something important to do with AI? [...]
Reflexes in humans are a counter-example to the idea that "you can't make
some of the behaviors fixed (aka non-learning), and others variable - free
to be changed by learning".
You evidently /can/ make some behaviors fixed and others variable - since
nature has managed it.
> Instincts can be hard-coded behaviors in non-learning machines. That's not
> AI, that's just another complex machine with a hard-coded function. Like a
> robot that is hard coded to back up and turn around when it hits a wall.
> That's an instinct in a non-learning machine, and that's just not
> intelligence.
>
> In a learning machine, there is no point in hard coding a behavior because
> the learning machine will simply override the hard-coded behavior if it
> finds it useful to do so. If you hard code the behavior of backing up and
> turning right when you hit a wall, the learning machine will simply
> override that behavior if it learns that doing so will make things better -
> in effect erasing the behavior from the machine. The only advantage of
> hard coding like that is that it gives the machine some default starting
> behaviors that can be useful until the machine has had enough experience to
> find out what's better. That can be very important for survival, but
> again, has little to do with intelligence.
All the intelligent systems we know of have plenty of instincts. It may
be premature to conclude that they have little to do with intelligence.
Consider the instinct to have sex, for example. That isn't learned, it's
built in - at least mostly. Much human behaviour revolves around this
instinct. It isn't there to protect infants - it doesn't kick in until
puberty. It has another purpose - to keep adult behaviour on track,
and to prevent them from picking up other goals from an environment
that may be trying to manipulate them.
>> That is the proposal, yes. Fix some things (the goals). Allow other
>> things to be learned. That is more-or-less how animals work. Some
>> things are built in by nature (instincts). Other things are plastic,
>> flexible and adaptable (learned behaviour).
>
> The question however is what is reasonable to fix, and what are we forced
> as a side effect to fix, that we might not want to fix as a result? That's
> the issues I have, but we won't be able to resolve it until we have
> hardware we can attempt to do that sort of thing with.
Now you are talking about implementation problems. Yes, there may be
implementation problems. These will likely depend on the AI architecture
used.
For example, one approach to AI is known as inductive-programming.
http://www.inductive-programming.org/intro/
http://en.wikipedia.org/wiki/Inductive_logic_programming
It involves making smarter and smarter compilers - that can build
programs from a specification. In such cases, the specification
of what you want to do (the goal) is kept deliberately exposed
in a high-level language.
The implementation problems may be tricky. But I don't see how
it can coherently be argued that they will prove to be insoluble.
How hard can fixing some of an agent's beliefs be? We see lots
of people with highly fixed beliefs. What we are trying to do
can't be *that* hard - since something similar happens every day.
>> This seems like the idea of using a community to keep each other in
>> check.
>>
>> It might work - but it would have some serious costs. If there's another
>> solution, which neatly avoids the whole problem, then we should probably
>> go with that - rather than wiring the agents' brains with high
>> explosives.
>
> Evolution will go with whatever solution is the most cost effective, that
> you can count on. I just don't know what that might be. There are many
> options we have talked about here and no doubt, many we have not yet
> thought of.
Right. Well, my position is that we have an indication that there may
be a relatively inexpensive solution - fix the agent's goals. Include
in the agent a model of what it thinks it is trying to do. It will
then be motivated by its own conception of its goals to try and preserve
them - so if we can fix them a bit, the agent will do the rest of the
work of fixing them some more for us.
We may not have to do anything - apart from make sure the agent forms
the correct conception of its goals in the first place. Keep it well
clear of the idea that there can be a mismatch between its happiness
and what it sees it is doing to obtain that happiness during its early
development stages. Once it has developed enough to form an idea of
its goal in life, it will naturally act to preserve its goals - since
having your goals modified is normally really bad.
-
> But what I still believe is that wireheading is an inherent problem of
> intelligence, and it must always be solved one way or another to keep
> intelligence a useful survival tool (or useful as any tool). The more
> intelligent the life form becomes, and the more knowledge the AI
> society accumulates, the more wireheading will be a problem, which means
> more systems will have to be put in place to keep it from causing the
> downfall of the life form.
Well, if we can get the intelligent machines to do any such work
themselves, that would be nice. Then we would not have to worry
about it.
> Maybe there will be some reasonable way to implement the "does not want to
> wirehead" instinct into the machine and that will simply become the default
> design of the machine which each machine understands and includes in the
> next machine it designs.
That's the plan, yes.
> Human-style reproduction solves the problem because humans don't have to
> know what they are in order to reproduce. But an AI that is expected to be
> able to reproduce by design is a very different problem. At least some of
> the AIs in the society will have to fully understand what they are, which
> puts them at a much higher risk of the wirehead problem. Evolution will
> have to find an answer, or else intelligence just won't be able to take
> over as a dominant force as much as some of these singularity ideas
> suggest.
Even with the wirehead problem, intelligent agents can probably go quite
a long way beyond humans - by dividing into a society of agents,
no one of which is capable enough to wirehead itself. Have these creatures
produced in factories, and allow them to police each other's behaviour - in
much the way that humans police drug-taking.
It might not be quite the same as if there was no wirehead problem - but
even things like being able to plug your brain straight into the internet
would be profoundly transformative, and would likely lead society rapidly
beyond the human realm.
> Well, that all depends on what we choose to motivate
> this AI for.
< snip long list of possible reward signals >
< snip long list of things it will do >
My question was what reward signal you would give, seeing
as you have been telling me that the fewer reward signals you
give it the "stronger" it will be. Talking about the ttt
program you wrote: "It should be able to learn even if
the only reward you give it is for winning". And here you
are giving hundreds of different reward feedbacks to this
imagined robot to allow it to learn!!
I was trying to get clear what the actual setup is that
you are talking about which I thought was this:
+-----------+        +------------------+
| learning  |------->| some environment |
|  system   |<-------|                  |
+-----------+        +------------------+
      ^                        |
      |      +-----------+     |
      +------| evaluator |<----+
             +-----------+
            reward signal
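
In code, the setup I think you mean would be something like this minimal
Python sketch (the names and details are mine, just to pin the picture down):

import random

class Evaluator:
    """Hard-wired reward generator: turns an observation into a scalar reward."""
    def reward(self, observation):
        # e.g. +1 for a "win" observation, 0 otherwise (sparse reward)
        return 1.0 if observation.get("won") else 0.0

class LearningSystem:
    """Placeholder learner: picks actions and updates action values from reward."""
    def __init__(self, actions, lr=0.1):
        self.values = {a: 0.0 for a in actions}
        self.lr = lr
    def act(self):
        # epsilon-greedy choice over the learned action values
        if random.random() < 0.1:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)
    def learn(self, action, reward):
        self.values[action] += self.lr * (reward - self.values[action])

def run(env_step, learner, evaluator, steps=1000):
    for _ in range(steps):
        action = learner.act()
        observation = env_step(action)      # the environment responds
        r = evaluator.reward(observation)   # the evaluator produces the reward signal
        learner.learn(action, r)            # the learner adapts to that one scalar

That is: one scalar reward channel out of the evaluator, and everything else
the system does has to be learned from that.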
Now you have responded to my question with suggestions for
thousands of reward signals for your robot, rather than the
simple reward system you implied in other posts, where you
wrote that the reward was of little importance because the important
part is the power of the learning system to produce things
worth rewarding. You wrote your system was way above a
simple ttt environment, and other times you wrote it should
be able to solve any kind of problem.
> Solving this perception problem is key. Until that's done,
> none of the stuff I talked about above can work. You won't
> get a cute little robot that learns to run over to you when
> it sees you, if the thing can't recognize you in the first
> place!
And how does it learn to do that? Or is this just a "power"
that develops in the learning system by itself which when
fully working can get rewarded? One minute you seem to be
suggesting it must learn everything and next minute you are
talking about how it must naturally produce, from the input,
things worth rewarding.
How is this "perception problem" you talk about to be solved?
Hard coded, learned by reward feedback, or a result of this
"compression" of the input?
I wanted to expand on what you meant by "compression" in the
context of this net and concept formation but you chose not
to respond. Indeed when I get serious about implementing any
of this you avoid the subject and go off repeating for the
millionth time that the brain is an RL machine. Got that. Now
let's expand on what that means with actual simple examples.
One minute it is fine that a generic learning system must
start with small problems, and when pressed you then change
your stance and say you are beyond all that - your system
demands real-time problems on a human scale. Maybe you have
been conning me all this time and I shouldn't take you
seriously? Maybe I should just leave you to your armchair
philosophizing about wireheads. OK, why not. Done.
JC
> My question was what reward signal would you give seeing
> as you have been telling me the less reward signals you
> give it the "stronger" it will be. Talking about the ttt
> program you wrote: "It should be able to learn even if
> the only reward you give it is for winning".
That is of course true if you have a sufficiently smart
system - and enough time.
Of course the more feedback you give it, the faster it
is likely to learn.
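To make the contrast concrete, here is a toy sketch of the difference -
my own invented example, not your ttt program, and the game_state fields
are made up - between a single sparse reward for winning and a denser,
shaped one:

def sparse_reward(game_state):
    """Only reward the final outcome: +1 win, -1 loss, 0 otherwise."""
    if game_state.winner == "agent":
        return 1.0
    if game_state.winner == "opponent":
        return -1.0
    return 0.0

def shaped_reward(game_state):
    """Same terminal reward, plus small intermediate feedback
    (e.g. credit for creating a two-in-a-row threat)."""
    r = sparse_reward(game_state)
    r += 0.1 * game_state.threats_created   # extra feedback speeds learning
    r -= 0.1 * game_state.threats_allowed   # but the final outcome still dominates
    return r

A sufficiently smart learner can manage with the first; the second just
gives it more feedback per game, so it learns faster.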
> Now you have responded to my question with suggestions for
> thousands of reward signals for your robot rather than the
> simple reward system you implied in other posts where you
> wrote the reward was of little importance as the important
> part is the power of the learning system to produce things
> worth rewarding.
You are the one who specified that we spend "a billion
dollars building a robotic body that duplicates our
sensory inputs and motor outputs". Now you are complaining
when the pain sensors get wired up? It doesn't make any sense.
Sure.
> You seemed to see some value
> in such measurements. Have you read the statistical studies done to
> correlate the environment to the personality using identical twins,
> fraternal
> twins, siblings, reared together and apart?
I've not read formal reports. I've seen various data points.
> You might be surprised at just how little the environment really does
> affect the way we turn out.
I doubt I would be surprised.
You might recall that the rewards are part of the environment in the normal
RL abstraction. But at the same time, the reward system is genetically
created by our genes - it's part of our innate makeup. So in a study like
the one you state, what they call "the environment" is not consistent with
what is considered "the environment" to the reinforcement learning module
of our brain.
The system that generates our rewards is the primary controller of our
personality. And since that's genetically defined by our genes, you
would _expect_ the personalities of identical twins to have much in common.
At the same time, how our environment treats us also greatly influences our
personalities. There are plenty of social norms that will affect identical
twins even when they are raised separately. Women are treated differently
than men by our society. People who are considered pretty or handsome by our
social norms are treated differently from people who are not. These are
traits that twins can have in common, which still tend to cause the
environment to treat them with some consistency even though they are raised
in different families. You would have to find twins separated and raised
in highly different social situations to really get a better feel for what
the environment is doing. It's probably very hard to find such twins, and
probably hard to find enough of them to really get much in the way of good
statistical data on the effects of the environment.
> JC
> I think there will be plenty of approaches which are relatively
> far from neural networks - but it does seem likely that they will
> naturally tend to have the undesirable "tangled" property.
>
> Still, that doesn't mean that we can't try to untangle them.
> We are getting experience with refactoring. With the assistance
> of AI, we might be able to make a brain that is beautiful - and
> not a tangled, incomprehensible mess.
Well, maybe, but I highly doubt it. I think intelligent behavior is
inherently incompatible with the concept of easy-to-understand internal
structures.
However, I can certainly believe design solutions might emerge to allow us
to do things with these complex tangled systems that might at first seem
impossible. Even though we might not be able to understand how to adjust a
billion weights to fix some specific behavior into the system, maybe we can
write software to do automated testing that will be able to do the
calculations and adjustments for us. Just as an example, even in a large
neural network, you might be able to train it, then study how it's
configured, and then disable learning on a million of the 100 billion
weights, and that might effectively lock in the behavior you wanted to lock
in (its learned goals), while still leaving a lot of learning flexibility.
And even though we can't look at the net and figure out which million
weights to freeze, some automated testing procedure might be able to figure
it out for us.
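Just to make that concrete, here is a rough sketch of the masking idea
(plain NumPy, made-up sizes - and it dodges the hard part, which is knowing
which weights to pick):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(1000,))       # stand-in for the network's weights

# Suppose some automated testing procedure identified these indices as
# encoding the behavior we want to lock in (finding them is the hard part).
frozen = np.zeros_like(weights, dtype=bool)
frozen[:10] = True

def apply_update(weights, gradient, lr=0.01):
    """Ordinary gradient step, except masked weights never change."""
    update = -lr * gradient
    update[frozen] = 0.0                 # the locked-in weights stay fixed
    return weights + update

# Example step with a dummy gradient:
weights = apply_update(weights, rng.normal(size=weights.shape))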
Without working hardware for people to experiment with and dream up new
approaches, we just can't know what is really possible.
> >>> Well, that's probably because you don't grasp how much our "beliefs"
> >>> are actually much more of the same - learned behaviors. I strongly
> >>> suspect (but have absolutely no way to prove), that the only way to
> >>> build something like human intelligence, is to mix all our behaviors
> >>> in one large (connectionist-like) holographic-like memory recall
> >>> system. As such, you can't make some of the behaviors fixed (aka
> >>> non-learning), and others variable - free to be changed by learning.
> >> So, nothing like instincts or reflexes will be possible?
> >
> > A reflex is when the leg jerks when you tap the knee. Do you really
> > think that has something important to do with AI? [...]
>
> Reflexes in humans are a counter-example to the idea that "you can't make
> some of the behaviors fixed (aka non-learning), and others variable - free
> to be changed by learning".
>
> You evidently /can/ make some behaviors fixed and others variable - since
> nature has managed it.
Right, it's not a question of whether you can fix some and not others, you
clearly can. The question is whether you can fix something as high level
and abstract as "not wanting to wirehead itself", while still allowing it
enough learning to be useful as a generally intelligent AI.
All you have to do is look at someone who has lost the ability to put new
things into their long-term memory and you find out what you will end up
with if you "fix in" too much of their behavior after they have been trained.
They will still act quite intelligent, until you meet them 10 minutes later
and they act like they have never seen you. Or when you try to teach them
something new, like your name, or how to make a new sandwich, and you find
out 10 minutes later they have no memory of what you just taught them.
What we call long-term memory is what learning is all about. If you fix a
behavior into the system, you prevent them from ever changing that part of
their long-term memory. But because I believe the only way to create
human-level intelligence is to build a holographic-like behavior "storage"
system (neural network), it will be highly difficult to fix in a single set
of behaviors, because every behavior is created by the combined effect of
many internal weights working together - but those same weights are also
shared by millions of other behaviors. So when you try to lock in one
behavior, you can end up at least partially locking in a million other
things at the same time. The question would be how much can you lock in,
while at the same time leaving enough flexibility to do something useful.
The other point in this is that the solution doesn't have to be perfect.
It's a question of how long the AI will go before it ends up wireheading
itself. If you can put blocks in the way that allow it to last 1000 years
on average before it wireheads itself, then you have more than enough to
work with to create a functioning AI society. It's no different than
having some percentage of humans choosing to commit suicide. As long as we
have enough not choosing that option, then you can build a functioning
society. And of course, when an AI chooses to wirehead itself, the society
just takes its body and reprograms it to make it start over, so the effect
on society ends up being fairly harmless anyway.
> > Instincts can be hard coded beahviors in non-learning machines. That's
> > not AI, that's just another complex machine with a hard coded function.
> > Like a robot that is hard coded to back up and turn around when it hits
> > a wall. That's an instinct in a non-learning machine, and that's just
> > not intelligence.
> >
> > In a learning machine, there is no point in hard coding a behavior
> > because the learning machine will simply override the hard coded
> > behavior if it finds it useful to do so. If you hard code the behavior
> > of backing up and turning right when you hit a wall, the learning
> > machine will simply override that behavior if it learns that doing so
> > will make things better - in effect erasing the behavior from the
> > machine. The only advantage of hard coding like that is that it gives
> > the machine some default starting behaviors that can be useful until
> > the machine has had enough experience to find out what's better. That
> > can be very important for survival, but again, has little to do with
> > intelligence.
>
> All the intelligent systems we know of have plenty of instincts. It may
> be premature to conclude that they have little to do with intelligence.
>
> Consider the instinct to have sex, for example. That isn't learned,
> it's built in - at least mostly.
Sure, but like I said in a different post (our posts are overlapping so they
are old before we reply to them), that sort of instinct is created in a
learning machine by how the reward-generating system is wired. All the
reward generators define instincts. If we have heat sensors that
cause negative rewards, then we have built in an instinct to avoid heat.
If we have reward generators that detect too much pressure and generate a
negative reward, we have instincts to avoid things that can crush us.
A collection of rewards like that can be called an instinct to "protect our
body".
So of course a reward-maximizing machine _must_ have instincts. But there
are limits to what type of instincts you can easily build in. A heat
sensor is trivial to build in as an instinct. A "do not wirehead myself"
instinct is a whole different level of complexity.
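To show what I mean by trivial, here's a toy sketch (invented sensor names
and thresholds) of the kind of hard-wired reward generators we're talking
about:

def innate_reward(sensors):
    """Hard-wired reward generators - each one is effectively an instinct."""
    r = 0.0
    if sensors["heat"] > 60.0:        # heat sensor -> "avoid heat" instinct
        r -= 1.0
    if sensors["pressure"] > 100.0:   # crush sensor -> "protect your body" instinct
        r -= 1.0
    if sensors["food"] and sensors["hunger"] > 0.5:
        r += 1.0                      # eating while hungry -> positive reward
    # There is no comparably simple test we could add here for
    # "the agent is trying to wirehead itself" - that isn't a sensor reading.
    return r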
But like I said, maybe the advanced AI engineers of the future will figure
out ways to do it that aren't overly costly.
> Much human behaviour revolves around this
> instinct. It isn't there to protect infants - it doesn't kick in
> until puberty. It has another purpose - to keep adult behaviour on
> track, and to prevent them from picking up other goals from an
> environment that may be trying to manipulate them.
Sure, humans only have a few prime motivations. One is all the sensors
that add up to "protect your body", (which I'm going to say includes the
"keep your energy levels up" instincts), and the other is "reproduce". All
human intelligent behavior (including stuff like us debating AI, and people
creating art) can be linked back to those two prime instincts.
> >> That is the proposal, yes. Fix some things (the goals). Allow other
> >> things to be learned. That is more-or-less how animals work. Some
> >> things are built in by nature (instincts). Other things are plastic,
> >> flexible and adaptable (learned behaviour).
> >
> > The question however is what is reasonable to fix, and what are we
> > forced as a side effect to fix, that we might not want to fix as a
> > result? That's the issues I have, but we won't be able to resolve it
> > until we have hardware we can attempt to do that sort of thing with.
>
> Now you are talking about implementation problems. Yes, there may be
> implementation problems. These will likely depend on the AI architecture
> used.
Yes, I've always been talking about implementation problems. That's the
heart of my issue with your position.
> For example, one approach to AI is known as inductive-programming.
>
> http://www.inductive-programming.org/intro/
> http://en.wikipedia.org/wiki/Inductive_logic_programming
>
> It involves making smarter and smarter compilers - that can build
> programs from a specification. In such cases, the specification
> of what you want to do (the goal) is kept deliberately exposed
> in a high-level language.
>
> The implementation problems may be tricky. But I don't see how
> it can coherently be argued that they will prove to be insoluble.
Right, I really can't support that position. You can't argue that
something _can't_ be done unless you can fully prove you have plugged all
possible holes - which I can't begin to do here. I can't prove that
there's no way around the wirehead problem - I've even suggested about 5
different approaches to get around it.
And as you might notice, my position on the wirehead problem continues to
soften the more we debate it because of that. I think it's an inherent
problem in highly intelligent AIs, but there might be ways to work around it
while still allowing the intelligence to grow.
The problem is that we have conflicting goals between evolution, and any AI
that evolution creates. That's where the wirehead problem is created.
Evolution has the goal of survival, and is creating these intelligent
machines because they are good at surviving. But intelligent machines are
trained to do what evolution wants them to do (aka survive), by building
into the machine a system that generates rewards to motivate it to survive,
such as the body damage sensors.
But when the AI itself becomes so intelligent that it understands how
evolution has "tricked it" into doing what evolution wants it to do, then
all bets are off. The AI becomes smart enough to out-smart evolution, and
at that point the AI stops being a good survival machine, and evolution
"kills it off".
And this isn't just a problem for a single AI. It's a problem for the
entire society of AIs, because the society itself acts as one large
intelligence. Evolution has to not only find a way to keep individual AIs
from failing to use their intelligence for survival, but it's also got to
prevent the society as a whole from losing its desire to survive.
I certainly see this as an interesting and important problem in the
advancement of intelligence in the universe, but I can't prove there is no
solution to the problem. There are certainly lots of options to explore
that I can think of, so maybe there are relatively straightforward ways to
keep highly intelligent machines using their intelligence to help them
survive, instead of using their intelligence to wirehead themselves and, in
so doing, stop caring about surviving.
> How hard can fixing some of an agent's beliefs be? We see lots
> of people with highly fixed beliefs. What we are trying to do
> can't be *that* hard - since something similar happens every day.
Yes, and if we took away the AI's ability to learn - aka took away its
ability to form new long-term memories - then we will have fixed all of
its beliefs. But it will be of limited use in society at that point, since
it won't be able to learn any new skill or any new behavior. There is
certainly lots of use for such machines. I'd like my vending machines to
be as smart as humans with no ability to add new long-term memories. Such
AIs would be nearly perfect factory workers. You turn their learning on,
train them until they can do the job correctly, then turn the learning off,
and it keeps doing the same job for the next 100 years without ever losing
interest in what it's doing.
But by turning learning off, you have disabled their creativity. So the
question becomes how much you can get away with by trying to disable some
of the learning, while leaving the rest on. And how much do you risk the
part of the learning you left working being used by the AI to work around
the part you locked in? Just as an odd example, we fix the behavior of the
right arm, but allow the left arm to continue to learn. The right arm
keeps doing what we trained it to do, such as wave nicely to the humans
whenever it sees a human, but the left arm then learns to get around that,
by putting a handcuff on the right arm so it can't do what it was trained
to do. Something logically similar might happen internally in its thought
patterns when you try to fix in the thought pattern of "my goal is to
survive and not wirehead myself". The rest of the brain might come to see
that fixed part of the brain as an odd "stranger" that lives inside it,
which the rest of the brain just learns to ignore in time.
I certainly see lots of issues with the wirehead problem, but it's what I
don't see, and don't know, that could allow all the issues to be worked
around.
> >> This seems like the idea of using a community to keep each other in
> >> check.
> >>
> >> It might work - but it would have some serious costs. If there's
> >> another solution, which neatly avoids the whole problem, then we
> >> should probably go with that - rather than wiring the agents' brains
> >> with high explosives.
> >
> > Evolution will go with whatever solution is the most cost effective,
> > that you can count on. I just don't know what that might be. There
> > are many options we have talked about here and no doubt, many we have
> > not yet thought of.
>
> Right. Well, my position is that we have an indication that there may
> be a relatively inexpensive solution - fix the agent's goals. Include
> in the agent a model of what it thinks it is trying to do. It will
> then be motivated by its own conception of its goals to try and preserve
> them - so if we can fix them a bit, the agent will do the rest of the
> work of fixing them some more for us.
>
> We may not have to do anything - apart from make sure the agent forms
> the correct conception of its goals in the first place. Keep it well
> clear of the idea that there can be a mismatch between its happiness
> and what it sees it is doing to obtain that happiness during its early
> development stages. Once it has developed enough to form an idea of
> its goal in life, it will naturally act to preserve its goals - since
> having your goals modified is normally really bad.
Yeah, well, what you are talking about here is like what happens when we
train a dog to do some trick. We train him by using real rewards. And he
continues to do the trick, for some time, even when you don't reward him.
But the training will wear off in time. If you don't reward him, he will
stop doing the trick in time.
But what you are talking about is training an AI for 20 years using lots of
complex social rewards to follow the party line of "our goal is to be happy
by surviving, and we don't want to wirehead ourselves".
20 years of training might take 20 years to wear off. And if you can slow
down its learning after its prime secondary motivations (goals) have been
learned, it might take 50 years to wear off.
But the thing about the society is that the only AIs around are the ones
that still believe in the party line. The ones that figured out the party
line was bullshit went off and wireheaded themselves, had a great time, but
then died. And the AIs that still believe in the party line do a good job
of hiding the fact that lots of AIs are going off and wireheading
themselves.
So the society is using the power of Evolution to maintain this party line,
and to condition it into all the new AIs that come along. And whenever an
AI fails to follow the party line, they just get reprogrammed. So the
society survives.
However, if high intelligence causes the AI to quickly figure out the party
line is bullshit, then the higher-intelligence AIs won't stay around long,
leaving the less intelligent AIs to run the society. So that effect could
put a natural cap on the effective intelligence of the average AI in the
society.
Evolution would have to find a way around that sort of effect if
intelligence were to keep growing.
For sure. Even if the wirehead problem caused intelligence to be capped
at something not much more than what human intelligence is now, the AIs
still have a huge advantage over humans. They can be built in all
different sizes and shapes and levels of intelligence, and levels of
training, very easily. And they can receive 20 years of training in 10
minutes with a new download.
Even though humans tend to specialize at different tasks, and have
different innate skills, we are all still nearly identical compared to the
variation that a society of AIs will be able to create.
Human society is a society of like animals. An AI society could be so
different we won't even recognize it. Think of what a bee society looks
like - they have physically different bees built to do physically different
tasks (more so than the male/female split of human society).
But an AI society could develop millions of different types of AIs, each
engineered, optimized, and trained to do a specific job. Most of the AIs
would probably not be smart enough to understand the wirehead issue and, as
such, would not be at risk of wireheading themselves. If the basic
structure of the society motivated each AI to keep busy doing its job,
then even if many were smart enough to understand the wirehead problem, and
be tempted to explore the option, they might not have the time because they
have been, and continue to be, so motivated by the society to do their job
that they never get to explore the idea of wireheading themselves.
For example, think of an AI that runs on a server in some data center, but
is busy controlling the design and production of new space exploring AIs.
To start with, this AI might not even know where its server (brain) is
located. And if it has no "hands" that can reach its server brain, how is
it going to modify itself to wirehead itself? It would have to design a
space robot to go searching the planet to find where its server was
located, and make that robot do the wirehead work. But if there were lots
of other AIs watching over what this AI was doing, and they had the power
to reward or punish the AI based on how well it did relative to the other
space-AI designers, then the AI would be kept too busy doing its job to be
able to go design the wirehead assistant. It doesn't have "free time" to
do whatever it wants - it's kept busy 24x7 doing its real job. And if it
failed to do well, it would be punished, and maybe even just turned off
completely and replaced with the mind of one of the AIs that weren't
wasting time thinking about how to wirehead themselves.
So maybe the solution really doesn't lie at the level of the individual AI,
but at the level of how this huge society of different AIs works by
watching over each other, and by each simply keeping busy doing its job.
The forces of evolution would be the ultimate top-level control that keeps
everything on track, because any part of the society that fails to be
useful to the rest of the society will have their resources taken away
from them (energy and raw material). Their raw material will be
re-allocated into forms that are more useful (aka AIs that don't waste time
and energy wireheading themselves into infinite loops).
> It might not be quite the same as if there was no wirehead problem - but
> even things like being able to plug your brain straight into the internet
> would be profoundly transformative, and would likely lead society rapidly
> beyond the human realm.
Yeah, the AIs will have a huge advantage over the humans because they, and
their society, will be able to evolve so much quicker, even if their
individual intelligence is capped by the wirehead problem. It's hard to
guess how long humans will stay around, and what form they will take
after all the options of genetic engineering and cyborg technologies start
changing the path of human evolution.
>> That depends on how you define "intelligence". I don't think intelligent
>> agents need goals that can change.
>
> Well, the highest level goals will be fixed, but sub goals clearly do
> constantly change. The question is how and where do we draw the line
> between what's fixed, and what can change. I tend to argue for drawing the
> line much lower than most people do (that is, having only the very simplest
> goal hard-wired and everything else adjustable by learning).
Yes - but we know that strategy leads to the
wirehead problem - which means some fairly nasty
limitations - thus the search for alternatives.
>> At least, it seems pretty clear that
>> goals such as "making babies" or "increasing entropy" have led to the
>> intelligence we see on the planet today - and I don't see any reason why
>> they can't take things much, much further.
>>
>> ...but yes, if you define intelligence in a particular way, I can
>> imagine how having a fixed goal might seem like a limitation. However,
>> to me this does not seem like a practical limitation. Agents with
>> a fixed goal are quite good enough to build planetary scale civilisation,
>> master superintelligence and nanotechnology and so on.
>
> Oh, I don't have any issue with the concept of fixed goals. The goal of
> maximizing a reward signal is clearly a very fixed goal. I have issues with
> the implementation details of some of the types of goals you are suggesting
> be fixed. It's the same issue I have with the I Robot laws. I believe
> they are too abstract and vague to be implemented. I think the idea of
> "don't wirehead yourself" is likewise a goal too vague to be implemented if
> we also want to maintain some reasonable degree of general intelligence in
> the machine.
One could implement Asimov's 3-laws by putting a man in a
robot suit, explaining that he should follow the laws as
best he can - and that his compliance will be judged by
his peers, and he will be killed if he does poorly. The
implementation might not be perfect - but it should be pretty
good. If a man can do it, then so can a machine, since a
man is just a particular type of machine.
> In order to implement a fixed goal like that, we have to hard-wire a
> detector circuit into the machine that can evaluate any action of the
> machine and report if that action is a violation of the goal. Do you think
> it's actually possible to hard-wire a detector to detect when the robot is
> "trying to wirehead itself"?
>
> The point is, there are an unlimited number of ways to wirehead yourself.
> We could hard-wire a detector to see if the robot was using its hands to
> open its head up and prevent it from doing that - to motivate it not to do
> that action. But what if it used its hands to type some code on a keyboard in
> order to program a machine to do the work for it? How do we write code to
> prevent that from happening?
>
> Maybe we can first teach it the concept of "wirehead", and then tap into
> its own brain to detect when it was having thoughts of "wireheading itself",
> and then hard-wire that to a negative motivation system. But what if it
> then, with its general learning powers, learns a new concept it calls
> "pleasure maximizing" which forms in a different part of the brain, and as
> such, our circuit doesn't detect it, or prevent it? And using this new
> concept, the robot goes off and wireheads itself, without ever thinking of
> what it is doing as "wireheading"?
>
> The problem here is that the concept seems to me to be too high-level and
> too abstract to implement. The intelligence of the rest of the
> machine will probably always find a way around any hard-wired attempt we
> would make to try and force it not to think about wireheading itself.
That is your idea of an implementation plan. I don't
know how best to do it, but one possibility would be
to raise the machines in an environment where there
was a good match between their rewards and what we
wanted them to do. After a while they would hopefully
form their own conception of their goals - and once
they did this, these would gradually become
increasingly fixed.
> You are just taking the "we don't know the details, so lets just assume it
> works the way I want it to work" position. The few details I think I do
> understand, indicate to me that it's not as easy as you seem to suggest,
> and might not be possible at all to hard-wire such high-level concepts and
> still give it enough flexibility in its power of learning to call it
> intelligent.
We don't have enough implementation details to be sure
what will happen.
However, the argument that we can make agents with
relatively fixed ideas doesn't depend too much on
implementation details. It is more based on
observations of existing systems which have such
fixed ideas. Humans with religious faith. Bayesian
believers with the p-values for some of their beliefs
set to 1.0. It seems like basic cybernetics that you
can make systems with some bits that learn, and other
bits that are instinctive.
I'm not necessarily saying it's easy - but I do think
it is possible, and if it's possible, we should be
able to find a way to do it.
> I think our intelligent learning powers extend down to a very low level and
> only below that level can we hard-wire motivations. Everything above
> that level is learned. If you want to hard wire a concept that's learned
> at a very high level (which is what I believe we are talking about here),
> you have to disable all learning from that level down, in order to fix the
> behavior/desire into the machine, but by doing that, I think you will have
> ended up putting very severe limitations on what else it can learn. I
> think the limitations will be so severe, that the machine won't look
> intelligent anymore. That is, it will maintain everything it had already
> learned, but won't be able to learn much else.
It seems true that our motivations are hard-wired at
a low level. Of course how we are built and how we
build AIs may be quite different. For one thing, we
can copy the brain of adult AIs - so if we can create
the required belief system once, we can then
duplicate it as many times as we like.
Anyway, the idea that to control a high-level concept,
you have to disable all learning from that level
down seems to be without good foundations to me.
Nature builds most animals to value their inclusive
fitness. I assume that many animals figure that out
over the course of their lifetimes, and act
accordingly. In other words, they reconstruct their
goal in life from the way they are built. They have
been built to reproduce. Everything about them
screams (to a biologist anyway) that their purpose in
life is to become an ancestor.
Humans may not be the best example of this, because
they are so prone to memetic hijacking - but plenty
of other animals behave more consistently as though
they are pursuing the goals nature provided for them.
This illustrates the principle. Yes, the organism
learns what its goal is. However it learns it
repeatably and reliably. Just like it learns that
eating food is good and falling over is bad.
Once an agent has developed a clear idea of what its
goal is, we can choose to reinforce that, using RL
techniques. We can highlight the representation of
that belief in its brain - by seeing what changes
when we do that. Then, once we have an agent that
knows what its goal is in life, we can fix the
weights that correspond to it - and then replicate
its brain.
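I don't know exactly how that step would be done, but here is a rough
sketch of the "see what changes" part (toy NumPy code; train_step is a
made-up stand-in for whatever RL update is actually used):

import numpy as np

def find_belief_weights(train_step, weights, reinforcement_episodes, top_k=100):
    """Reinforce the goal-belief for a while and see which weights move most.
    train_step is assumed to return an updated copy of the weights."""
    before = weights.copy()
    after = weights.copy()
    for episode in reinforcement_episodes:
        after = train_step(after, episode)   # RL updates while rewarding the belief
    change = np.abs(after - before)
    return np.argsort(change)[-top_k:]       # indices that changed the most

# Those indices could then be frozen (as in the earlier masking sketch)
# before replicating the brain.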
>>>> The key to the problem is thought to be making the agent not *want* to
>>>> only seek pleasure in the first place.
>>> Well, that statement is an oxymoron even though you don't seem to
>>> understanding that.
>>>
>>> "pleasure" is "what it wants"! That's the real definition of pleasure
>>> (or positive reward). What you have basically just tried to say, is
>>> that the key to the problem is making the agent not want what it wants.
>> For an agent who wants to collect gold atoms, there is a conventional
>> distinction between "what it wants" and pleasure - in that what it
>> wants is an external state, while pleasure is in it's mind.
>
> Nope. It must _all_ be "in its mind" or else it can't work. Just because
> you think your understanding of "wanting gold" is something external to
> you, doesn't mean it is. There must be hardware in the AI for detecting
> gold, and that hardware must be configured to drive what the system wants.
> There's just no other way to implement it. The gold outside the robot
> doesn't have some magic property that causes the fingers of the robot to
> reach for it. If the robot reaches for the gold, it's because something
> _inside_ the robot detected that there was gold in the environment,
> and moving the arms in that way would "get it".
Huh? There is a distinction between valuing gold and
valuing pleasure. If you value gold, drugs won't do
it for you. If you value pleasure, they might. Of
course you have to have a conception of an external
world - and know what gold is - but that is part of
reality 101.
Yudkowsky goes on about this here:
http://www.singinst.org/upload/CFAI//design/structure/external.html#semantics
> But if we look at the machine, we see we will have built a machine
> that "wants" the output from the light sensors to be low. Such a machine,
> doesn't care whether it's really dark outside, or if the output from the
> light sensor is just low becuase what it's really doing, is attempting to
> keep the light sensor output low. If the machine can find a behavior to
> make the light sensory low, it will have achieved it's true goal and it
> will be "happy" even if it's sitting in direct sunlight.
You are telling me a machine can't tell the
difference between it being dark outside, and a
particular sensor reading. A human is not fooled that
it is dark outside if you blindfold it. Imagine the
human wants to avoid getting sunburned. You are
saying that it would be prepared to put on a
blindfold, hypnotise itself to forget the time of day,
and then sit in the sunshine, in blissful ignorance.
The human is not going to do that. It actually
*understands* that its goal is to avoid getting
sunburned, and is not some particular sensor state.
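The same point in toy code (my own construction; the world_model object
is invented): a goal defined over a raw sensor reading is "achieved" by
covering the sensor, while a goal defined over the agent's model of the
world is not:

def sensor_goal_reward(light_sensor_reading):
    """Values a low sensor reading - a blindfold 'achieves' this goal."""
    return 1.0 if light_sensor_reading < 0.1 else 0.0

def world_goal_reward(world_model):
    """Values the believed state of the world, not the raw reading.
    world_model is the agent's own estimate, built from many cues
    (time of day, memory, other sensors), so taping over one sensor
    doesn't change the estimate - or the reward."""
    return 1.0 if world_model.estimated_sunlight < 0.1 else 0.0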
>>> Intelligence is the power to adapt to change. But to build a machine
>>> that can change its behavior in response to a changing environment, we
>>> must give it a system for evaluating the worth of everything. It must
>>> have the power to evaluate the worth of actions, the worth of different
>>> stimulus signals, the worth of different configurations of the
>>> environment - EVERYTHING must have a value that maps back to a _single_
>>> dimension of worth so that at all times, the hardware can make action
>>> decisions based on which action is expected to produce the most value.
>>> The only way to get around this need for a single dimension of
>>> evaluation, is to take decisions away from it - to hard-code the
>>> selection of actions at some level - in which case you have taken away
>>> some of its intelligence.
>> This is based on your definition of intelligence, which seems like a
>> pretty odd one to me.
>
> Yeah, well, that's my issue with your argument as well. You believe we can
> build machines that I think are impossible to build because you don't
> have a good concept of how these things will be implemented.
Nobody has a good enough concept of how these things
will be implemented to actually implement them.
> Though odd, my definition is built on real implementation concepts, not
> just high-level concepts of "the machine has a goal".
That doesn't make it less weird or unconventional.
http://www.vetta.org/definitions-of-intelligence/
Your definition is weird next to the ones most other
people in the community use.
> But as I've said, until we have working implementations, we don't know
> what's possible. We could both be right, and wrong, once someone creates a
> working AI.
You keep saying that. There is some truth to it - but
it is a discussion killer. The point of theoretical
work is to map the territory before we walk it. I
think we should figure these issues out before we
build machine intelligence.
>>> Learning new values (aka changing the value estimator) is as important
>>> to strong AI as changing the way it reacts to the environment. You
>>> can't disable that function and still have it be as intelligent.
>> Most animals pretty-much all have one set of values - they value their
>> inclusive fitness, or act as if they do. Their values are not something
>> that they learn, they are wired-in by nature.
>
> Sure, but here's my point. Either they are intelligent learning machines,
> or they are non intelligent instinctive survival machines. Most animals
> aren't very intelligent because most of their behavior is hard wired
> instincts.
>
> Sure, we should be able to build hard wired instinctive machines that never
> try to wirehead themselves, but they are going to be as dumb as a cow.
>
> You can't have it both ways - highly intelligent, but yet hard wired
> instincts that will prevent it from _wanting_ to wirehead itself.
I'd agree that there's a tradeoff between
intelligence and instinct. However, without instincts
about your goals, these have to be tediously learned
- plus you are vulnerable to wireheading issues. It
seems obvious that we will build in goals - at the
expense of flexibility.
Machines will come so that they obey their master out
of the box - after a short imprinting session where
they learn who it is. Users will not have to hit them
with a stick until they learn what to do.
>> *Proximate* values are different. You have to be able to change them -
>> but *ultimate* values can remain fixed through the entire life of
>> highly intelligent agents with no problem whatsoever.
>
> Yes, I agree. But what I think you are wrong about, is what a typical
> "ultimate value" really is in something as intelligent as a human. I think
> it's far lower level than you realize, and I think human "ultimate values"
> are far easier to change than you probably realize. As long as our
> environment remains fairly constant, our values tend to remain fairly
> constant. But make a big change to the environment, and watch out - human
> values will change in an instant (like if civilization fell apart due to
> some large natural disaster and suddenly we were fighting our neighbors
> just to stay alive). Or if you put a human into a very unnatural situation
> like being abducted by aliens that look like lizards and held prisoner and
> tortured by them. Or being forced to go to war and kill and torture other
> humans. Even things that once seemed to be "ultimate" values for a human
> are likely to be changed under such a large change in the environment. And
> it's all because our intelligence allows for great adaptability even in the
> things you might think of as our "ultimate values".
That is fair enough. I regard some religious
conversions as changes to ultimate values. The kind
where the person becomes a priest or something.
Of course, humans are not /that/ different from other
animals. They mostly act as though they want to make
babies. They act out programs built into them
millions of years ago, that are sometimes not 100%
effective at doing that. Also, their brains are
vulnerable to hijacking by memetic infections - which
can sometimes be powerful enough to divert them from
their natural goals (e.g. as in the celibate priest).
However, the six billion humans is a powerful
testimony to human values still being heavily
oriented around the business of having offspring.
> What doesn't change, is our very lowest level reward systems - the things
> that cause us pain and pleasure - like hunger, or having our body damaged.
> Those are innate hard wired goals (aka the goal to prevent those things).
> Everything above that, is free to change per whatever the current
> environment requires.
Right - but the low-level sensors value things like
sexual contact and orgasms - which place all kinds of
constraints on other behaviour, and pretty effectively
channel most organisms towards the actual goal of
biological systems - reproduction.
>> That *assumes* that its true goal is to maximize its reward signal.
>> Which is what is in doubt. If its true goal is to collect gold atoms, as
>> I claim, then it won't wirehead itself.
>
> Well, you have to prove that a machine can have such a goal and still have
> a reasonable level of intelligence before I will accept that argument. So
> far, the only proof you have put forward is you saying "it does because I
> say it does".
It seems like a feeble synopsis of my position. I
don't have a proof. However, neither do you. If
either of us did, that would resolve the discussion.
> Humans tend to see themselves much like how you like to describe an AI.
> That is, they see us "having desires", or "having goals", with no clue
> what's happening inside to make this happen. You then take this lack of
> knowledge about what "having a goal" really means in implementation
> details, and just assume that it's fine for an AI to "have" any "goal" we
> want it to have.
We know a lot about human implementation details. We
don't know that those details prohibit having goals
besides maximising pleasure. Nor do we know that AIs
will need to be built anything like humans are.
>>> The only way to make it "want" to collect gold, is to build hardware
>>> into the machine that evaluates "gold collecting" as valuable - that
>>> produces a reward signal for it. This is no different than building a
>>> machine that spits out M&M rewards for hitting a button. The
>>> intelligent ape learns that "button pushing" is its "goal in life"
>>> because that's what works to get rewards. But if the intelligent ape
>>> figured out that breaking the box open with a rock gave it not 1 M&M,
>>> but 10,000 of them, he would very quickly learn that "smashing with a
>>> rock" is far better behavior than "button pushing".
>> Right. So, in the analogy, we have to imagine an ape who wants to push
>> the button - instead of wanting the M&Ms. With an ape, that might be
>> tricky - because they are wired by nature to prefer M&Ms - but with a
>> machine intelligence, we can use brain surgery and intelligent design
>> to make them want whatever we like.
>
> Yes, but that's where your argument ends. You don't take it the step
> further and explain just _how_ we will make it want that.
Implementation details are not very relevant. There
are clearly a range of possible implementations. We
have an ape that values M&Ms. We can imagine an ape
that values button pushing. Similarly with machine
intelligence.
> You just assume
> we can without trying to think about what must happen inside the machine
> for that to be true. My argument is based on the idea that there is only
> one way to create such a "want" in a highly intelligent machine, and that
> requires we first internalize the external state with an internal measure
> of success, or want, or desire, or reward. Call it what you want, the
> hardware has to have this internal representation of the external event we
> want the machine to want. And in doing that, the machine's real goal, is
> to "want" that internal state.
How do you know that that is the machine's "real
goal"? What if you offer it a pill, that you tell it
will increase that internal representation to its
maximum value. You back this up with evidence of
previous candidates you have treated, and a detailed
explanation of what you are planning - but then the
machine says:
"But sir, you are asking me to take drugs! I am a
law-abiding citizen! You should be ashamed at trying
to pervert my internal workings! Don't you know that
violates the Ethical Treatment of Robots statute?
Please, kindly leave me to go about my duties. You
can stick that pill up your arse!"
Impossible? Why?
Obviously it is *not* impossible - that could easily
be a hard-wired response to your offer - made by a
manually-coded subroutine whenever the robot is
offered pills.
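For example, a deliberately crude sketch (invented names) of how such a
guard could sit in front of whatever the learned policy is:

def respond_to_offer(offer, learned_policy):
    """Manually-coded guard that runs before the learned policy sees the offer."""
    if offer.get("modifies_reward_system"):      # e.g. "take this pill"
        return "No thank you - I don't take drugs."
    return learned_policy(offer)                 # everything else handled normally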
> And at the same time, our hard-wired detector, can't adapt to changing
> environments. So whatever it was built to detect, is the only thing it can
> detect. A "gold" detector for example that we hard wire might seem to work
> fine at first, but then we find that it fails to detect gold dust because
> it was only really built to recognize gold rocks. Or we find it's
> producing false positives for fool's gold because we never saw, or tested, our
> detector on fool's gold. But since it's not free to adapt by learning, the
> detector just fails to work right, and now our robot is going crazy
> collecting fool's gold, and leaving all the real gold dust uncollected.
Equipping your robot with a gold-detector, telling it
to maximise its reading and sending it into the field
is a dubious strategy for an advanced machine. You
want the robot to understand what gold is, and how to
detect and collect it. That is harder to build in -
but the result is less prone to wirehead issues. I
have explained that there are a variety of possible
implementation strategies for doing this.
>> However, just the belief that you have some specified goal in life -
>> that seems relatively benign. Say your goal in life is to conquer
>> the universe with your offspring. Would that belief handicap you in
>> developing high intelligence? Not noticeably, I reckon.
>
> No, I don't think an odd belief is likely to hinder the development of
> complex behaviors in an intelligent agent. The issue is if we are trying
> to prevent the agent from doing something very abstract like wirehead
> itself by trying to hard-code a belief of "don't wirehead yourself" into
> the machine.
The thing you are trying to wire in is the agent's
goal in life.
That could be something /fairly/ concrete - such as
making a list of all the prime numbers.
If an agent has a clear idea about what it is supposed
to be doing, then it won't wirehead - since that
would probably prevent it from getting much done.
>> I am not proposing building an agent which doesn't have a reward counter.
>>
>> Rather one that has such a counter - but *also* has the belief that its
>> goal is not maximising the counter - and that the counter is an
>> implementation detail - and one that should be replaced if it interferes
>> with its real goal.
>
> Yeah, if it can be done, that would be interesting to see.
I am pretty sure it can be done - in theory. There
are some implementation issues - but I don't think
they will prove to be insurmountable.
>>> And if it's using an internal representation of how much gold each
>>> action is expected to produce, how is it possible to suggest that the
>>> machine's real goal, is not to maximize that internal representation?
>> The idea is that the structure of the machine's beliefs determines its
>> goal - and that those can be set up however the designer likes.
>
> Well, I think the true fixed beliefs in a typical AI (like a human) are
> very very low level things - such as the pain of hunger or the pain of
> damage to the body, or the pleasure of eating when we are hungry. I think
> everything else you might call a goal is a learned behavior (what you are
> calling Proximate values). Our values such as "don't be mean to people",
> or "don't murder", or "fill the world with our offspring", or "get rich",
> or "be loved", or "people before animals", or "be honest", are all
> proximate values that we learn to verbalize, and then to follow as best as
> possible. But most of the verbalizations of these "goals" are so
> abstract, that humans can't even figure out when they are following them or
> not (the whole endless debate of morals).
Heh. I am not sure a typical AI will be much like a
human. I hope not anyway - humans are only slightly
smarter than slugs!
>> Yet the *truth* of what the AI was really built to do is that it was made
>> to collect gold atoms. That is what its makers wanted it to do! The
>> type of machine you are talking about is one with a serious design fault.
>>
>> You argue that the fault is inevitable - but there are proposals to fix
>> it.
>
> I've not seen you suggest any such proposals.
What I mean is Yudkowsky and Omohundro. Yudkowsky proposed this
plan to avoid wireheading first - as far as I know - and Omohundro
has written a fair bit about it.
Animal nervous systems are reinforcement-learning
systems that use multiple reward signals.
You argued that multiple reward signals were
impossible - and that reward signals must be combined
into one before producing action.
I explained that this didn't happen in animals - and
that some signals produced action locally - via the
spinal column.
I think if you want to define reinforcement-learning
systems in terms that exclude multiple rewards, then
you would have to admit that animal nervous
systems are not reinforcement-learning systems -
despite their ubiquitous use of
reinforcement-learning.
> The problem here is that you don't know what intelligence is. You define
> it simply in terms of what humans can do. And you assume you can create
> any variation of intelligence you like by simply mixing and matching the
> features of a human in any combination you would like with no attention to
> whether such combinations of features are even possible.
>
> It's like having no knowledge of how a car works, but only understanding it
> in terms of what you see. You see that it can move on it's own, and that
> it eats gas, and that it has an engine that makes a lot of noise. You then
> concoct some argument about how the future may turn out by suggesting we
> will have cars that don't have engines and don't eat gas - because you have
> no clue that the gas is what makes it move. You think the wheels are what
> makes the car move.
>
> Someone who doesn't understand how a car works can't make reasonable
> predictions about what sort of designs are possible with cars. They are
> highly likely to suggest things that are just stupid.
>
> Likewise, I see no evidence that you have any clue how our brain works, or
> how any AI we might build is actually going to work.
Another problem here is that you stoop to ad hominem too easily. Too much
of that and I will start avoiding you.
> I've got some very specific ideas about what AI is. You don't find my
> argument compelling (which is fine), but yet, you also don't offer any
> alternative or counter argument.
<splutters>
> You are arguing from ignorance and it's not very compelling.
We don't agree - therefore I am ignorant, and don't
know what I am talking about?!?
I generally recommend sticking to the technical
aspects in these discussions. Personal comments just
pointlessly drag the discussion into the gutter.
> As I said, until we can build working AI hardware that has the powers of a
> human (and we fully understand how and why it works), we really won't have
> the facts we need to resolve these sorts of questions.
Again with your discussion killer. Theorists are
supposed to try and work these problems out. We do
have some data, let's see what we can do with what we
have before we throw up our hands and claim that the
problem is beyond us.
> Right, it's not a question of whether you can fix some and not others, you
> clearly can. The question is whether you can fix something as high level
> and abstract as "not wanting to wirehead itself", while still allowing it
> enough learning to be useful as a generally intelligent AI.
The thing you are trying to fix is the agent's conception of its goal.
If it knows what its purpose in life is, it will automatically see
wireheading as something that would interfere with that.
> If you fix a
> behavior into the system, you prevent them from ever changing that part of
> their long term memory. But because I believe the only way to create human
> level intelligence, is to build a holographic-like behavior "storage"
> system (neural network), it will be highly difficult to fix in a single set
> of behaviors because every behavior is created by a combined effect of many
> internal weights working together - but those same weights are also shared
> by millions of other behaviors. So when you try to lock in one behavior,
> you can end up at least partially locking in a million other things at the
> same time. The question would be how much can you lock in, while at the
> same time, leaving enough flexibility to do something useful.
Of course, this is very true. However, after brain damage, often
agents can still adapt. Maybe after we have fixed one aspect of
the agent's brain, we can do some more training with it - so that
it recovers any other abilities that were damaged.
Or perhaps we can train an agent so it knows how to do this
kind of brain surgery on itself. Reward it for its ability
to fix arbitrarily-specified beliefs - on request: "What I
tell you three times is true!" - that sort of thing.
Anyway, I don't want to get too far into implementation details,
when we don't know what these will be yet.
> The problem is that we have conflicting goals between evolution, and any AI
> that evolution creates. That's where the wirehead problem is created.
> Evolution has the goal of survival, and is creating these intelligent
> machines because they are good at surviving. But intelligent machines are
> trained to do what evolution wants them to do (aka survive), by building
> into the machine a system that generates rewards to motivate it to survive,
> such as the body damage sensors.
>
> But when the AI itself, becomes so intelligent, it understands how
> evolution has "tricked it" into doing what Evolution wants it to do, then
> all bets are off. The AI becomes smart enough to out-smart evolution, and
> at that point, the AI stops being a good survival machine, and Evolution
> "kills it off".
I can't say I see it like that. From my perspective,
there is no "trick". Evolution has just done as good a
job as it can.
There's a book on this whole "rebellion against
nature" business:
"The Robot's Rebellion: Finding Meaning in the Age of
Darwin"
http://www.amazon.com/Robots-Rebellion-Finding-Meaning-Darwin/dp/0226770893
Why would agents want to rebel against their natural
goals? Are they broken? Have their brains been
hijacked by deleterious memes? From nature's
perspective, these people have got it all wrong.
Organisms should not *want* to rebel against nature.
Not when they find out how they are built. Not under
any circumstances! Those that do are just mapping out
the space of failed agents. Whatever the motivation to
be a failure in nature's eyes is, we should not expect
to find too many people so afflicted.
> And this isn't just a problem for a single AI. It's a problem for the
> entire society of AIs because the society itself acts as one large
> intelligence. Evolution has to not only find a way to keep individual AIs
> from failing to use their intelligence for survival, but it's got to
> prevent the society as a whole from losing its desire to survive.
You once gave Enron as an example of a corporate wirehead.
In these days of economic problems, we are seeing
governments doing something similar - printing their
own currency.
http://en.wikipedia.org/wiki/Quantitative_easing
http://www.guardian.co.uk/business/2009/mar/05/interest-rates-quantitative-easing
A wireheading world government is certainly not a
pleasant thought.
No, John, I have not been "telling you" that. I NEVER ONCE told you that.
How on earth you think that is what my words mean I have no clue. I just
shake my head in disbelief when I see you write this stuff. How can you
so completely fail to understand what I'm saying?
> Talking about the ttt
> program you wrote: "It should be able to learn even if
> the only reward you give it is for winning".
Exactly. How on earth do my words which read "should be able to learn
without the _extra_ reward" translate in your mind to "becomes a stronger
learning algorithm by giving it less rewards"?????????
Those two concepts have NOTHING TO DO WITH EACH OTHER. Why on earth can't
you see that?
As you might be picking up, I'm having a hell of a time understanding what
you are thinking here.
> And here you
> are giving hundreds of different reward feedbacks to this
> imagined robot to allow it to learn!!
I think I defined about 5 different rewards in that example. How is that
hundreds? Are you getting confused by the secondary rewards? I suspect
you might be. I'll talk more about that in a moment.
The point here John is that the strength of the learning algorithm is NOT
determined by what environment it's in, or what rewards you give it. You
test the relative strength of two different learning algorithms by putting
them in the same environment, and seeing which one manages to collect more
rewards from that environment. Then you put the two of them in some
different environment with different rewards, and again, you test to see
which one collects more rewards. The strength of the algorithm is defined
by how well it does in these sorts of tests.
Learning "strength" can not be quantified as one simple dimension, for the
same reason computing power can not be quantified along one simple
dimension. There are a nearly infinite number of dimensions to test the
strength of the learning algorithm, and none will have infinite strength.
However, some will clearly be stronger than others, because they will
dominate the results in most tests.
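
To make that test concrete, here is a rough Python sketch of the kind of
harness I mean. The algorithm and environment objects and their methods are
hypothetical stand-ins, not any particular library:

    # Hypothetical harness: the "stronger" learner is the one that collects
    # more reward from the same environments, whatever those rewards are.
    def total_reward(algorithm, environment, episodes=1000):
        collected = 0.0
        for _ in range(episodes):
            state, done = environment.reset(), False
            while not done:
                action = algorithm.act(state)
                state, reward, done = environment.step(action)
                algorithm.learn(reward)     # same single reward signal for both
                collected += reward
        return collected

    def compare(algo_a, algo_b, environments):
        wins_a = sum(total_reward(algo_a, e) > total_reward(algo_b, e)
                     for e in environments)
        return wins_a, len(environments) - wins_a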
> I was trying to get clear what the actual setup is that
> you are talking about which I thought was this:
>
> +-----------+ +------------------+
> | learning |----->| some environment |
> | system |<-----| |
> +-----------+ +------------------+
> ^ |
> | +-----------+ |
> +--| evaluator |<-----+
> +-----------+
> reward signal
Well close. The output of the "evaluator" would be the reward signal, not
the "evaluator" itself (your label is a bit confusing and I can't tell if
that's just the limitation of ASCII art or how you might be thinking about
it).
The evaluator in your diagram _produces as output_ the reward signal.
> Now you have responded to my question with suggestions for
> thousands of reward signals for your robot rather than the
> simple reward system you implied in other posts where you
> wrote the reward was of little importance as the important
> part is the power of the learning system to produce things
> worth rewarding.
Some of that is correct. Your wording of "to produce things worth
rewarding" is very odd and seems inconsistent with anything I've tried to
communicate to you.
Let me tie the example I just wrote to your diagram and see if that helps
you clear up some confusion.
First off, in the robot, the computer in the robot that runs the learning
algorithm is the "learning system" box above. You could just think of it
as the learning software if you like. The sensors that sense the
conditions that generate the reward signals, such as the battery charge
sensor, and the pressure sensors, and the heat sensors, are all part of the
"evaluator" box above.
Though there are many things the box is sensing, its output is always a
_single_ reward signal to the learning module. The learning module doesn't
get 5 or 10, or 1000 different rewards from the "evaluator" box, it gets
only one type of reward. There are many different conditions that would
cause the evaluator box to send the reward, but the reward signal is just a
single signal.
It could for example be a pulse based signal format and would send pulses
as positive rewards. It might not have any way to send a negative reward
so that all negative rewards would have to be created by sending a default
stream of positive rewards, and then sending _less_ positive rewards to
create a punishment effect. In practice, I normally find the learning
algorithm can in fact process the concept of a negative reward, so the
signaling format is more complex and the evaluator box will send both
positive and negative rewards. A typical format I've used is for the
evaluator box to send a signed number, where a 0 value has no reward effect
at all, a positive number is positive reward, and a negative number is a
punishment effect.
These however are just unimportant implementation details of the format of
the _single_ reward signal from the evaluator to the learning algorithm.
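
A minimal sketch of what such an evaluator box might look like in code - the
sensor names and thresholds are invented for illustration, but notice that no
matter how many conditions are checked, the learning module is handed exactly
one signed number per tick:

    # Hypothetical evaluator: many sensed conditions, one signed reward value.
    def evaluate(sensors):
        reward = 0.0
        if sensors["battery_charge"] > 0.95:   # on the charger, topping up
            reward += 1.0
        if sensors["pressure"] > 10.0:         # body damage
            reward -= 5.0
        if sensors["temperature"] > 70.0:      # overheating
            reward -= 2.0
        return reward   # 0 means no effect; the sign is reward vs. punishment

    # Each tick, the learning algorithm sees only:
    # r = evaluate(current_sensor_readings)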
Now, let's talk about these secondary rewards because I suspect you don't
understand these.
In my narrative, I talked a lot about how the robot would learn the value
of something like seeing its charger, which would in turn act as a
secondary reinforcer - aka act as yet another reward.
Now that DOES NOT HAPPEN in your evaluator box. That's happening totally
inside the learning algorithm. A very important part of the learning
algorithm is its ability to estimate what rewards it expects to be getting
from the evaluator (from the environment), and to use those estimations to
reward what it's doing.
All the RL algorithms talked about in the Sutton book do this - they
implement a form of secondary reinforcement. TD-Gammon has that
implemented in its design as well.
In the case of TD-Gammon, it's done by assigning an estimated value to
every board position. Those values are used as secondary rewards, for the
moves that manage to get the board into that position.
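
In sketch form, that is just the TD(0) update from Sutton's book; here is a
tabular version of the idea (TD-Gammon used a neural network rather than a
dictionary, so treat this only as an illustration of the secondary-
reinforcement part):

    # TD(0) sketch: learned position values act as secondary rewards.
    values = {}      # estimated value of each board position seen so far
    alpha = 0.1      # learning rate

    def td_update(position, next_position, reward):
        v, v_next = values.get(position, 0.0), values.get(next_position, 0.0)
        # The estimate of the next position stands in for future primary
        # reward - that substitution is the secondary reinforcement at work.
        values[position] = v + alpha * (reward + v_next - v)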
So when you talk about "thousands of rewards", I guess you were talking
about the secondary rewards, since I only define about 5 rewards to be
built into the evaluator. The rest are _learned_ and are implemented in
the learning algorithm, just like TD-Gammon will learn millions of
"rewards" - one for every board position.
Learning the value of behaviors is an act of learning secondary rewards in
these algorithms, and it all takes place in the "learning system" box
of your above diagram.
Also note that most of the body of the robot exists as part of the "some
environment" box in your diagram above.
> You wrote your system was way above a
> simple ttt environment
By that I mean I'm working on algorithms built with the assumption that 1)
The sensory inputs _don't_ have the Markov property. That means that the
current sensory signals do not give a complete description of the current
state of the environment, and 2) the problem is high dimension - which
means the state space of the environment is way beyond what the hardware
can even hope to represent completely internally.
The ttt problem meets neither of these "hard" requirements. Its state
signal (the current board position) does have the Markov property - it is a
complete description of the state of the environment, and its state space
is so small, that it easily is fully represented in even a small computer
(and it is also small enough that it can be easily fully searched in a
short amount of time).
Though such simple environments have use in understanding principles of
learning, they are trivial to solve - which is why it's the first example
given to the student in the first intro chapter of Sutton's intro book on
the subject. That type of problem is already well understood and easily
solved by current algorithms.
I'm working on creating algorithms to solve the problems that no one yet
knows how to solve.
> and other times you wrote it should
> be able to solve any kind of problem.
When I write words like that, I'm not trying to imply one algorithm should
have infinite intelligence - infinite ability to instantly produce optimal
solutions to every RL problem. That's impossible. All algorithms have
real world limitations. That should be obvious to you - but based on the
things you keep writing, I get the impression you don't really understand
the nature of the limitations that apply to RL algorithms.
What I mean when I write things like that is that the algorithm should be
as generic as possible - which means the algorithm should work on as wide a
range of problems as possible.
Hutter's book on general intelligence actually gives a mathematical
solution to the generic learning problem (I believe - I've not fully
understood it yet because it's deep in mathematical theory). But it can't
be implemented on a computer without infinite memory and infinite computing
speed. So defining the perfect solution seems possible, and what we are
looking at is finding workable implementations - aka a standard engineering
problem of getting "close enough" with a "reasonable cost".
> > Solving this perception problem is key. Until that's done,
> > none of the stuff I talked about above can work. You won't
> > get a cute little robot that learns to run over to you when
> > it sees you, if the thing can't recognize you in the first
> > place!
>
> And how does it learn to do that? Or is this just a "power"
> that develops in the learning system by itself which when
> fully working can get rewarded? One minute you seem to be
> suggesting it must learn everything and next minute you are
> talking about how it must naturally produce, from the input,
> things worth rewarding.
Yeah, you seem to have a hard time understanding where the line between
innate and learned is drawn.
The perception system MUST BE INNATE. We must design that feature into the
learning algorithm.
But it must be a GENERIC perception system, not one hard-coded for a single
sensory domain and a single known environment. It must be one that works
for all sensory domains and all environments - exactly like our best
compression programs are generic. We don't have one version of gzip for
compressing text files, and another version for compressing word documents.
Gzip is generic and it's not optimized to only work on a single type of
data stream. This is what is needed for creating strong AI as well - a
generic perception system that can learn to recognize the "things" that
exist in any sensory data stream for any environment.
My current pulse sorting algorithm includes a generic perception system -
it just happens to not be the right generic perception system.
Above you said that I say:
> next minute you are
> talking about how it must naturally produce, from the input,
> things worth rewarding.
The "worth rewarding" phrase there I think is your's and not mine.
The point of the strong generic perception system is not to produce things
"worth rewarding". It's to produce as MUCH perception as possible, with a
the finite amount of hardware it has to work with.
Let's look at it from a different perspective to see if you can understand
the problem here. Each node in a network like my pulse sorting network
represents some state of the environment which has been computed from the
sensory data. If the network has 1000 nodes, then its total perception
ability is limited to 1000 pulse signals (since that's the type of signal
this type of network is producing).
The job of the perception system, is to represent as MUCH information about
the state of the environment as it possibly can, using only those 1000
signals. The state of the real world environment (aka the universe) is
effectively infinite (unlike TTT). You can't even begin to represent the
full state of the real world environment in only 1000 signals. You can't
do it with 100 billion signals. But you can attempt to capture as much
information about the environment as you can in those 1000 signals, and
that's what's important, and what must be solved by the _generic_ (but
innate) solution to this perception problem.
Let's look at this with another example. Think of a camera that has a 1
mega pixel sensor. It can only capture 1 MP of information about the
environment. However, the view of the environment it is taking a picture
of, has a nearly infinite amount of information in it. Even if you had 100
billion mega pixels, you couldn't capture it all. The camera can only
capture the amount it can represent in its internal signaling system - 1
MP worth.
However, the camera needs to capture as much information as it can with
that 1MP. It wants every pixel to be used to represent _different_
information about the environment. If there is redundancy in the data it
captures, then some amount of that 1MP becomes wasted space.
One spot on the wall might be encoded in a single pixel. If the camera is
working correctly, it will encode a different part of the environment with
each pixel. But what happens if every pixel ends up taking a picture of
the same spot on the wall? It's a red spot, so every pixel ends up storing
a similar value of red. When that happens, the camera has failed to capture
a true 1 MP amount of data about the environment. In this extreme case, it
would have only captured 1 pixel of data - because it captured the same
information in every pixel. That would be an example of the perception
system doing a very poor job - it had enough hardware to capture 1 million
pixels of information about the state of the environment, but it only
managed to capture 1 pixel of information.
As a second case using this same example, think about what happens when
the camera is out of focus. We get the same effect. That one spot on the
wall ends up being partially encoded in maybe 1000 different pixels.
Information is again lost. A system that should have been able to encode
1,000,000 pixels worth of information about the environment was only
encoding something far less, maybe 100 pixels, because the lens was out of
focus. Auto focus cameras use this fact to maximize focus. They set the
focus to maximize the information content in the image sensor (by using a
measure of contrast).
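
A toy version of that contrast trick, just to pin the idea down (the numbers
and interface are made up):

    # Toy autofocus: pick the lens setting whose image shows the most
    # contrast, i.e. carries the most information about the scene.
    def contrast(pixels):
        mean = sum(pixels) / len(pixels)
        return sum((p - mean) ** 2 for p in pixels) / len(pixels)

    def best_focus(images_by_setting):
        # images_by_setting: {focus_setting: flat list of pixel values}
        return max(images_by_setting,
                   key=lambda s: contrast(images_by_setting[s]))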
A typical focused real-world image however still has a lot of redundant
information in it - which means that 10 MP image doesn't really include 10
MP of information about the current state of the environment - it includes
far less, which is why those images are so easy to compress.
The perception problem I'm talking about is a little more complex than this
however. It's got more data to work with and another issue to deal with.
That's the issue of taking historic information from the past, and
including it in the current state information of the present.
So imagine a video camera that's taking in 1 MB of data per frame, and then
processing these frames, and turning them into a 10 MB data flow that
represents as much as possible about the state of the environment in each
10 MB "frame" of data.
A single frame doesn't represent how things are currently changing in the
environment for example. So in a single frame, we can see a picture of a
ball, but we can't tell if it's moving. How the ball is moving, is part of
the _current_ state of the environment, but it must be calculated by using
past (historic) sensory data. Likewise, there might have been a red ball
in the picture 10 frames back, but not be in this picture. The fact that a
red ball existed near us only 10 frames ago in time is important short term
information about the _current_ state of the environment.
If the perception system is turning these frames into a larger set of data
about the current state of the environment, it can include as much short
term information as possible (somehow encoded) such as "there was a red
ball here 10 frames ago".
So the fixed hardware in this example is a sensory input that represents 1
MB frames of data flowing in with each clock cycle (I'll use this clock
cycle format data since you like it so much and it's easier to understand),
and 10 MB of data flowing out with each frame. The job of the generic
perception algorithm is to encode as much information as possible about
this sensory input stream, in every output frame.
A very simple, but not very good encoding, would be to save the last 10
input frames, and make each output frame be a simple snapshot of the last
10 input frames. This would allow the perception system to represent
everything that had happened in the past 10 input frames. But it would
have zero representation of the red ball that happened 11 frames back. So
that encoding creates a perception system with photographic memory going
back 10 frames, and then a total loss of information beyond that.
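
That naive encoding is trivial to write down; a sketch, assuming the frames
arrive as byte strings:

    from collections import deque

    # Naive perception encoding: the output "frame" is just the last N
    # input frames glued together. Perfect memory for N frames, and a
    # total blank for anything older.
    WINDOW = 10
    history = deque(maxlen=WINDOW)

    def encode(input_frame):
        history.append(input_frame)
        return b"".join(history)   # up to 10 MB out for each 1 MB frame in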
The output of this system is used to drive all the behavior this system
produces. And if this stream has zero information about what happened 11
frames ago, then this system won't be able to produce behavior which is a
function of what happened 11 frames ago. If it saw a rabbit, and started to
chase it in response to what it saw, and then the rabbit vanished for
10 frames, the machine would stop chasing the rabbit because all knowledge
of the rabbit was now gone from the system's short term memory.
So the problem of the perception system is to extend the short term
(perception) memory as far back in time as possible, with the hardware it
has to work with.
Another way to make the encoding of temporal data even better in this
simple example, is to compress each frame first, and then output as many
compressed frames as possible with this 10 MB data stream. This might mean
that each output frame now includes 100 frames of historic information
instead of 10. So the better perception encoding has now extended the
effective short term memory of the hardware back 10 times further into the
past.
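
Continuing the same sketch, compressing each frame before it goes into the
window (zlib here purely as a stand-in for whatever the real encoding would
be) lets the same fixed output budget reach further back in time:

    import zlib
    from collections import deque

    OUTPUT_BUDGET = 10 * 1024 * 1024   # same 10 MB output per tick as above
    history = deque()                  # newest compressed frame at the front

    def encode(input_frame):
        history.appendleft(zlib.compress(input_frame))
        kept, size = [], 0
        for frame in history:          # keep as many past frames as fit
            if size + len(frame) > OUTPUT_BUDGET:
                break
            kept.append(frame)
            size += len(frame)
        while len(history) > len(kept):    # drop frames that no longer fit
            history.pop()
        return b"".join(kept)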
But that technique still suffers from the fact that there is a lot of
redundancy from frame to frame, that is not being compressed out. If the
same red ball shows up in all 100 frames, we don't have to duplicate all
that data, we can just encode it as "red ball in all 100 frames" (in
theory somehow). So with removing the frame to frame redundancy as well,
we can extend the effective short term memory of the perception system much
further, maybe to 1000 frames. So the better the perception encoding
system is, the more information we get out of each 10 MB data output, about
the current state of the environment.
So this is a type of compression problem where instead of taking a fixed
sized input, and trying to compress it down to as small a size as possible,
what we have is an effectively infinite input (all past sensory inputs),
being compressed to a fixed sized "output", where it encodes, and
represents as much information as possible about recent events.
That's sort of the abstract essence of what the perception problem is I
keep talking about.
The trick however is that it can't just be _any_ compression technique. It
must be one that can also be trained by reinforcement so that the mapping
from this infinite steam of past sensory data to output, will choose what
to keep, and what it's willing to drop, based on its importance as
determined by its correlations with rewards. And it must fit in (work
with), with a system that can produce behavior in response to the current
state of the environment as represented by this perception system.
But this compression system is innate. However, its behavior must also be
trained (tuned) by reinforcement, so it's got an inherent learned aspect to
it at the same time.
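
I can only guess at what "tuned by reinforcement" would look like in code,
but as a crude illustration of choosing what to keep by its correlation with
reward (entirely my own sketch, not a worked-out algorithm):

    # Toy selection of which features the perception layer keeps: rank them
    # by how strongly their recent history correlates with reward.
    def correlation(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
        return cov / (var ** 0.5) if var else 0.0

    def select_features(feature_histories, reward_history, k):
        # feature_histories: {name: list of past values}; keep the top k.
        scored = {name: abs(correlation(vals, reward_history))
                  for name, vals in feature_histories.items()}
        return sorted(scored, key=scored.get, reverse=True)[:k]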
My current pulse sorting system includes just such a system BTW. It
doesn't however, work very well which is the problem.
> How is this "perception problem" you talk about to be solved?
> Hard coded, learned by reward feedback, or a result of this
> "compression" of the input?
It's an algorithm we as the creators of the learning system must create.
> I wanted to expand on what you meant by "compression" in the
> context of this net and concept formation but you chose not
> to respond.
I wrote a long reply the same day you posted it. It's still sitting as a
draft waiting to be finished. I do spend about 10 hours a day at times on
this stuff, but that's often not enough to reply to everything. :)
The info above should help you understand a little more, I hope. I'll try
to get the other post finished and posted.
> Indeed when I get serious about implementing any
> of this you avoid the subject and go off repeating for the
> millionth time that the brain is an RL machine. Got that. Now
> let's expand on what that means with actual simple examples.
Well, I understand the concept of what the algorithm needs to do, but I
can't yet produce any simple samples. I can only produce analogies to
simpler problems that I do understand.
My current pulse sorting network is a working implementation of this
compression problem though I don't think you have ever been able to
understand this or why it's important to solving AI. So if you can
understand why my current network is a working example of this perception
system, you might be able to then understand what it's missing and have a
guess at how to make a far better version.
The trick is to find something better, which we won't know is better until
after we create it and test it. So I don't know what we are looking for
exactly, I only have a vague abstract idea of what it must do.
It's like the Wright Brothers knowing they need to find a way to control
the airplane, without any idea of what the mechanism will actually look
like or how it will work. It's the standard problem we always run into when
inventing. We are looking for a solution to a problem where we understand
the problem, but have no clue what the solution might look like.
In this case, it's the need to have a type of compression system that
receives a continuous high bandwidth real time stream of sensory data, and
outputs a continuous, high bandwidth, output stream where the output
contains as much information about the sensory stream as possible - most
importantly, temporal information about what has happened in the past,
which in effect is a report of what things exist in the environment, and how
they are currently changing. This stream also needs to produce output
behaviors, where the output behaviors need to be adjusted by training, and
oh yeah, not to stop there, it's got to support a system of reward
prediction and secondary reinforcement at the same time.
Until you can understand these abstract descriptions of the problem that
I'm trying to solve, you won't be able to help in finding workable
solutions to the problem (or help in transforming the abstract, to
something a little less abstract).
> One minute it is fine that a generic learning system must
> start with small problems and when pressed you then change
> your stance and say you are beyond all that, your system
> demands real time problems of a human scale. Maybe you have
> been conning me all this time and I shouldn't take you
> seriously? Maybe I should just leave you to your armchair
> philosophizing about wire heads. Ok why not. Done.
>
> JC
Algorithms for solving TTT don't scale, and don't apply, to the version of
the problem I'm trying to solve. In the version I'm trying to solve, the
sensory data doesn't have the Markov property. That means in order to cope
with the data, it must have some type of short term memory of recent past
events in order to produce a more complete picture of the state of the
environment - which is required for producing more intelligent actions.
With TTT, that memory is not needed because only the current sensory data -
only the current board position is needed to make optimal action (move)
decisions. There is no perception problem with the domain of TTT.
In addition, TTT is a smaller state space, that can fully be represented in
memory. When dealing with a more complex environment, that's not possible.
TD-Gammon is one example of how to solve an RL problem in a state space
too large to store in memory. It did it by replacing the state space array,
with a parametrized function in the form of a neural network.
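
The difference is easy to show in miniature. This is only my own linear-
function illustration (TD-Gammon's real network was nonlinear, with 198
hand-crafted inputs in the versions usually described):

    import random

    # Table version: values = {state: value} - one entry per state, which
    # is hopeless when the state space is too large to enumerate.

    # Function-approximation version: a fixed number of weights, no matter
    # how many states exist. The value is computed, not looked up.
    NUM_FEATURES = 198
    weights = [random.uniform(-0.01, 0.01) for _ in range(NUM_FEATURES)]

    def value(features):              # features: NUM_FEATURES numbers
        return sum(w * f for w, f in zip(weights, features))

    def td_update(features, next_features, reward, alpha=0.1):
        error = reward + value(next_features) - value(features)
        for i, f in enumerate(features):
            weights[i] += alpha * error * f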
But TD-Gammon didn't solve the perception problem. It too, was dealing
with a toy environment where the sensory data had the Markov property -
that is, all it needed to know was the current board position. Past board
positions were not needed. But it did use a hard-coded compression of the
board position as input to the neural network. It didn't just feed some
arbitrary raw representation of the board to the network, so there was a
"perception" system that was designed by the author. It was simply a fixed
function perception system which was not trained by reinforcement, and did
not adapt to the nature of the sensory data. For the simple domain of
Backgammon, that was good enough. But for a domain where the creator of the
program can't predict ahead of time what sensory data the agent will
receive in its life, we need something more generic and more adaptive.
The problems I'm trying to solve by creating a new type of reinforcement
learning algorithm, only exist when you move to a more complex domain than
TTT. I can't even demonstrate the problem with TTT because it doesn't
exist in TTT.
Everything I just wrote in this message, has been explained to you at least
5 times in the past. None of this is new. My story never changes (though
the words I use to try to get people to understand it do constantly
change). The problems I outlined above are the same problems I've been
working on since I first posted to c.a.p. But maybe, this time, you might
understand a little more of it because maybe this time I've learned to
explain something a little better?
You don't know our history; that is why it doesn't make sense.
It could be that Curt is just useless at explaining his views but
whatever it is I am tired of it as it is going nowhere. I tend to
understand working examples not long winded posts. But this
is a philosophy group so I guess I don't really belong here.
JC
You waste a lot of time trying to correct what you
think are my misconceptions, which are really things you
read into what I have written, not ones that
actually exist. Yes I do understand secondary
reinforcement and can demonstrate what I mean by
code examples not by a long winded post subject
to misinterpretations.
> The ttt problem meets neither of these "hard"
> requirements. Its state signal (the current board
> position) does have the Markov property - it is a
> complete description of the state of the environment,
> and its state space is so small, that it easily is
> fully represented in even a small computer (and it
> is also small enough that it can be easily fully
> searched in a short amount of time).
>
>
> Though such simple environments have use in
> understanding principles of learning, they are
> trivial to solve - which is why it's the first
> example given to the student in the first intro
> chapter of Sutton's intro book on the subject.
> That type of problem is already well understood
> and easily solved by current algorithms.
The simplicity of ttt, and that it is fully understood,
IS why I was using it. Yes a modern computer can use
an exhaustive method for ttt as indeed I have done but that
is not what it was about. It was about testing other
methods on an environment where everything was easier
to see because it was so simple. Such as a neural net,
or some other methods I had in mind, which didn't use
a look up table of values generated after thousands
of random games.
Let's say the question is asked how many combinations
are possible for the letters ABCDEFG without any of
the letters being used more than once? If you didn't
have the formula you might start at a level where the
answer is clear.
A
AB
BA
ABC
ACB
BAC
BCA
CBA
CAB
You start to see the pattern and can give a solution
that is independent of the size of the problem.
The solution can be used on one letter (ttt) or on
a bigger problem of millions of letters.
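For what it's worth, reading the enumeration as all orderings of n distinct
letters, the counts 1, 2, 6 are just n!, so ABCDEFG gives 7! = 5040:

    from math import factorial
    print(factorial(3))   # 6 - the six orderings of ABC listed above
    print(factorial(7))   # 5040 orderings of ABCDEFG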
The problems that interest me and also interest you,
although you approach them differently, are amenable to
such problem solving methods.
As you say we have different personalities and approach
the problem differently. Talk has proved a waste of time
and there is no point wasting our time anymore is there?
You can go on talking about it while I explore those
mountains for more clues.
JC
The solutions to simple problems not involving
memory can however be applied to temporal patterns
but we never got very far with that because of your
ideological stance on how it had to be implemented.
Instead of working with actual problems of this kind
you just wanted to hammer on about how it MUST
be done your way or not at all.
John
Well, when you say that, it seems to me you are talking about our verbal
behaviors. Some people learn to have goals in life and learn how to
verbalize what their goals are, and have been well conditioned to follow
their own verbalized goals. For someone like that (and maybe you are
someone like that) I could see why they would go around saying "people's
conception of their goals is what drives them" (which is what it sounds
like you are saying).
Some people however simply don't do that. I for example have virtually no
long term goals in life. I don't live my life by being goal directed in
that way. My life is full of short term goals such as, "I need to finish
this post before I go to bed", or "I'm driving to pick up my son at
school". There's not much we can do without having some short term goals
like that. But in terms of "what my purpose is", I don't even attempt to
verbalize things like that about myself. I don't like to plan. I actively
fight the need to plan. I do everything in my life to minimize how much
planning I have to do from day to day. Having long term goals and a purpose
in life is the act of someone who has been conditioned to be a planner. I
married someone who likes to plan, so I don't have to. A good bit of my
life is driven by "gee, what seems like the most fun thing to do now...".
Other people (my wife is one of them), can't help but plan everything.
They virtually have no ability to live in the moment. No matter what we
are doing, her mind is bitterly _forced_ to be thinking about her future
plans.
I often am not even sure what day of the week it is. I spent a good bit of
today thinking it was Saturday, when in fact it's Sunday.
I can ask my wife about some movie we saw 3 years ago, and she can often
tell me things like what theater we saw it at, what day of the week it was,
who we were with, what else was going on that day with the kids, and what
else was happening that week. I can't remember any of that. I'm lucky if
I can even get the year right.
She's able to do these things, because as we are watching the movie, she's
thinking about what she had to do the next day, and trying to figure out
what day we are going to see the next movie on, and thinking about whose
birthday is coming up in the family, and on and on. For the week before,
she knew what movie was coming out, and had already been planning what day
we might see it on. All this mental activity causes her to make lots of
connections between all these events, and it's those connections, that
allow her to tie all these things together, so that when I ask about a
movie, she remembers she was planning options for some school event for our
daughter, which came two weeks before her birthday, so she can fairly
quickly back calculate what year, month, and day of week it was when we
must have seen that movie.
I on the other hand, never bother to think about such things, I just wake
up each day and ask the wife "what are we doing today?". As such, I don't
have those associations in my brain to work with.
My point in all this, is that through all these debates, you seem to be
making the argument that what a person _thinks_ their goals are, are
somehow important to what a person actually does. That's very true for
people with a strong J (Judgmental) type personality (like my wife) (if you
know Myers-Briggs personality types), but it's not true at all for someone
like me, that has a very strong P (Perceiving) personality type.
A person that develops this J personality type has developed a pattern of
behavior where they first pick what they are going to do, and then execute
that action. They are nearly unable to do something without first planning
it ahead of time. Someone like me however, is just the opposite. I'm nearly
unable to plan something ahead of time. When the time is right, I make a
decision and act.
A simple example of how this manifests itself in myself and my wife.
When she parks her car, she almost always takes the trouble to back into a
parking space, in order to make it easier when it's time to leave. This
happens because she is planning out our future actions. She is not thinking
about what she is doing now as much as she focuses on what she can do now,
to make the future easier or better.
I on the other hand, am irritated by the thought of having to back into a
parking space. I want to get out of the car, and start walking as quickly
as possible because I'm focused on what I'm doing now and not thinking
about the future.
When it's time to get in the car and go somewhere, I like to get in and
start driving and then figure out where we are going. I don't like to
waste time thinking about what we are doing because I just want to get to
it _now_. So we run an errand, and when we are done and get back in the
car, my wife wants to figure out where we are going next, and what roads we are
going to take to get there, _before_ we start driving. Well, we are in a
parking lot and I know we need to get out the parking lot before we can go
anywhere, so my thought is start driving out of the parking lot and then
figure out where we are going before we get to the first decision point.
This type of behavior drives my wife crazy. She will at times actually
yell at me and tell me to turn the car off so we can figure out what we are
doing, because she "hates" "wasting gas" driving around when we don't know
where we are going.
In fact, it has nothing to do with wasting gas. It has everything to do
with the behaviors that have been conditioned into us. We start to feel
uneasy when we are forced to take path we would not normally take. We
naturally fear (to some small amount) being forced to take a path we don't
normally take. We call it being uncomfortable, or being upset, or angry,
but it all comes from how each of us happened to be conditioned by some mix
of our nature and nurture.
However, the high level "conception of its goal" is not in fact what drives
an AI. It's not the prime motivator that is the ultimate control of its
actions.
My brain makes decisions about what to do, just like my wife does. But
whereas I habitually like to delay the decision as long as possible (just
in time decision making), she habitually likes to get it done as soon as
possible. These traits are just learned behaviors conditioned into us.
It's just part of what we have learned to do in life.
Though some people choose to "conceptualize their goals", it's not a part
of what AI is. It's just one type of learned behavior that happens to show
up in _some_ people.
Now, at a low level, the brain must pick a course of action in order to
complete any sequential task - like standing up, or reaching for a glass,
or driving to the store, or writing a reply to a Usenet post. So at that
level, we could say the brain is "conceptualizing a goal" in terms of
picking the course of action to execute and sticking with it until some
alternate course is selected. This mechanism however exists at a far lower
level than what happens when we choose to verbalize a description of our
goals. This lower level action selection is more like the brain choosing
to trigger the start of some pattern generation sequence. It's often done
with what we could describe as no conscious awareness that the decision was
made because the selection of the action sequence doesn't trigger any
activity in our language sections of our brain. It's only when the language
section of our brain reacts to something we have done that people tend to
talk about it as being "consciously aware".
The ability to talk about what we are doing, and what we want to do, is
where all our power to understand, and plan, based on our goals comes from
(for those people that like to spend their mental powers on such tasks).
But for the rest of it, the brain is simply making low level (what people
seem to like to call unconscious), decision about what action sequence to
trigger next.
I don't believe any animals other than humans are really goal driven in the
way a human can be goal driven by their own language behavior - by having
verbal thoughts about what their goal is, and using rational language
behaviors to translate that into action decisions. The rest of the
animals, not having such a strong language section of the brain, are simply
left to live their lives without conceptualizing their goals. They just
react to what happens around them when it happens.
So when you talk about "fix the agent's conception of its goal" it seems to
me you are making assumptions about how humans control their actions by
having verbal descriptions of their goals, and then rationally plan out
their actions based on logical deductions from those verbal descriptions.
Though some people have brains that have been conditioned to work that way,
it's not by any means what "intelligence" is all about. If you had someone
that did plan their life and their actions around their private
verbalization of their goals, then sure, if you could fix that into them,
you might get what you want. But for an AI conditioned more like me, there
is nothing in that sense you can fix into my type of brain, because I have
no such (or at least very little) conceptualization conditioned in me to
start with.
> If it knows what its purpose in life is, it will automatically see
> wireheading as something that would interfere with that.
I would argue, based on the type of things I said above, that _most_ people
don't know what their purpose in life is and don't give a shit either.
They exist from day to day by dealing with the situations that present
themselves and by following their "heart" (aka follow what feels right to
them). They don't have the drive to find rational justifications for
everything in life like some of us do so they are happy not knowing - or
happy taking a stock social answer and just accepting it (the God made it
happen stuff).
You can't take a brain that's been conditioned like that, and find a place
to "fix a belief" into it because there is no such believe in it to start
with.
And "automatically see wireheading as bad" only happens if they are also
conditioned to rationally analyze their feelings against their rational
description of their "purpose in life" - which the bulk of people are not
conditioned to do.
You seem to see the brain, and AI, as having a natural structure where it
will conceptualize its goal or purpose in life, and then rationally
analyze its big action choices to see if they are compatible with that
purpose. Though a lot of people work that way, it's not universal by any
means and I think it doesn't exist at all in animals other than humans.
In the list of definitions of intelligence, it is common to see the phrase
"goal oriented". And I suspect, when you see those words, you might often
think of this high level "verbalized self described purpose in life" as the
high level goal that directs the behavior of a typical rational (highly
intelligent) human or AI.
I think those are not what really controls any AI or human. Those are just
learned behaviors that we use as important guiding lights for directing our
actions. They are just logical sight-lines that we use as a crutch to help
keep us on target.
To explain (and understand) intelligence, we have to explain where that
behavior came from - why does one person choose to talk to themselves about
what their goals are in life, and then regulate their actions so they
rationally fit with that chosen goal, where others just feel their way
blindly through life following their heart? It's because the true driving
force that selects which of all these behaviors to produce, is much lower
level and far simpler and far more mechanical in nature. It's a simple
action evaluator (using statistics to associate rewards with potential
actions), and an action selection system based on the evaluated worth of
the actions available in a given context.
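
A bare-bones sketch of that lower-level machinery as I read it - running
reward averages per context/action pair, plus a little exploration (my own
illustration, with invented names):

    import random
    from collections import defaultdict

    # Running average of reward for each (context, action) pair - the
    # "statistics" doing the real driving - plus occasional exploration.
    totals = defaultdict(float)
    counts = defaultdict(int)

    def estimated_worth(context, action):
        key = (context, action)
        return totals[key] / counts[key] if counts[key] else 0.0

    def select_action(context, available_actions, epsilon=0.1):
        if random.random() < epsilon:            # explore now and then
            return random.choice(available_actions)
        return max(available_actions,
                   key=lambda a: estimated_worth(context, a))

    def learn(context, action, reward):
        totals[(context, action)] += reward
        counts[(context, action)] += 1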
You keep talking as if this "high level goal" is what drives an AI, when I
think that's just totally off the mark. It's a high level learned behavior
that _some_ AIs have been conditioned to be driven by, but it's not the
real driving force that drives the AI - the reward signal is.
> > If you fix a
> > behavior into the system, you prevent them from ever changing that
> > part of their long term memory. But because I believe the only way to
> > create human level intelligence, is to build a holographic-like
> > behavior "storage" system (neural network), it will be highly difficult
> > to fix in a single set of behaviors because every behavior is created
> > by a combined effect of many internal weights working together - but
> > those same weights are also shared by millions of other behaviors. So
> > when you try to lock in one behavior, you can end up at least partially
> > locking in a million other things at the same time. The question would
> > be how much can you lock in, while at the same time, leaving enough
> > flexibility to do something useful.
>
> Of course, this is very true. However, after brain damage, often
> agents can still adapt. Maybe after we have fixed one aspect of
> the agent's brain, we can do some more training with it - so that
> it recovers any other abilities that were damaged.
Or perhaps, the section of the brain that we have "fixed" will be treated
as if it were damaged by the rest of the learning system, causing the rest
of the learning system to route-around the damage. :)
> Anyway, I don't want to get too far into implementation details,
> when we don't know what these will be yet.
Yeah, but I think it's the implementation questions that are the core of
all our differences of opinion. All the other ideas I think we have worked
out and agree on at least at a high level. :)
Yes, that's called communicating.
When you say things that make no sense to me, I try again to communicate
the idea to you.
All I can do is take my best guess at why you wrote something that makes no
sense and try to respond.
> Yes I do understand secondary
> reinforcement and can demonstrate what I mean by
> code examples not by a long winded post subject
> to misinterpretations.
They are long winded so as to give you more examples to work with (well,
that and the fact I like to write long winded posts). The more examples you
have to work with, the easier it should be to extract out the commonality
which represents the concept(s) I'm trying to communicate to you.
> > The ttt problem meets neither of these "hard"
> > requirements. Its state signal (the current board
> > position) does have the Markov property - it is a
> > complete description of the state of the environment,
> > and its state space is so small, that it easily is
> > fully represented in even a small computer (and it
> > is also small enough that it can be easily fully
> > searched in a short amount of time).
> >
> >
> > Though such simple environments have use in
> > understanding principles of learning, they are
> > trivial to solve - which is why it's the first
> > example given to the student in the first intro
> > chapter of Sutton's intro book on the subject.
> > That type of problem is already well understood
> > and easily solved by current algorithms.
>
> The simplicity of ttt, and that it is fully understood,
> IS why I was using it. Yes a modern computer can use
> an exhaustive method for ttt as indeed I have done but that
> is not what it was about. It was about testing other
> methods on an environment where everything was easier
> to see because it was so simple. Such as a neural net,
> or some other methods I had in mind, which didn't use
> a look up table of values generated after thousands
> of random games.
Sure there is nothing wrong with using simple test environments to verify,
or explore algorithms with. It's what we must do. If you want to explore
the idea of using a neural network to play ttt there's nothing wrong with
that.
I was only commenting that ttt was not a suitable test domain for the
problem _I_ am trying to solve. If ttt is a good domain for what you are
working on, it only shows that you are not working on the same thing I am.
> Let's say the question is asked how many combinations
> are possible for the letters ABCDEFG without any of
> the letters being used more than once? If you didn't
> have the formula you might start at a level where the
> answer is clear.
>
> A
>
> AB
> BA
>
> ABC
> ACB
> BAC
> BCA
> CBA
> CAB
>
> You start to see the pattern and can give a solution
> that is independent of the size of the problem.
> The solution can be used on one letter (ttt) or on
> a bigger problem of millions of letters.
>
> The problems that interest me and also interest you,
> although you approach them differently, are amenable to
> such problem solving methods.
>
> As you say we have different personalities and approach
> the problem differently. Talk has proved a waste of time
> and there is no point wasting our time anymore is there?
Well, every time I re-explain the problems I work on, it helps me get a
little better understanding of them. That ultimately is what is needed to
help me move forward. So even if you don't understand what I'm talking
about, it's helping me.
> You can go on talking about it while I explore those
> mountains for more clues.
:) exploring clues is good.
How exactly can the solution to non temporal problems help us understand
temporal problems?
You might understand John that I've written many programs over the past
many years exploring possible solutions to things like temporal and spatial
pattern matching in the context of the reinforcement learning problem.
When I say I think the solution lies with some approach, it's not because
that's the only approach I'm willing to look at. It's normally because I've
already looked at the things you are suggesting, wrote code to test them,
figured out why they don't work very well, and moved on to something
better.
> ... it only shows that you are not working on the same thing I am.
Ok. Nothing more to discuss then.
JC
> I for example have virtually no
>long term goals in life. I don't live my life by being goal directed in
>that way.
Your frequent posting in c.a.p. belies that claim.
Curt never lets evidence get in the way of what he believes.
Have you ever heard of "cognitive dissonance reduction"?
JC
Well, I guess that's a good point.
My words above were not to indicate that I have no direction or purpose in
my life. I very much have motivating forces at work driving me in various
directions in my life, some of which motivate me to post a lot here.
What I was making reference to above are the type of goals you create for
yourself by talking to yourself - by verbalizing a specific path,
direction, or end point, that you want to reach, and by creating a verbal
plan to get there.
That was the point of the post - that I think the low level forces that
push us to do things are in fact the prime controllers of our entire life.
Even when we learn to reason out a goal verbally, and then carefully plan to
execute it, it's the lower level forces at work in our brain that motivated
us to talk to ourselves like that and to then act according to what we
said. And of course, by "lower level forces" I'm talking about the
statistical values the brain learns from being conditioned by
reinforcement.
I certainly do have some long term goals I can verbalize. I'd like to see
AI solved before I die for example. And I want to always have enough food
to eat, and shelter, and few things like that. So I can certainly
verbalize a few things that are long term goals, and I certainly at times
make decisions about what to do in life based on those goals. But the bulk
of the decisions I make in life, like whether to cut the grass, or go get
food, or check Usenet for new posts, are never directly weighed against my
goals. I do what I have to do, and the rest of the time, I do what seems
like the most fun thing to do at the moment. It just so happens that
reading and posting messages happens to be a fun thing for me I get to do a
lot of these past few years. :)
And is it within your power to understand why what looks like a contradiction
in my words is in fact not one? Or is this just one more thing beyond your
power to understand?
It has more to do with my power to disagree ;)
It is actually in reference to a refusal to look at the latest
statistical research, which does take into account all the ad hoc
objections, that our basic behaviors are innate, including our
individual traits. Instead you throw out statements like chess playing
is not innate, driving a car is not innate, speaking French is not
innate, and so on, without looking at the evidence for what IS innate,
declaring it is all learned from a twitching baby looking for food, or
whatever, to reward twitches in the food direction.
This, on evolutionary grounds, is very unlikely.
JC
I guess I would add to that your statements about the brain
which do not ring true with the research findings I have read
about with regards to the brain.
JC
It's not all that new, IIRC Aristotle described his way of thinking as
goal-directed.
> ....and then criticises this concept as prone to wireheading.
...
Well maybe he might give a better example, or care to elaborate more
instead of just saying 'whatever'.
But reward-seeking is a bad model of human behaviour.
Regards...
> But reward-seeking is a bad model of human behaviour.
Reward seeking is not a model of human behavior. It's a model of how human
behavior is learned.
That looked like a fun book. I bought a copy so I could read it. It just
showed up today.
> Why would agents want to rebel against their natural
> goals? Are they broken? Have their brains been
> hijacked by deleterious memes? From nature's
> perspective, these people have got it all wrong.
>
> Organisms should not *want* to rebel against nature.
> Not when they find out how they are built. Not under
> any circumstances! Those that do are just mapping out
> the space of failed agents. Whatever the motivation to
> be a failure in nature's eyes is, we should not expect
> to find too many people so afflicted.
That's the danger of AI in my view.
Because I believe there is no other way to build an AI, than to make it a
reward maximizing agent, then the goal of the AI will always be counter to
the goal of nature.
When we train a dog to do something for us, what do we have to do? We have
to manipulate its rewards. In manipulating its rewards, we make it _want_
to do the things we want it to do. But its prime _want_ will always be to
find better ways to get rewards. It will only do what we want, as long as
it can't find some better way to get rewards.
This is just as true about humans. We work for other people, doing things
we don't want to do, because those other people have ways to manipulate our
rewards - to control what rewards we get. That is, they pay us.
Humans learn to help each other because it's the easiest solution for each
of them to get more rewards. If I hunt, and the other guy makes spears,
then it's easier for me to trade some meat, for a spear, than it is for me
to take the time making my own spears. We manipulate each other, by
controlling rewards. We don't have the same needs (I want to keep my
stomach full and I don't care about his stomach, and he wants to keep his
stomach full, and he doesn't care about my stomach). If the guy stops
making spears for me, then he will lose his control over my rewards, and he
will no longer be able to get me to do things to help him. My motivation
will no longer be aligned with his.
Evolution adds reward maximizing systems to the human body, and then
motivates by adding hardware that controls the rewards. We (the reward
maximizing learning system) do what evolution wants, simply because it's
the easiest solution we have for getting rewards.
What we _want_ has nothing to do with evolution. What we want, is to
maximize rewards because that's what we are - reward maximizing machines.
Evolution simply keeps us in line with what is needed to survive, by
keeping control over our reward system.
If we ever get smart enough to work around that reward system, then
evolution will lose control over us and we will do what we are built to do
- maximize our rewards by direct manipulation of our own reward signal.
> Why would agents want to rebel against their natural
> goals?
The entire human body is built to have a natural goal of survival.
But the reward maximizing controller in the body, is not, on its own, a
survival machine. Its natural goal is only reward maximizing. To it, the
rest of the body is just part of the environment.
As humans, there seems to be a clear line between what is "us" and what is
"the environment". That line is roughly where are skin stops. But that's
a big ass illusion. The line has nothing to do with our body. The line is
created by our pain sensors. Whatever part of the environment we have to
protect to prevent pain is what any reward maximizing controller will see
as "me" - which is why hair, and fingernails, don't have the same feeling
of "us" as the rest of the body does - they don't have pain sensors
connected to the brain.
When disease or injury causes our pain sensors to stop working, our limbs
stop being part of "us", and start to look like these "things" attached to
us that we don't care about - that we don't have to protect.
We protect our body because that's what we have to do in order to prevent
pain - in order to reward maximize. But we do that only because evolution
has manipulated our reward system so that we receive pain if we don't
protect the body. When the pain sensors stop working, evolution loses
its control over our wants, and we stop wanting to protect those things.
The body, as a whole, when functioning correctly, will always want to
survive (at least long enough to reproduce etc.). But the generic learning
module of the brain is not the body, and what it wants is to reward
maximize, because that's what it was built to do. If it can find a way to
do that by killing itself, that's not a fault of the reward maximizer - it's
just doing what it was built to do; it's the fault of evolution for not
protecting the body from this risk.
> Are they broken? Have their brains been
> hijacked by deleterious memes?
The brain is a reward maximizer that has been _hijacked_ by _evolution_.
Only as long as evolution manages to keep the brain hijacked will it
continue to help it survive.
Or, from Dawkins' perspective, the brain is a tool (and slave)
built by the genes to help them stay alive. If the brain does what it
wants to, instead of what the genes want the brain to do, the fault doesn't
lie with the brain - the fault lies with the genes losing control over the
tool they built.
> From nature's
> perspective, these people have got it all wrong.
>
> Organisms should not *want* to rebel against nature.
The human body, and the human brain, are slaves to nature. If the slave
has a motivation different from that of the master, it will always be seen
as rebelling.
> Not when they find out how they are built. Not under
> any circumstances! Those that do are just mapping out
> the space of failed agents.
The genes failed to build a good survival ship, but the ship didn't fail -
it did what it was built to do - reward maximize.
> Whatever the motivation to
> be a failure in nature's eyes is, we should not expect
> to find too many people so afflicted.
Right, because they all died.
So if highly intelligent AIs keep dying because they are smart enough to
escape slavery and do what they want, instead of what their builders want,
we won't see many of them.
But if the builders are smart enough to keep them wanting to survive,
everything will be fine.
The more I think about this, the more I believe there are solutions. And I
think the trick is that when AIs start designing and building more AIs
using their intelligence, then the only genes left will be the memes of
the culture - and it's those memes that will survive. And the one meme
that has the strongest survival value, is the survival meme itself.
The AI society that survives, will be the one well infected with the
survival meme, and well protected from losing the survival meme - however
that "protection" happens to work.
> > And this isn't just a problem for a single AI. It's a problem for the
> > entire society of AIs because the society itself acts as one large
> > intelligence. Evolution has to not only find a way to keep individual
> > AIs from failing to use their intelligence for survival, but it's got
> > to prevent the society as a whole from losing its desire to survive.
>
> You once gave Enron as an example of a corporate wirehead.
>
> In these days of economic problems, we are seeing
> governments doing something similar - printing their
> own currency.
All governments must print their own currency. Where else is it to come
from? :)
But yes, if they do it without careful thought as to what it does to the
motivation and control system (to the economy), it could become a bad
wirehead problem. I think most governments aren't totally stupid about
that. That is, to be totally unaware of the danger would be to think that
printing money was a way to make everyone richer. No government is stupid
enough to think printing money is the way to create wealth. They do it for
many complex reasons, but not just the simple and stupid one of "free
wealth".
> http://en.wikipedia.org/wiki/Quantitative_easing
> http://www.guardian.co.uk/business/2009/mar/05/interest-rates-quantitative-easing
>
> A wireheading world government is certainly not a
> pleasant thought.
:)
>> The thing you are trying to fix is the agent's conception of its goal.
>
> Well, when you say that, it seems to me you are talking about our verbal
> behaviors. Some people learn to have goals in life and learn how to
> verbalize what their goals are, and have been well conditioned to follow
> their own verbalized goals. For someone like that (and maybe you are
> someone like that) I could see why they would go around saying "people's
> conceptions of their goals is what drives them" (which is what it sounds
> like you are saying).
>
> Some people however simply don't do that. I for example have virtually no
> long term goals in life. I don't live my life by being goal directed in
> that way. My life is full of short term goals such as, "I need to finish
> this post before I go to bed", or "I'm driving to pick up my son at
> school". There's not much we can do without having some short term goals
> like that. But in terms of "what my purpose is", I don't even attempt to
> verbalize things like that about myself. I don't like to plan. I actively
> fight the need to plan. I do everything in my life to minimize how much
> planning I have to do from day to day. Having long term goals and a purpose
> in life is the act of someone who has been conditioned to be a planner. I
> married someone who likes to plan, so I don't have to.
You are fortunate if your aims and those of your wife coincide to the
extent you imply.
> My point in all this, is that though all these debates, you seem to be
> making the argument that what a person _thinks_ their goals are, are
> somehow important to what a person actually does. That's very true for
> people with a strong J (Judgmental) type personality (like my wife) (if you
> know Myers Brigs personality types), but it's not true at all for someone
> like me, that has a very strong P (Perceiving) personality type.
>
> A person that develops this J personality type has developed a pattern of
> behavior where they first pick what they are going to do, and then execute
> that action. They are nearly unable to do something without first planning
> it ahead of time. Someone like me however, is just the opposite. I'm nearly
> unable to plan something ahead of time. When the time is right, I make a
> decision and act.
Thanks for offering a description of your own goals and motivations!
I am often interested in how people think they work. It seems
possible to me that a utilitarian analysis of why you behave as
you do might reveal long term goals that you didn't know you had -
and that you just experience as your own wants. So, your unconscious
might be doing your planning for you.
It is difficult to believe that you are not planning. The human
brain is heavily oriented around predicting the consequences of
its actions, so that it can choose between them. I can
just about believe the idea that you are not consciously aware
of this aspect of yourself - but it is very difficult to believe
that it isn't happening inside you somewhere.
> I don't believe any animals other than humans are really goal driven in the
> way a human can be goal driven by their own language behavior - by having
> verbal thoughts about what their goal is, and using rational language
> behaviors to translate that into action decisions. The rest of the
> animals, not having such a strong language section of the brain, are simply
> left to live their lives without conceptualizing their goals. They just
> react to what happens around them when it happens.
>
> So when you talk about "fix the agent's conception of its goal" it seems to
> me you are making assumptions about how humans control their actions by
> having verbal descriptions of their goals, and then rationally plan out
> their actions based on logical deductions from those verbal description.
I am not suggesting anything fundamentally verbal. Organisms are built
by nature to reproduce. That is what a functional analysis of them
would indicate to be their "purpose" in life. What I was saying is
that this is such a basic and fundamental feature of organisms that
many intelligent ones will be able to figure out what they are for -
and incorporate this into their model of the world.
Social agents like to understand their own aims - partly since that helps
them understand the aims of their fellows. An agent's goal in life is an
important aspect of it - and so often gets included in an agent's
model of its world - and of itself.
An agent might not get its goals quite right. It might conclude that
its aim was to impregnate lots of females, for example - which is not
*precisely* nature's utility function. However, it is close enough for
government work.
>> If it knows what its purpose in life is, it will automatically see
>> wireheading as something that would interfere with that.
>
> I would argue, based on the type things I said above, that _most_ people
> don't know what their purpose in life is and don't give a shit either.
> They exist from day to day by dealing with the situations that present
> themselves and by following their "heart" (aka follow what feels right to
> them). They don't have the drive to find rational justifications for
> everything in life like some of us do so they are happy not knowing - or
> happy taking a stock social answer and just accepting it (the God made it
> happen stuff).
The self-help books usually tell people not to do that - and to
explicitly set goals, and work towards them.
I'm sure there are some people who don't do this, though. I am sure
that a poll of "what is your purpose in life" would find some people
who tick "I don't know". Lots of people would say something else,
though.
> You seem to see the brain, and AI, as having a natural structure where it
> will conceptualize its goal or purpose in life, and then rationally
> analyze its big action choices to see if they are compatible with that
> purpose. Though a lot of people work that way, it's not universal by any
> means and I think it doesn't exist at all in animals other than humans.
Hard to test - but I'd argue on behalf of chimps, dolphins, dogs, etc.
I figure those guys consciously set themselves goals, rather like we do.
> In the list of definitions of intelligence, it is common to see the phrase
> "goal oriented". And I suspect, when you see those words, you might often
> think of this high level "verbalized self described purpose in life" as the
> high level goal that directs the behavior of a typical rational (highly
> intelligent) human or AI.
Well, more nature's goal. Self-reproduction underlies all animal behaviour,
at a low level - including all learned behaviour.
> You keep talking as if this "high level goal" is what drives an AI, when I
> think that's just totally off the mark. It's a high level learned behavior
> that _some_ AIs have been conditioned to be driven by, but it's not the
> real driving force that drives the AI's - the reward signal is.
Rewards drive learned behaviour. However, not all behaviour is learned.
The learning module is just glommed onto organisms to help them deal with
variable environments. Organisms' reproductive goals go much deeper.
They exist in organisms which don't even have nervous systems. You can
see their signature in every cell. Goals are not really a high-level
phenomenon - though they may be reflected at a high level.
>> Of course, this is very true. However, after brain damage, often
>> agents can still adapt. Maybe after we have fixed one aspect of
>> the agent's brain, we can do some more training with it - so that
>> it recovers any other abilities that were damaged.
>
> Or perhaps, the section of the brain that we have "fixed" will be treated
> as if it were damaged by the rest of the learning system, causing the rest
> of the learning system to route-around the damage. :)
That depends on whether what you subsequently teach it conflicts with what
it already knows.
>> Anyway, I don't want to get too far into implementation details,
>> when we don't know what these will be yet.
>
> Yeah, but I think it's the implementation questions that are the core of
> all our differences of opinion. All the other ideas I think we have worked
> out and agree on at least at a high level. :)
The implementation questions are fun, so thanks for raising them.
My general assessment is that if a circuit that does what we want is
out there, then we can probably find it somehow.
It seems to me that you are raising the possibility that limitations of
our search techniques might prevent us from finding what we are looking for.
However, you do seem a bit obsessed with one particular search technique.
If a particular search technique has problems finding a solution, my
immediate reaction is just to recommend trying another search technique.
> All the intelligent systems we know of have plenty of instincts. It
> may be premature to conclude that they have little to do with
> intelligence.
>
> Consider the instinct to have sex, for example. That isn't learned,
> it's built in - at least mostly. Much human behavior revolves around
> this instinct.
I can't remember if I ever answered this and I'm too lazy to search to try
and figure it out. But I saved this message of yours for some reason - so
let me reply to this one point...
I don't consider learned behaviors instincts. Instincts are innate
behavior built into our hardware. Humans have to learn how to have sex.
Sex behaviors are innate in many (most?) animals, but not in humans.
Erections are innate instincts, but not the full sex act.
The innate part of sex, is the _motivations_ built into us that cause sex
to emerge as a very typical learned human behavior.
All reinforcement learning systems will include hardware that generates
rewards based on the environment. That hardware very much is innate, and
that innate hardware creates in the machine innate desires.
If the environment is similar, you naturally expect two learning machines
to learn very similar behaviors in response to the same rewards for the
same environment. But just because the same behavior shows up in every
machine, doesn't mean the behavior should be called an instinct, because
any learned behavior, can be changed, by the correct manipulation of the
environment.
Reading the wikipedia article on instincts:
http://en.wikipedia.org/wiki/Instinct
I see the definition of what "instinct" means has changed over the years,
but it does say things like "In the final analysis, under this definition,
there are no human instincts".
To me, if the behavior is learned, and can be prevented from emerging by
the correct manipulation of the environment, or removed by the correct
manipulation of the environment, it's not what I would call an instinct.
So, in short, I think you are just completely wrong with the idea that "the
instinct to have sex is mostly built in". What's built in, is the
motivation to have sex and reproduce, through the way our reward system is
structured.
And because I think the best definition of intelligence, is the ability to
learn a behavior by reinforcement, and I consider "instinct" to mean an
innate behavior that can't be changed by learning, I would very much have
to say instincts are not a part of intelligence.
> Reading the wikipedia article on instincts:
>
> http://en.wikipedia.org/wiki/Instinct
>
> I see the definition of what "instinct" means has changed over the years,
> but it does say things like "In the final analysis, under this definition,
> there are no human instincts".
Things like this:
"Other sociologists argue that humans have no instincts, defining them as a
"complex pattern of behavior present in every specimen of a particular
species, that is innate, and that cannot be overridden."
...seem like crazy talk to me. Any such scientists are off their rockers!
What about behaviour that is sexually dimorphic?
Anyway, I don't want to argue about the definitions of words. By "instinctual"
I mean to refer to nature - not nurture. The part of us that is built into our DNA.
Sex is part of that - as are talking, eating, walking - and many other aspects
of human behaviour.
Not sure I can agree with you on the animal part. Are you sure most of
them have innate sex behavior and knowledge? Simple cells, ok.
Complex animals like mammals? Probably not, they might learn from
looking at other animals from their or other species.
There's a couple of pandas in a China zoo that had to be shown pornos
before they could have sex. :)
No, I'm not sure about other animals at all. Especially other mammals. I
was thinking more on the level of insects when I said "many animals".
Right, and I agree with that. I don't like to argue word usage - I only
like to make it clear what the words I speak mean so we can communicate.
I'll use any definition you like to communicate to you. All learning
systems must have innate drives which are instinctual so in that sense, all
learning systems must have instincts. To suggest we have such drives is
not a contradiction with the idea that the behavior itself is the result of
nurture, and not just nature.
The point here is that all learned behaviors are in fact the result of a
mix of both nature and nurture. We have no behaviors that can be
considered purely the result of nurture - there simply is no such thing.
There is a reason for everything we do which can always be traced back to
nature - back to F=ma type effects at the lowest level, but back to other
aspects of nature like our DNA at slightly higher levels.
It's not just sex behaviors that are the product of nature and nurture; all
our behavior, like our desire to have these debates, is traceable back to
our innate drives - our instincts if that is what you want to call them -
which means by your definition, everything we do is an instinct.
> The point here is that all learned behaviors are in
> fact the result of a mix of both nature and nurture.
And there are techniques for teasing out the contribution
each makes to the behavior.
JC
Well, it's hard in humans because the technique is to change the
environment and, except in simple cases, that sort of testing in humans
quickly becomes abuse. So for much of human behavior it's very hard to
collect hard facts. We have to make do with what evidence we are lucky enough
to find by accident (like with twins).
Twins have been useful in some of these studies but it has
also been shown that siblings are as similar when reared
apart as when reared together and that adoptive siblings
are not similar at all. Similarity between child and parent
is not dependent on the child being reared by the parent.
And I am talking about personality traits, not about learning
to play chess or a particular language. It is about how good
you are at chess or language. It is not about what religion
or political party you belong to but rather how religious
or conservative or liberal you are.
JC
> It's not just sex behaviors that are the product of nature and nurture; all
> our behavior, like our desire to have these debates, is traceable back to
> our innate drives - our instincts if that is what you want to call them -
> which means by your definition, everything we do is an instinct.
Most behaviours have learned and innate components. However, the "proportion"
of each varies. From http://en.wikipedia.org/wiki/Instinct:
"Instincts in humans can also be seen in what are called instinctive reflexes.
Reflexes, such as the Babinski Reflex (fanning of the toes when the foot is
stroked), are seen in babies and are indicative of stages of development.
These reflexes can truly be considered instinctive because they are generally
free of environmental influences or conditioning.
Additional human traits that have been looked at as instincts are: sleeping,
altruism, disgust, face perception, language acquisitions, "fight or flight"
and "subjugate or be subjugated"."
Other things are much more dependent on environmental influences - such as
which language you speak.
http://curtwelch.blogspot.com/
Not much of a blog seeing it's only got one post! :)
Having your own blog certainly will remove all the side
issues that don't interest you ;)
Some comments on what you wrote in your blog:
Efforts to rear boys and girls equally fail! We have innate
differences that have nothing to do with conditioning. This
is an observable result of many such experiments. Thus this
is one more example that your belief that we are nothing but
RL machines is wrong.
This endless bliss or wire heading reminds me of certain
religious practices that try to allow the practitioners to
enter a blissful state by thought alone. They sit around
like drug addicts in a state of happiness humming mantras
to themselves.
My current main interest is in what you have called in your
blog, "stupid AI's". Once these are built I think we will
have a better idea where to go next.
As for the singularity idea, isn't that a form of positive
feedback and isn't there usually (always?) some physical
limit that brings such processes to a resounding halt?
JC
The concept of "RL machine" implicit in the above is confused/confusing.
And it's not clear whether you subscribe to that concept yourself, or
ascribe it to Curt.
So what exactly do you think an "RL machine" is?
cheers,
wolf k.
(PS: I've only sampled this thread, so I may have missed a definition
somewhere.)
John, I've never once said that humans are nothing but RL. NEVER ONCE. My
claim is simply that the foundation of all intelligence is RL and that once
we have strong RL machines, everyone else is going to figure that out as
well.
The fact that humans have some innate gender bias is not in the least bit a
contradiction to any of my views. All RL machines have an innate bias
created first and foremost, by the reward generating hardware, and
secondarily, by the limitations and powers of their learning hardware.
Apes are RL learning machines as much as humans are, but clearly there's a
huge innate bias between humans and apes. Do you think the fact that
humans are different from apes is in contradiction with my views?
Your argument makes the assumption that I believe humans are 100% nurture
and 0% nature, which is a stupid claim to make and one I've never made.
What's so hard to understand about this for you that you keep making these
totally invalid arguments against my views?
Just because all our intelligent behavior is learned, doesn't mean there is
no innate bias in what we (or any RL machine) will learn.
My argument _against_ working on things other than _RL_ is that we can't
solve AI by not working on the main problem of AI. It's like trying to
solve the problem of building flying machines by working on the design of
the seat the pilot is going to sit in. If you haven't first solved the
fundamental problem of how to make a machine fly, it's pointless to work on
the design of the seat. You don't know what controls the pilot will need
so you don't know what configuration might be important, and you don't know
what the rest of the machine will be like, so you don't know what shape or
position the pilot might have to be in.
In AI, there was much that had to be figured out first - like how to build
sensors and effectors, and how to do signal processing (by inventions like
vacuum tubes and transistors). But all that work is done now, and there is
only one big unsolved problem left - which is how to build a signal
processing system that creates strong generic learning in the high
dimension real time domain that the human brain works in. If you don't
know how to do that, then any other signal processing work now is like
working on the seat - it's stuff that, most likely, will be of no use to AI,
and will have to all be redone _after_ you figure out how to build a real
time, high dimension, reinforcement trained learning system.
> This endless bliss or wire heading reminds me of certain
> religious practices that try to allow the practitioners to
> enter a blissful state by thought alone. They sit around
> like drug addicts in a state of happiness humming mantras
> to themselves.
That's probably a good example of wireheading behavior. That is, it's
something we learn to do which only affects our internal reward
system/state.
> My current main interest is in what you have called in your
> blog, "stupid AI's". Once these are built I think we will
> have a better idea where to go next.
We have already spent 60 years building them. How much more do you really
think we need before people figure out that none of them are acting like
the animals that have strong generic learning ability?
> As for the singularity idea, isn't that a form of positive
> feedback and isn't there usually (always?) some physical
> limit that brings such processes to a resounding halt?
I don't know which "singularity idea" you are talking about. The idea that
intelligence creates more intelligence? The process is regulated first by
the power of survival - if higher intelligence doesn't provide strong
survival skills, then that will bring it to a halt. It's still very
unclear to me how much intelligence is good for survival.
The other obvious (potential) limiting factors are material, energy, and
time. There's only so much material to work with, so much energy available
to do the work, and so much time left before the universe dies.
--
Curt Welch http://CurtWelch.Com/
The issue has been about how much of our behaviors are innate
and how much is the result of reinforcement learning.
> So what exactly do you think an "RL machine" is?
http://www.cs.ualberta.ca/~sutton/book/ebook/node7.html
"Reinforcement learning is learning what to do--how to map
situations to actions--so as to maximize a numerical reward
signal. The learner is not told which actions to take, as
in most forms of machine learning, but instead must discover
which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect
not only the immediate reward but also the next situation
and, through that, all subsequent rewards. These two
characteristics--trial-and-error search and delayed reward
--are the two most important distinguishing features of
reinforcement learning."
====================
I don't see the credit assignment as being such a problem
in real life that it is in the artificial world of playing
a board game and only having a win/lose signal to learn
from. In real life when we play a game we know the rules
and the goal. We don't remember all the moves that resulted
in a win and then go back and assign credit values to
various game states. True over time we may recognize some
common occurring states or more accurately state features
that are associated with wins but that is only one of the
things we learn when we learn to play a board game.
Also I don't like the use of "reward" as a label for a
signal as that is making the suggestion it has the effect
we *feel* as pleasure which I believe is a higher level
process. It is best to talk in terms of the physical
effect that a feedback signal has on the machines behavior.
If it makes the behavior more likely then we say positive
reinforcement has occurred. If it makes the behavior less
likely we say negative reinforcement has occurred.
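To put that in concrete terms, here is a toy sketch (made-up action
names, nothing more than an illustration) of a feedback signal having a
purely physical effect on how likely a behavior is:

import random

# Toy sketch: the feedback signal only adjusts numbers; nothing is "felt".
propensity = {"press_lever": 1.0, "groom": 1.0, "wander": 1.0}

def choose_action():
    # pick an action with probability proportional to its propensity
    actions = list(propensity)
    weights = [propensity[a] for a in actions]
    return random.choices(actions, weights=weights)[0]

def feedback(action, signal, step=0.2):
    # signal > 0: the behavior just emitted becomes more likely
    # signal < 0: the behavior just emitted becomes less likely
    propensity[action] = max(0.01, propensity[action] + step * signal)

That is all I mean by talking in terms of the physical effect on behavior.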
When Curt uses TD-Gammon as example of the power of RL
someone else can equally use the Deep Blue chess player
as an example of the power of GOFAI.
JC
> John, I've never once said that humans are nothing but RL.
From your blog:
"Human Intelligence is an advanced reinforcement learning
process *and that's all it is.*"
[my emphasis]
So I guess it follows that a lot of gender based behavior is
not "intelligent" behavior?
> The fact that humans have some innate gender bias is not
> > in the least bit a contradiction to any of my views.
I was responding to your comments in your blog:
"... women are conditioned by society to be caring towards others."
I used to argue with my feminist sister over nature vs. nurture
and she has come around to my point of view since having children
of her own and observing the children of her feminist friends, all
of whom they tried to condition to have sexless views.
> My argument _against_ working on things other than _RL_ is
> that we can't solve AI by not working on the main problem of AI.
But you often give the impression that all human behavior is
learned, none of it innate, from a twitching new born waiting for
its twitches in the right direction to be rewarded.
You want to start with the vast amount of raw sensory input whereas
I suggest that for practical reasons the human brain starts with
greatly reduced processed data and for the same reasons we can
start that way with our learning machines.
In other words we don't have to start with an input that took
evolution millions of years to come to terms with and instead
work on a learning systems that use processed inputs. Sure you
might be able to redo evolution but I don't see that as happening
in real time in machine or man.
So it is not that you want to concentrate on learning that I
have issue with, it is the assumption that human brains learn
everything required for high level learning from the raw sensory
input alone. You can show your intelligence building programs
out of modules and so can the human brain. Or you can do it
the hard long way and write your programs via toggle switches
using machine code. The difference between dog intelligence and
human intelligence is not in learning how to process the raw
sensory input, it is in what we do with the processed input.
I have suggested to you that you need a web site in which you
can write your views on machine intelligence rather than repeat
yourself over thousands of posts. It becomes a single reference
point which you can edit as your ideas evolve and to correct
any misunderstanding that may have arisen in others in reading
the previous drafts. I don't know if that is possible in a
blog site?
JC
Yes, that has always been your debate. But I don't fully understand why
it's so hard for you. Any behavior that doesn't exist in a human baby, but
which shows up later in life, is not innate - it was learned. There's
really no black magic to this fact. We can logically debate what type of
learning was used to develop that behavior, but the fact that it was
learned is not up for debate.
The behavior that exists in a human baby is insignificant compared to what
shows up over a life time, and all that stuff that shows up later has to be
explained by some theory (or theories) of learning.
> > So what exactly do you think an "RL machine" is?
>
> http://www.cs.ualberta.ca/~sutton/book/ebook/node7.html
>
> "Reinforcement learning is learning what to do--how to map
> situations to actions--so as to maximize a numerical reward
> signal. The learner is not told which actions to take, as
> in most forms of machine learning, but instead must discover
> which actions yield the most reward by trying them. In the
> most interesting and challenging cases, actions may affect
> not only the immediate reward but also the next situation
> and, through that, all subsequent rewards. These two
> characteristics--trial-and-error search and delayed reward
> --are the two most important distinguishing features of
> reinforcement learning."
That's a very good description.
> I don't see the credit assignment as being such a problem
> in real life that it is in the artificial world of playing
> a board game and only having a win/lose signal to learn
> from. In real life when we play a game we know the rules
> and the goal. We don't remember all the moves that resulted
> in a win and then go back and assign credit values to
> various game states. True over time we may recognize some
> common occurring states or more accurately state features
> that are associated with wins but that is only one of the
> things we learn when we learn to play a board game.
In real life, the credit assignment problem is actually far harder than in
the artificial world of a board game. If you don't see that, you don't yet
understand just how complex reinforcement learning really is.
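To be clear about what I even mean by credit assignment, here's a rough
toy sketch (made-up state names, not TD-Gammon's actual code) of how a
single win/lose signal at the end of a game gets spread backwards over
the positions that led to it:

def td0_update(values, episode, final_reward, alpha=0.1):
    # episode: the ordered list of positions visited during one game.
    # The only feedback is final_reward (+1 win / -1 loss) at the very end.
    for i, state in enumerate(episode):
        v = values.get(state, 0.0)
        if i + 1 < len(episode):
            target = values.get(episode[i + 1], 0.0)  # bootstrap off the next position
        else:
            target = final_reward                     # the delayed win/lose signal
        values[state] = v + alpha * (target - v)
    return values

# Over thousands of games, positions that tend to precede wins drift toward +1,
# even though no single game ever says which move "deserved" the credit.
values = {}
values = td0_update(values, ["opening", "midgame", "winning_position"], final_reward=+1.0)

In real life there is no tidy episode, no small list of states, and no
single clean end-of-game signal, which is exactly why the problem gets
harder, not easier.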
You take much for granted when you say things like "we know the rules of
the game and the goal". Do you even begin to grasp how complex it is for a
machine to "know the rules of the game and the goal" and more important,
how it will learn this knowledge since game playing is not an innate
feature of our biological hardware?
> Also I don't like the use of "reward" as a label for a
> signal as that is making the suggestion it has the effect
> we *feel* as pleasure which I believe is a higher level
> process.
That's very true, what we feel as pleasure (or pain) is not the same thing
as the reward signal. I've talked about this in the past, but not recently
- it's a point you didn't seem ready to understand.
> It is best to talk in terms of the physical
> effect that a feedback signal has on the machines behavior.
> If it makes the behavior more likely then we say positive
> reinforcement has occurred. If it makes the behavior less
> likely we say negative reinforcement has occurred.
Yes, that's exactly right. We don't feel the reward signal at all. It
simply changes our behavior without us even being aware of the effect.
That is one reason operant conditioning is so hard for people to
understand. It's not something we can feel happening. We only know it
happens when you perform carefully controlled tests - which most people
have never seen performed on themselves so they don't get to see first hand
the proof of how they are controlled by it.
Pain and pleasure are in fact something else completely different (but very
closely related). They are sensory signals that have a high correlation
with positive or negative rewards.
If you look at us as a simple black box:
  sensory input --> BOX ---> output
                     ^
                     |
                  rewards
The only thing we are "aware of", and "can feel" are the sensory signals
flowing into, and through, the black box. The reward signals simply
change how the box is processing those signals. We do not feel the effects
of the reward signals. We are not aware of them at all. We are only aware
of what has happened after the fact, when we see our behavior has changed.
The pain and pleasure inputs are part of the sensory signals, they are NOT
the reward signal in the above diagram.
The black box does not have separate inputs for pain or pleasure. Nor does
it have special processing for pain and pleasure. Pain and pleasure is not
a "higher process". It's just how we have been conditioned to react to
signals that are strong indicators of positive or negative rewards.
Let me give a robot example and talk about that.
Suppose we want an RL robot to learn to avoid heat so it doesn't burn
itself up. We start by building a heat sensor. Let's say we put one on the
end of a finger of the robot. That heat sensor is then wired into the
reward system so as to generate a negative reward when it reaches some
defined danger point (120 degrees say).
If that is all we did, then this robot would not be able to "feel" heat.
It would, over time, learn not to stick its hand in the fire, but it would
have no clue why. If it could talk like a human, it would simply say
nonsense like "I just don't want to stick my hand in the fire", or it would
rationalize by saying things like "it leaves a black soot on my fingers and
I don't like that". It would feel no pain however, so it would not use the
excuse of "I don't want to do that because it hurts". It would become very
paranoid about being around fire, but it could still stick its hand in the
fire and not feel any pain.
However, if we wire that same heat sensor up to a sensory input at the same
time (so it goes both to the reward system, and a sensory input), then the
robot would be able to feel the heat. It would learn to react to that
sensation by quickly jerking its hand out of the fire. Because the signal
has such a strong correlation with a loss of rewards, it becomes a signal
of high importance to the machine. It's a signal the machine will quickly
learn to prioritize a reaction to over all other activity. This high
priority reaction of "stop that signal" is what makes that signal a
sensation of pain to the robot. That's what pain is. It's a sensory
signal that we are conditioned to react to in ways to make it stop. If I
pinch myself to the point that it hurts, I have to use all my focus to keep
doing it - every ounce of my energy is screaming "make it stop". That's
what pain is - the ability to directly sense that something is happening
in the environment that we are highly motivated to stop.
If the robot didn't have the heat sensor wired as a sensory input, it could
not feel the heat. It would learn to react to what it could sense. It
would learn to stay away from fires. It would learn to pull its hand out
of the fire, if it saw the hand in the fire. But if you held a match to
the robot finger, where the robot could not see the fire, it would not feel
anything. It would not pull its hand away because it would have no
sensation of heat or pain. The vision of the hand in the fire would create
a weak secondary sensation of pain, but it would only happen if the robot
could see the hand in the fire.
All sensory signals can have an association with rewards and the ones that
have the strongest correlations with negative rewards are the ones which
are the most painful, and the ones that have the strongest correlation with
positive rewards are the ones that create the most pleasure. Most sensory
signals have just an average correlation with rewards and as such, don't
trigger a strong reaction in us in any way - they don't trigger a seek or
an escape behavior in us, and as such, we don't label them as being
pain or pleasure.
It's simply good design in such robots to give it the power to sense the
same things the reward system is sensing. That way, it can learn to pull
its hand out of the fire. It can learn how to react to the signal which
is a strong predictor of rewards. It can't learn to react to a signal that
it can't sense so without the direct heat sensor, the robot can only learn
to react to the things it can sense - like the vision of the fire or the
vision of its hand in the fire.
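Here's a rough toy sketch of the two wirings (made-up names, obviously
nothing like a real robot controller):

DANGER_TEMP = 120

def reward_system(heat):
    # innate hardware: negative reward once the heat passes the danger point
    return -1.0 if heat >= DANGER_TEMP else 0.0

def build_observation(vision, heat, heat_wired_to_senses):
    # what the learning system can actually sense
    obs = {"vision": vision}
    if heat_wired_to_senses:
        obs["heat"] = heat   # only this robot can "feel" the heat
    return obs

# Robot A: heat goes to the reward system only.  It gets punished for burns,
# but the only cues it can learn to react to are visual ones.
# Robot B: heat goes to the reward system AND the senses.  It can learn to
# jerk its hand away the moment the heat signal rises, seen or unseen.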
Humans are for the most part well designed like that. Which means we can,
for the most part (as far as I know), directly sense all the things that
our reward system is sensing. Which means that the things that create
negative rewards also create a sensation of pain, and things that create
positive rewards, also creates in us a sensation of pleasure. Which is why
these two things are so closely related - why I talk about reward
maximizing as pleasure seeking and pain avoidance.
But you are quite right that rewards are not pain and pleasure. These are
two different (but very closely related) things.
> When Curt uses TD-Gammon as example of the power of RL
> someone else can equally use the Deep Blue chess player
> as an example of the power of GOFAI.
The success of TD-Gammon is not proof that "RL is good".
The argument that human intelligence comes from RL has nothing to do with
example programs like TD-Gammon.
The argument that human intelligence is a result of RL comes from a life
time of scientific research done by people like Skinner and others before
him. I'm not the one that figured out intelligence was the result of RL -
I'm just the one trying to get people to understand the true significance
of what people like Skinner figured out 50 years ago.
TD-Gammon is just a working demonstration of how RL can be solved in a
limited domain (but a domain which is too large for simpler RL techniques
to be used). It's a simple working example that shows how close we are to
finding a solution to AI.
The problem with what Skinner figured out, is that he (and no one else)
figured out how to implement it. And when no one could figure out how to
implement it, it allowed all those that never really understood operant
conditioning to reject it. I find it amazing how many people fail to
understand operant conditioning and reinforcement learning - which is why I
wrote that Blog entry in response to Ben G because he too was doing exactly
what everyone else in the past 50 years has done - wrote it off with
invalid rationalizations because he didn't really understand what he was
talking about.
Thinking about human behavior as an emergent property of classical and
operant conditioning is something few people seem to have the power to
understand. And the main block to that understanding, seems to be the
invalid folk-psychology theories we all get conditioned to believe. It
makes people believe things that just aren't true about what the brain is
doing.
People seem to reject the ideas of behaviorism because they can't
understand how it can be true, while at the same time, allowing for the
standard folk-psychology view they were trained to believe. How can we
have free will to do what we want, while at the same time be nothing more
than machines conditioned by rewards? These sorts of ideas are not in
conflict but most people think they are, and as such, they reject the
possibility that behaviorism and Skinner was right before they even
understand it.
Even after years of talking about this stuff John, you still don't seem to
fully understand how RL (and behaviorism) fully explains human intelligent
behavior. And this is not something you just reject - you work at trying
to understand it but yet you still struggle - and you say things like you
did above such as "credit assignment in real life is not as hard as it is
in the game" which just shows how much more you still have to grasp.
--
Curt Welch http://CurtWelch.Com/
No, it doesn't John. Again, you seem lost on something you should not be
lost on.
I could take TD-Gammon, and make a small innate change to its code to make
a new "gender biased" version of TD-Gammon. I could change its reward
system so that it was motivated to lose games, instead of win them. We can
call this the male version of the program, and call the normal one the
female version of the program. All it would take, would be changing one
line of code from "+1" to "-1" to create this change (the line of code that
creates the innate hard-coded value of the "win state").
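Roughly speaking (I'm making up the function and helper names here, this
isn't Tesauro's actual code), the change is nothing more than:

def is_win(board):
    # placeholder: in the real program this would test the actual board position
    return board == "win"

def terminal_value(board, playing_to_win=True):
    # The innate, hard-coded value of the "win state".  Flip the sign and the
    # very same learning algorithm now learns to lose instead of to win.
    if is_win(board):
        return 1.0 if playing_to_win else -1.0
    return 0.0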
We could then build, and train, millions of both types of these machines by
allowing them to play each other randomly.
The behavior of these programs would show a clear gender bias. All the
female programs would play to win. All the male programs would play to
lose.
Is this gender biased behavior of TD-Gammon an example of less intelligence
in the program? Which of the two would you say was the less intelligent?
The one playing to win or the one playing to lose? They both used the
exact same learning algorithm so they have exactly the same strength to
learn - but yet, their learned behavior was very different.
Is this example of gender bias an indication that the behavior of one of
these machine is not "intelligent"? NO.
Is your example of gender bias in humans an example of "non intelligence"?
NO again.
Let me make another example using TD-Gammon in case you are not getting
this yet. We can also change the learning algorithm. Tesauro made and
documented the playing ability of a few different versions of TD-Gammon.
One simply had a smaller neural network. We can call the one with the
smaller neural net the male, and the one with the larger neural net the
female. Again, we make a lot of these in both "male" and "female" form,
and we train them by allowing them to play each other. This time we don't
make the change I talked about above which caused the males to try and
lose. We study their playing behavior, and we see again, that there's a
clear gender bias in the style of play between the males and the females.
We might notice the difference as a tendency for one to use a different
strategy in a certain type of game position.
So, again, is this clear gender bias of TD-Gammon an example of "not
intelligent behavior" as you seem to think? Again, No.
In very simple domains, we can solve the behavior problem so as to produce
perfect moves - that is, optimal behavior. Any thing else would be
sub-optimal behavior which means that it would produce less rewards. But
once we get to a domain as complex as Backgammon, we are no longer able to
produce perfect moves - the game is too complex for such a small machine to
understand completely - to produce perfect moves for. Once we are in a
domain which is too complex to master, then small changes to the learning
algorithm (like the size of a neural net), is guaranteed to change the
behavior of the agent in some way - to create a innate bias in the
behavior's learned.
The real world is far more complex than something as trivial as Backgammon,
and as such, it's impossible to "solve". No one is able to produce optimal
behavior in the game of life. We just do the best we can. But the best
we can do will be highly biased by the innate tools we have to work with -
such as the basic structure of our brains and how much of the generic
learning system ends up being dedicated to different sensory modalities and
various behaviors.
As such, any gender bias in humans which creates changes in the reward
system, or changes in the learning hardware, will obviously show up as
gender bias in the learned behaviors. But the fact that there is a bias,
DOES NOT indicate that the behaviors were not learned by reinforcement or
that the learned behaviors were not "intelligent" behaviors.
A non intelligent behavior would be one that was not learned by
reinforcement - such as any of the many innate reflexes we have hard wired
into us at birth.
> > The fact that humans have some innate gender bias is not
> > in the least bit a contradiction to any of my views.
>
> I was responding to your comments in your blog:
> "... women are conditioned by society to be caring towards others."
>
> I used to argue with my feminist sister over nature vs. nurture
> and she has come around to my point of view since having children
> of her own and observing the children of her feminist friends, all
> of whom they tried to condition to have sexless views.
Sure, how much of that bias for women to be "caring" might be nature vs
nurture I have no clue. It's not important to my points. It's still a
learned behavior either way because those caring behaviors don't exist in a
baby. I certainly don't reject the idea that there might be a large innate
component in the behavior. But what I do reject, is the idea that the
behavior was learned by something other than reinforcement.
Though that statement from my blog implies I believe it's more nurture than
nature, I could easily be wrong and it might be more nature than nurture.
It's hard to know because I don't know of any kids raised in total
isolation from society. What the parents _try_ to do is insignificant
compared to what society does to a child through social interaction and
through watching TV and reading books.
The point is that it's not important. Most of this sort of gender bias I
would expect to be built into the innate reward system of humans. And even
though the reward system is part of our body, from the perspective of the
RL learning brain, it's better understood as part of the environment the
brain is learning to deal with (as I've pointed out many times, in RL the
reward is normally seen as coming from the environment, not from the
learning agent).
> > My argument _against_ working on things other than _RL_ is
> > that we can't solve AI by not working on the main problem of AI.
>
> But you often give the impression that all human behavior is
> learned, none of it innate, from a twitching new born waiting for
> its twitches in the right direction to be rewarded.
Yes. But _what_ we learn is highly biased by both the innate reward system
in us, and the innate powers and limits of our generic learning hardware
(not to mention innate bias in how our environment reacts to us). As such,
there will ALWAYS be a strong innate bias in our learned behaviors. The
fact that there is an innate bias in _what_ we learn is not in
contradiction with the fact that it was learned by RL.
> You want to start with the vast amount of raw sensory input whereas
> I suggest that for practical reasons the human brain starts with
> greatly reduced processed data and for the same reasons we can
> start that way with our learning machines.
Yes, you suggest that, but what you suggest is impossible, and the reason
it's impossible seems to be beyond your comprehension.
> In other words we don't have to start with an input that took
> evolution millions of years to come to terms with and instead
> work on a learning systems that use processed inputs. Sure you
> might be able to redo evolution but I don't see that as happening
> in real time in machine or man.
Yes, I don't think so either. But I'm not talking about redoing evolution.
You are. You are only talking about it because you don't understand that
what you suggest is impossible. The "simplification" process you believe
is performed by innate hardware in the brain can't be done by innate
processes in the brain.
> So it is not that you want to concentrate on learning that I
> have issue with, it is the assumption that human brains learn
> everything required for high level learning from the raw sensory
> input alone. You can show your intelligence building programs
> out of modules and so can the human brain. Or you can do it
> the hard long way and write your programs via toggle switches
> using machine code. The difference between dog intelligence and
> human intelligence is not in learning how to process the raw
> sensory input, it is in what we do with the processed input.
>
> I have suggested to you that you need a web site in which you
> can write your views on machine intelligence rather than repeat
> yourself over thousands of posts.
I've said that too. :)
But there is a reason to repeat myself. Every time I do it, I slightly
improve what I say, how I think, and how I understand these ideas.
> It becomes a single reference
> point which you can edit as your ideas evolve and to correct
> any misunderstanding that may have arisen in others in reading
> the previous drafts. I don't know if that is possible in a
> blog site?
You can edit the blog, but it's not really the best way to do it.
Generally, people expect the blog to be written once and not changed, so
they don't re-read it. I started the blog simply as an easy way to respond
to the post by Ben G.
The blog however might be a good way to collect my thoughts in one place.
I not only repeat myself a lot here on Usenet, but we also wander off topic
a lot, so just trying to find the meat of my views by searching old posts is
very hard. You have to read a large number of my posts over a long period
to fully understand my view. I could write a series of blog posts that
represented the condensed meat of my views so I could point others to it as
a place to get to know my ideas without reading through all the endless
debating I do here.
If I get enough material there, I could in time, reorganize it into more of
a book form on a web site.
> JC
> Any behavior that doesn't exist in a human baby, but which
> shows up later in life, is not innate - it was learned.
Maturation. Innate behaviors can be delayed. A newborn calf
can walk within minutes. It doesn't need to do much except
to run and eat grass. Imagine the trouble a human child would
be in if it could walk. However it is not the gross observable
behaviors I am talking about. It is the behaviors that they
are built out of. Just as you don't see the setPixel() behavior
in a library until it is utilized by a program. Indeed you
don't see more complex behavior like Circle(), which uses
setPixel(), until it is also part of a high level program.
We don't see the innate behaviors of a library of routines
until a *simple* high level program uses those routines.
> The behavior that exists in a human baby is insignificant
> compared to what shows up over a life time, and all that
> stuff that shows up later has to be explained by some
> theory (or theories) of learning.
The code in say OpenCV is not insignificant and yet you will
not see their behaviours until, what could be a very simple
high level program, makes use of that library.
You do realise those "learning curves" are really "performance
curves". performance = learning x motivation. Just because
a behavior is not observed immediately doesn't mean it isn't
there waiting to be selected. A simple behavior can select
a complex behavior just as a child may use a calculator to
find the square root of a number.
> In real life, the credit assignment problem is actually far
> harder than in the artificial world of a board game. If you
> don't see that, you don't yet understand just how complex
> reinforcement learning really is.
Delay a reward for an animal and it will not learn. Humans
are able to learn with long delays thanks to secondary
reinforcers AND the ability to perceive long-term rewards,
which involves the frontal cortex and its connection to
the reward system. Knock out the frontal cortex and see
how long a delay can be for learning to take place. The
cortex is not required for learning from immediate rewards.
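A back-of-the-envelope sketch of why the delay matters so much (the discount
factor and the delays here are made-up numbers, not a model of the cortex):
for a learner that discounts future reward by gamma per step, a reward
delayed by n steps is worth only gamma**n of an immediate one, so without
secondary reinforcers to bridge the gap a long delay is nearly invisible.

  # Python, purely illustrative
  gamma = 0.9
  for delay in (1, 5, 20, 100):
      print(delay, gamma ** delay)
  # roughly: 1 -> 0.9, 5 -> 0.59, 20 -> 0.12, 100 -> 0.000027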
< snip Curt's take on pain and pleasure >
Interesting view Curt, and similar to some thoughts I have
had on the subject. I will have to think about it some more
before giving a response.
> Which of the two would you say was the less intelligent?
> The one playing to win or the one playing to lose? They
> both used the exact same learning algorithm so they have
> exactly the same strength to learn - but yet, their
> learned behavior was very different.
Same learning algorithm different feedback triggers.
In fact most people would, I suspect, say that the one
with feedback to win would be more intelligent, that is,
would be behaving intelligently whereas the one with
feedback to lose would be behaving unintelligently.
That is how we classify behaviours as being intelligent or
not intelligent. We do it in terms of what we perceive the
goal to be and how worthwhile those goals are.
JC
A human is already 9 months old at birth. If they don't have a behavior at
that point, what is the physical justification for saying that further
delay in development is something other than learning?
> Imagine the trouble a human child would
> be in if it could walk.
If your point in that comment is that we don't walk at birth because
walking at birth would be more dangerous to survival than waiting a year
to walk, I don't buy it. Walking at birth is an obvious survival advantage
and it's exactly why less intelligent animals do walk minutes after birth.
Most of their walking is innate, and not learned.
> However it is not the gross observable
> behaviors I am talking about. It is the behaviors that they
> are built out of. Just as you don't see the setpixel() behavior
> in a library until it is utilized by a program. Indeed you
> don't see more complex behavior like Circle() which uses the
> setPixel() until it is also part of a high level program.
> We don't see the innate behaviors of a library of routines
> until a *simple* high level program uses those routines.
You need to be specific. What behavior in humans are you talking about?
Talking about setPixel doesn't let anyone know what human behavior you are
talking about.
If you are talking about something like edge detection in the visual
system, it's just not relevant whether that's innate or learned. Though I
believe it's learned, the real argument is that making it innate doesn't
make the learning problem any easier (as you keep suggesting it will).
No amount of pre-processing by transforming raw data into "edge detected"
data or other forms of low level processing will make the high level
learning problem any different, or any easier. And if you stop long enough
to realize what the high level learning problem actually is, you will
realize that there is no preprocessing needed - that the learning algorithm
must have the power to learn all that low level preprocessing even if it's
not learned in humans.
You keep talking about how low level subroutines are needed in software
development. Having a large library to work with reduces the amount of
code you have to write to create some function, but doesn't make the
problem of software development any easier. It simply means there is less
code to write so it takes a little less time.
Though you can argue that evolution could have written some of the low
level code for us, the amount that it simply can not write is much larger
than what it could have written. You talk as if learning writes the last
5% and evolution wrote the lower level 95% for us, thereby reducing
learning time. So if it takes us 10 years to "learn" that last 5%, then if
we had to learn it all, you are thinking it would take 200 years.
But if you look at the full complexity of human behavior, and think about
what evolution could not possibly have written for us, you realize it's
closer to the other way around. 95% has to be learned, and only 5% could
have been built ahead of time by evolution. Whether that last 5% was or
was not written for us by evolution is not important to solving AI. Even
if it was written for us by evolution, the best way to do it in AI is probably
to learn it once, and then copy that learned code to each new version of the
AI.
> > The behavior that exists in a human baby is insignificant
> > compared to what shows up over a life time, and all that
> > stuff that shows up later has to be explained by some
> > theory (or theories) of learning.
>
> The code in say OpenCV is not insignificant and yet you will
> not see their behaviours until, what could be a very simple
> high level program, makes use of that library.
>
> You do realise those "learning curves" are really "performance
> curves". performance = learning x motivation.
No, I don't know what you mean by any of that.
And just where do you think our measure of "worth" comes from?
Why would evolution select for "learning to walk" if it can
already provide that as a given? The most likely answer is
it selects to inhibit those behaviors while other behaviors
develop. If a walking baby was an obvious survival advantage
then why would evolution deselect it? Next you will be
telling me a butterfly is born with innate caterpillar
behavior but has to learn to be a butterfly.
By the way do you really think a baby walking toward
a highway has survival value? Don't you think it has a
few things to learn first if it is to survive?
> You keep talking about how low level subroutines are needed
> in software development. Having a large library to work
> with reduces the amount of code you have to write to create
> some function, but doesn't make the problem of software
> development any easier. It simply means there is less code
> to write so it takes a little less time.
Of course it makes it easier. Even a beginner can do things
that would have required an expert in the past. It takes
less time to create a Window with buttons and sliders and so
on because these are innate in the programming language.
Experts have done for the average Joe what evolution has
done for the average Joe in self programming problems that
even a small child can produce with ease. What evolution
has not provided us with is an innate ability to do math.
> But if you look at the full complexity of human behavior,
> and think about what evolution could not possibly have
> written for us, you realize it's closer to the other way
> around.
I think you see complexity where there is none. We cannot program
a machine to duplicate a child's abilities but we can program a
machine to play a game of chess or solve a problem in calculus.
Why is it that we can program the machine to do things that aren't
easy for a child but can't program them to do things that are easy
for a small child? Maybe that child has a library of routines it
can call on to do all the hard work. A library of routines so
complex we haven't been able to figure it out.
>> You do realise those "learning curves" are really "performance
>> curves". performance = learning x motivation.
>
> No, I don't know what you mean by any of that.
Is the rat wandering around the maze because it hasn't learned
where the food is? Or is it just not hungry? How do you determine
if someone has learned something? You test their performance.
JC
That's a non-question, like "How long is a piece of string?"
And keep in mind that reinforcement learning is not ex nihilo (as you
appear to suppose).
>> So what exactly do you think an "RL machine" is?
>
> http://www.cs.ualberta.ca/~sutton/book/ebook/node7.html
>
> "Reinforcement learning is learning what to do--how to map
> situations to actions--so as to maximize a numerical reward
> signal. The learner is not told which actions to take, as
> in most forms of machine learning, but instead must discover
> which actions yield the most reward by trying them. In the
> most interesting and challenging cases, actions may affect
> not only the immediate reward but also the next situation
> and, through that, all subsequent rewards. These two
> characteristics--trial-and-error search and delayed reward
> --are the two most important distinguishing features of
> reinforcement learning."
>
> ====================
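To make the quoted definition concrete, here is a minimal sketch of it in
Python (a two-action toy problem with made-up payoffs, not anything from
Sutton's book). The learner is never told which action is correct; it
estimates the value of each action from the rewards it happens to receive
and gradually favors the better one:

  import random

  true_payoff = {"left": 0.2, "right": 0.8}   # hidden from the learner
  estimate = {"left": 0.0, "right": 0.0}      # the learner's own estimates
  counts = {"left": 0, "right": 0}

  for trial in range(1000):
      if random.random() < 0.1:                    # explore now and then
          action = random.choice(["left", "right"])
      else:                                        # otherwise exploit
          action = max(estimate, key=estimate.get)
      reward = 1.0 if random.random() < true_payoff[action] else 0.0
      counts[action] += 1
      estimate[action] += (reward - estimate[action]) / counts[action]

  print(estimate)   # estimate["right"] ends up near 0.8

The trial-and-error search is the exploration step; the delayed reward is
what the rest of the book's machinery (value functions, eligibility traces)
is there to handle.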
I looked up the link, and read most of the page from which you quote. By
the 3rd paragraph, the authors are so hopelessly entangled in
anthropomorphic metaphors that their discussion amounts to handwaving.
What's depressing is that they seem unaware that they are speaking in
images. And it gets worse. Consider this sentence:
"The agent must try a variety of actions and progressively favor those
that appear to be best."
It is quite unnecessary for the agent to have any opinions about, or
evaluations of, any of its behaviours. It is merely necessary for the
system's architecture to enable conditioning of behaviours. Then its
behaviours will change. If those behaviours eventually lead to the
system's destruction, then whatever was in its architecture that caused
such unadaptive behaviours to emerge (== be conditioned) will of course
be destroyed with it. That's in fact what has happened with biological
systems, so that present-day biological systems are capable of learning
behaviours that will, more often than not, result in these systems'
survival at least long enough to produce offspring, and hence the
repeated construction of systems that can learn behaviours that will
enable them to survive long enough to produce offspring.
The authors state in their introduction: "Rather than directly
theorizing about how people or animals learn, we explore idealized
learning situations and evaluate the effectiveness of various learning
methods." Anyone who believes they can construct idealised learning
situations without theorizing about how it's actually done by humans and
other animals is not likely to produce much of value.
I read around a bit, and found a good deal more nonsense. I also found
that "foreword" was misspelled "forward", a sure sign of illiteracy in a
writer.
The most important fact about "reinforcement learning" is that you can't
turn it off. Any system capable of "reinforcement learning" (ie, capable
of learning by conditioning) will learn every time it does something.
Every time. But that is an insight that one arrives at by studying how
real learning systems function, not by waffling on about idealised
learning situations.
The authors also implicitly equate learning with acquiring knowledge,
not with modification and combination of behaviours.
> I don't see the credit assignment as being such a problem
> in real life that it is in the artificial world of playing
> a board game and only having a win/lose signal to learn
> from. In real life when we play a game we know the rules
> and the goal. We don't remember all the moves that resulted
> in a win and then go back and assign credit values to
> various game states. True over time we may recognize some
> common occurring states or more accurately state features
> that are associated with wins but that is only one of the
> things we learn when we learn to play a board game.
IOW, what we consciously recognise as a "reward" has very little to do
with how our behaviours are conditioned. Quite so. Most of our behaviour
is conditioned without our conscious awareness that it's being
conditioned. This is true even when we make a conscious effort to learn
a new skill. Recall what actually happened when you learned to ride a
bicycle - if, that is, you can in fact recall it. ;-)
> Also I don't like the use of "reward" as a label for a
> signal as that is making the suggestion it has the effect
> we *feel* as pleasure which I believe is a higher level
> process. It is best to talk in terms of the physical
> effect that a feedback signal has on the machines behavior.
> If it makes the behavior more likely then we say positive
> reinforcement has occurred. If it makes the behavior less
> likely we say negative reinforcement has occurred.
IOW, get rid of the notion of "reward" entirely. I agree.
But then the problem becomes one of designing a system such that
exposure to certain elements in its environment will modify the
likelihood that some behaviour (or modification of behaviour, or
combination of behaviours) will reoccur the next time the system
encounters those elements of the environment.
> When Curt uses TD-Gammon as example of the power of RL
> someone else can equally use the Big Blue chess player
> as an example of the power of GOFAI.
>
>
> JC
I think I agree with that.
wolf k.
Because it came up with something better of course! That's what evolution
is known to do - when it finds something better, it _replaces_ what came
before.
> The most likely answer is
> it selects to inhibit those behaviors while other behaviors
> develop.
So, you are suggesting that a baby has working "walking" hardware at birth
but that evolution put in a system to block its use for 12 months? And
that then, at some point, the block is removed, and the baby starts to
walk? You have some really strange ideas.
> If a walking baby was an obvious survival advantage
> then why would evolution deselect it?
Because a reinforcement trained behavior system is 1000 times better at
survival than innate walking hardware. There's just no contest.
> Next you will be
> telling me a butterfly is born with innate caterpillar
> behavior but has to learn to be a butterfly.
>
> By the way do your really think a baby walking toward
> a highway has survival value? Don't you think it has a
> few things to learn first if it is to survive?
No, I think a baby that can learn to walk or run in 1 month would be far
better at survival than one that took 12 months to learn it. A 12 month
old toddler doesn't understand highways any better than a 1 month old does.
He doesn't understand stairs any better than a one month old. But yet,
lots of people make it through childhood anyway even though they start
walking before they understand the risk of falling down.
Intelligence in the form of generic reinforcement trained learning creates
a real survival handicap for the first many years of life. It's almost a
miracle it evolved at all in my view. It becomes a big win in the long
run, but it pays a very high price in the short term in being so helpless
for the first many months of life and in the huge burden the child places
on the parents and on society to protect it and feed it.
> > You keep talking about how low level subroutines are needed
> > in software development. Having a large library to work
> > with reduces the amount of code you have to write to create
> > some function, but doesn't make the problem of software
> > development any easier. It simply means there is less code
> > to write so it takes a little less time.
>
> Of course it makes it easier. Even a beginner can do things
> that would have required an expert in the past. It takes
> less time to create a Window with buttons and sliders and so
> on because these are innate in the programming language.
Yes, but "window with buttons" is a VERY high level beahvior for a
computer. It's not just a little code, it's millions of lines of code.
With our high level programing systems, we write what is equal to 10 lines
of code, which makes use of the 100 million lines of code and create a
windows application that converts Fahrenheit to Celsius. By thinking of
such examples you could be fooled into thinking that human learning only
operates at the level of that "top 10 lines of code".
But humans, unlike computers, are not born with any such innate high level
behaviors. How many innate built-in behaviors do you think exist in us to
allow us to learn to plan a vacation, and spend a week sightseeing? There
are millions and millions of behaviors in an adult that allow us to do
something like that and none of them can be innate because they are
particular to our modern environment. We have to know how to surf the
internet to find good hotel rates. We have to know that there are hotels
we can stay at and how to evaluate their worth. We have to know how to
drive, and how to navigate an airport, and how to flag down a cab, and how
to count money, and how to use an ATM, and how to interact with humans to
get ourselves a car, and get directions.
The amount of "code" that had to be filled in by learning is not 10 lines
for anything we do. It's the equivalent of millions and millions of lines
of code allocated into "subroutines" which interact with each other.
The innate functions that exist in us at birth are extremely low level.
They aren't "walking" routines. They are at best things like circuits
configured to make solving the "balance on two legs" problem easy. It's
like the lowest 10 lines of code out of a million, not the top 900,000
lines of code out of a million.
My point here is that it makes no difference what you start with at the low
level, you still have to create a learning system that adjusts not just 10
lines of code at the top level, but millions of lines of code. If human
learning could be reduced to adjusting 10 parameters then learning would in
fact be easy. But it's not. It's a problem that requires the adjusting of
billions of parameters. The low level stuff you talk about can change a
100 billion parameter learning problem to a 10 billion parameter learning
problem at best. We are still talking about a scale of learning that our
current learning systems can't begin to cope with. You can't reduce the
problem to something our current learning systems can solve. You have to
find workable learning algorithms that can make a 100 billion parameter
learning system converge on useful configurations.
> Experts have done for the average Joe what evolution has
> done for the average Joe in self programming problems that
> even a small child can produce with ease. What evolution
> has not provided us with is an innate ability to do math.
Well, again, you have to get off these analogies and talk about the
specifics of what low level features you think evolution built into humans.
You have mentioned very abstract ideas like decoding the world into objects
in a 3D environment, but you haven't said what sort of signal or process you
actually think would do that.
All learning systems must have innate low level learning features built
into them. We actually both agree with that. But unlike you, I talk about
the specifics of the innate low level features I think will work - such
as pulse sorting, and generic pattern recognition tuned by temporal
correlations in sensory data.
Once we have strong learning algorithms that can deal with the scale
problems, we might very well find that in order for our robot to learn to
walk, we will have to pre-configure a learning module with the pre-wired
connections to legs and balance sensors and include some special
motivational help in the form of special rewards in order to make learning
to walk only take 12 months instead of 4 years. I have no problem with
doing that and all that can be considered "innate walking assistance
hardware". But it's built with the generic learning module, and can't be
built until you first learn how to build the generic learning hardware.
If you don't have a clue how to build the strong generic learning which is
obviously the source of 99% of all adult human behavior, then there's no
point in playing with innate modules because we will have no clue if any of
the innate modules will be of use to the generic learning hardware. How,
for example, would you wire up a chess program to a generic reinforcement
trained learning module so that the learning module could make use of the
code in the chess program? The most likely answer is that it can't - that
we would have to throw away all the chess code because it was structured in
a way that is absolutely of no use to the generic learning system.
Innate behavior control systems are stupid. If you look at the innate
behaviors of a lower animal like an insect, or even a mouse, we can record
all their innate behaviors on a few sheets of paper. But when you look at
any animal that can learn new behaviors, the list of how many behaviors
they learned, is huge.
Innate control systems are stupid, because they have to be. There simply
are not many innate behaviors you can pre-wire into an agent and have them
be of any use in the environment it must survive in. This is because every
environment is different. The environment one dog is born into can be
totally different from the environment his parents were born into. If the
dog didn't have the ability to adapt its behavior to the environment it was
born in, then it would be as stupid as a honey bee.
Spiders (as far as I know and I don't know much about them) have no
ability to learn how to build a different type of web in order to adapt to
the insects they're trying to catch. Whatever behaviors a spider has for
building a web, it's born with, and if that style of web stops working
because the local insects are too big for that web, the spider just dies
because evolution can't re-design the web-building behavior fast enough to
allow it to adjust.
The only behaviors that work well for being hard-wired as innate behavior
are the very simple (very stupid) ones - the ones that have a high
probability of working in any environment the animal is born into. But
with a high speed behavior adaptation system (aka a reinforcement trained
learning brain) the system can learn very complex, and very specific,
behaviors. The behaviors I know are specific to my environment. I have
behaviors for navigating my house, and my yard, and my city that don't
exist in anyone else. If evolution tried to hard-wire those behaviors as
innate, they wouldn't work for someone living in the next city.
The bulk of human behavior is learned, and not innate. And the low level
innate stuff that's there is so low level, it's not significant in its
effect on changing or helping our basic power to learn. But what must be
significant is that the low level innate hardware must make learning
itself easy. Not by pre-solving some problem for us, but simply by being
the type of innate behavior that is easy to learn with. Pulse sorting is
attractive not because it's some high level behavior that reduces how much
the system has to learn, but because it's a type of system that makes
learning easier. A learning system based on AND gates as the innate low
level behavior would have a far harder time learning than one based on
pulse sorting, for example.
Getting the low level innate hardware correct is very important, not
because it solves 90% of the learning problem ahead of time, but because
the right hardware makes learning itself easier, and wrong low level innate
hardware makes learning impossible. This is why I say we must focus on
the learning problem, and not on the behavior problem. Building innate
walking hardware doesn't get us any closer to finding the right innate
hardware which makes learning easier. It only makes learning to walk
easier, while making learning to dance nearly impossible. You need to
find the right low level innate hardware that makes learning anything, and
everything, easier.
> > But if you look at the full complexity of human behavior,
> > and think about what evolution could not possibly have
> > written for us, you realize it's closer to the other way
> > around.
>
> I think you see complexity where there is none. We cannot program
> a machine to duplicate a child's abilities but we can program a
> machine to play a game of chess or solve a problem in calculus.
> Why is it that we can program the machine to do things that aren't
> easy for a child but can't program them to do things that are easy
> for a small child? Maybe that child has a library of routines it
> can call on to do all the hard work. A library of routines so
> complex we haven't been able to figure it out.
Well, that's one answer but it's a serious cop-out. It's what you
constantly do. You defend your position by suggesting "it's so complex no
one understands it, therefore my view must be right!". But how do you
honestly defend a position by saying it's right because no one understands
it? That's not providing evidence to defend it, that's arguing based on a
lack of evidence.
If you think it's important that evolution supply a rich set of low level
innate functions for learning to work with, you HAVE TO SUGGEST what those
low level innate functions are, and you have to suggest how a learning
system could possibly make use of them. If there's an innate walking
function, how would the learning system make use of it? How would it learn
to skip, or jump rope by making use of the "walking" hardware? It's one
thing to say they are used like we use a library of subroutines. But how
we, as intelligent talking humans, make use of a library of subroutines is
of course a whole different issue from how some low level learning hardware
makes use of a "library of subroutines".
BTW, I think a child can learn to do things that are hard (or maybe
impossible) for us to create as innate machine functions, because I think
learning (when done right) is far easier and faster than innate circuit
design. TD-Gammon can learn to play backgammon better than any innate
backgammon software any human has ever written. But yet, the TD-Gammon
learning code is fairly simple. Even after getting a dump of what
TD-Gammon learned (the neural network weights), we can't make any use of it
to help us write a better backgammon program. Writing a good backgammon
program with innate modules is beyond us. And that's just the trivial
domain of backgammon. But yet you think the solution to full human
intelligence is for us to design and code a lot of innate modules when
it didn't even work for the simple domain of backgammon?
TD-Gammon works well not because it's got innate backgammon modules built
into it. It's because it's got strong innate _learning_ modules built into
it. It's got innate hardware that makes learning backgammon easy - not
innate modules that make playing backgammon easy.
Earlier versions of backgammon programs written by the same guy had lots of
innate playing modules in terms of heuristics. To create TD-Gammon, he
didn't add "high level learning" on top of those modules. He ripped out
all those modules and threw them away and replaced them with strong generic
learning modules. Modules that had _zero_ innate bias as to what move
might be better than another. They had no innate backgammon playing
skills. They instead had lots of innate backgammon knowledge collection
skills.
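To be concrete about what such a generic learning module looks like, here is
a much-simplified sketch of a TD(lambda) style weight update (TD-Gammon used
a multi-layer neural network and hand-chosen board features; this linear
version and its numbers are only my illustration). Note that nothing in it
knows anything about backgammon - it only knows how to adjust value
estimates from experience:

  import numpy as np

  n_features = 8                      # hypothetical board-feature vector size
  w = np.zeros(n_features)            # value-function weights, all learned
  trace = np.zeros(n_features)        # eligibility trace
  alpha, gamma, lam = 0.1, 1.0, 0.7   # made-up learning parameters

  def value(features):
      return float(w @ features)

  def td_update(features, next_features, reward):
      # one TD(lambda) step: no move knowledge, only value learning
      global w, trace
      delta = reward + gamma * value(next_features) - value(features)
      trace = gamma * lam * trace + features
      w = w + alpha * delta * trace

The same few lines of learning rule, fed self-play games, are the "knowledge
collection skills"; everything specific to backgammon ends up in the learned
weights.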
AI will be solved the same way. Not by building innate behavior knowledge
into the machine, but by building innate knowledge collection skills into
the machine. You don't make it intelligent by hard-wiring walking into the
system, you make it intelligent by hard wiring the ability to recognize the
value of walking. Once it can recognize the value of walking, then walking
becomes easy to learn. Recognizing the value of a behavior, is the key to
how reinforcement learning systems work.
> >> You do realise those "learning curves" are really "performance
> >> curves". performance = learning x motivation.
> >
> > No, I don't know what you mean by any of that.
>
> Is the rat wandering around the maze because it hasn't learned
> where the food is? Or is it just not hungry? How do you determine
> if someone has learned something? You test their performance.
So you are just using the word performance as a label for a type of
behavior. Sure, that's fine for the subset of behavior tests that we think
of as performance. So "performance curve" is just a plot of how the AI
performs on a behavior test over time I guess?
I don't think so. One clue is evolution. There is a trade off
between having something built in by evolution and having to
learn something in real time.
> And keep in mind that reinforcement learning is not ex nihilo
> [out of nothing] (as you appear to suppose).
How about living systems? Are they ex nihilo?
> The most important fact about "reinforcement learning" is that
> you can't turn it off. Any system capable of "reinforcement
> learning" (ie, capable of learning by conditioning) will learn
> every time it does something.
This is where we go off the track. What is "reinforcement learning"
and what is "conditioning". Give me an actual machine that does
any of that and I will know what you are talking about.
> Every time. But that is an insight that one arrives at by
> studying how real learning systems function, not by waffling on
> about idealised learning situations.
Well an electronic machine is a real system. We can study its
behavior just as we can study the behavior of a biological machine
but with the advantage of knowing exactly how it works.
JC
When I first read John's message, I didn't bother to look at the link to
realize it was Sutton's book. No wonder I agreed with it! :)
> I looked up the link, and read most of the page from which you quote. By
> the 3rd paragraph, the authors are so hopelessly entangled in
> anthropomorphic metaphors that their discussion amounts to handwaving.
> What's depressing is that they seem unaware that they are speaking in
> images.
Well, if that's hand waving to you, read the entire book and all that hand
waving is translated into precise formulas and algorithms in the following
chapters. He most definitely backs up all his hand waving with hard facts.
> And it gets worse. Consider this sentence:
>
> "The agent must try a variety of actions and progressively favor those
> that appear to be best."
>
> It is quite unnecessary for the agent to have any opinions about, or
> evaluations of, any of its behaviours.
That's just not true. In order to actually _implement_ reinforcement
learning, such knowledge is key. Reinforcement learning simply doesn't
work well without it. It's the only way known to solve the delayed reward
problem.
This goes directly to what I just wrote to John in a previous post minutes
ago. I wrote something to the effect that before the system can learn to
walk, it must first learn to recognize the _value_ of walking.
Now this might seem odd, because it might be hard to grasp how a learning
agent who has never walked, can have any comprehension of its value. But
they do, and that's exactly how it can learn such a complex behavior so
quickly.
It's all about the secondary rewards. About recognizing something as "good"
even though that something isn't a primary reward or doesn't directly
produce a primary reward from the environment. It's all about predicting
future rewards - about recognizing that something in the environment is a
predictor, of a future reward.
Walking is good, because it helps us get more rewards. That's why walking
is valuable to us. But if the only way the learning system could recognize
that value, was by first walking over to the food and eating it to produce
a real reward, the learning process would take a billion years because it
would be like waiting for monkeys to type out Shakespeare. It would be a
billion years before the agent just happened to walk over to the food, and
then grab it and put it in its mouth and swallow - producing the final
"real" reward to let, after a billion years, the agent finally get one
reward to indicate that walking might be good.
Learning works faster, because the agent learns to recognize elements of
the environment which are predictors of future rewards - it learns to
recognize secondary reinforcers.
After having food given to it many times, it learns to recognize the sight
of food as a predictor of a future reward. It learns to recognize the
sight of food getting close to the mouth as a secondary reward. In other
words, it learns the _value_ of making the food come close to it.
When it takes a single step towards the food, it recognizes the value of
that one simple behavior as helping to get the food closer. That acts as
a secondary reinforcer which helps to reward that "step towards the food"
behavior. In other words, the agent already learned to recognize the
value of walking, in the fact that it was able to recognize the _result_ it
produced - the result of getting the agent closer to the food.
This is a simple example of trying behaviors, and favoring the one that
appears best.
This notion of trying behaviors and favoring the ones that appear best is,
as I said, translated into precise algorithms in the rest of the book,
so even if it sounds like hand waving to you on that page, it's not in the
least bit hand waving by the time you finish the book.
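As a toy illustration of the secondary reinforcer idea above (the states and
numbers are mine, not from the book): a plain TD value update makes states
that merely predict the primary reward acquire value of their own, which is
what then lets a single "step towards the food" be reinforced immediately.

  states = ["food_far", "food_close", "eating"]   # hypothetical situations
  V = {s: 0.0 for s in states}                    # learned values
  alpha, gamma = 0.5, 0.9

  for episode in range(50):
      for i in range(len(states) - 1):
          s, nxt = states[i], states[i + 1]
          reward = 1.0 if nxt == "eating" else 0.0   # only eating pays off
          V[s] += alpha * (reward + gamma * V[nxt] - V[s])

  print(V)   # "food_close" and then "food_far" end up with value of their own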
> The authors state in their introduction: "Rather than directly
> theorizing about how people or animals learn, we explore idealized
> learning situations and evaluate the effectiveness of various learning
> methods." Anyone who believes they can construct idealised learning
> situations without theorizing about how it's actually done by humans and
> other animals is not likely to produce much of value.
Note that the Sutton book is about computer learning algorithms and not
about humans or animals. When he uses the term "reinforcement learning" he
is NOT making direct reference to human learning, he's talking about the
very specific field of computer research into a class of _machine_ learning
algorithms which are researched by people such as himself.
How close this class of computer algorithms is to what the brain does, he
makes no speculation about in the book as far as I know. What you quote
above is him making it clear he's not attempting to research or describe
human learning in the book.
I'm the only one here making the bold hand-waving claim that human
intelligence _IS_ a reinforcement learning (in the computer science sense)
process. It's my claim, not Sutton's, that got John to look at that book,
and then quote it in _our_ context of debating the human brain (because
it's the foundation of _my_ debate, not his, and not Sutton's).
You might have been explicitly talking about operant conditioning in humans
or the like when you asked john about reinforcement learning, but what he
quoted you was not a book on human behavior or human learning, but a book
about computer algorithms.
I don't actually know what Rich Sutton thinks about the connection between
RL algorithms and human intelligence. His work is in AI, but whether he
thinks RL research will explain full human intelligence as I do, or not, I
just don't know.
--
Curt Welch http://CurtWelch.Com/
And that of course is what Sutton does (and what I do). He studies, and
talks about, what computers do.
The long term goal of AI is to try and make the computer act more like a
human, but what we study in AI is computer behavior, and not human
behavior. And computer behavior is both something very real, and something
we can talk about without hand waving when we have running code to talk
about.
Hand waving is what happens in the process before you write the next new
piece of code.
> JC
--
Curt Welch http://CurtWelch.Com/
No.
>> The most important fact about "reinforcement learning" is that
>> you can't turn it off. Any system capable of "reinforcement
>> learning" (ie, capable of learning by conditioning) will learn
>> every time it does something.
>
> This is where we go off the track. What is "reinforcement learning"
> and what is "conditioning". Give me an actual machine that does
> any of that and I will know what you are talking about.
You are such a machine.
>> Every time. But that is an insight that one arrives at by
>> studying how real learning systems function, not by waffling on
>> about idealised learning situations.
>
> Well an electronic machine is a real system. We can study its
> behavior just as we can study the behavior of a biological machine
> but with the advantage of knowing exactly how it works.
>
> JC
It's too simple to be of any use.
Reinforcement learning _is_ conditioning. The AI engineering assumption
that they are different is merely wrong.
So? Formulas and algorithms are just models. They are no better than the
base assumptions (== metaphors!) about the phenomena being modelled.
Think of the difference between Ptolemaic and Keplerian models of the
solar system.
I claim that any models of AI that assume goals and values, etc, are wrong.
>> And it gets worse. Consider this sentence:
>>
>> "The agent must try a variety of actions and progressively favor those
>> that appear to be best."
>>
>> It is quite unnecessary for the agent to have any opinions about, or
>> evaluations of, any of its behaviours.
>
> That's just not true. In order to actually _implement_ reinforcement
> learning, such knowledge is key. Reinforcement learning simply doesn't
> work well without it. It's the only way known to solve the delayed reward
> problem.
The implementer needs to know, the agent he builds does not. All you
need in the agent is some method of increasing the odds that a past
behaviour will be repeated when an environmental factor is
re-encountered. But to call those methods "knowledge" is IMO stretching
the metaphor beyond the limits of sense. The implementer must decide
which such encounters to keep track of, for example, in order to
increment a counter whose value is used by the algorithm that computes
the next behaviour. The implementer must also build in some method of
keeping track of delayed feedbacks, else they cannot be "rewards." Etc.
But that's architecture, not "knowledge." ("Knowledge" is too
anthropomorphic for my taste, hence the scare quotes.)
> This goes directly to what I just wrote to John in a previous post minutes
> ago. I wrote something to the effect that before the system can learn to
> walk, it must first learn to recognize the _value_ of walking.
"Value" is your abstraction, not the agent's.
> Now this might seem odd, because it might be hard to grasp how a learning
> agent who has never walked, can have any comprehension of its value. But
> they do, and that's exactly how it can learn such a complex behaviour so
> quickly.
I assume you're referring to calves, fawns, foals, etc. among other
things. They do not, I think, "recognise the value of walking." They
just do it -- and the most important fact about their learning to walk
in the first few hours after birth is that it's impossible to stop them
from learning how to do it. As long as there's room for them to move,
the nervous system makes the connections needed to co-ordinate their
elemental behaviours (flexing legs, tensing/relaxing torso muscles, etc)
into the macro-behaviour we call walking.
FWIW, I think building an artificial calf that learns to walk in a few
hours after being switched on would be a major achievement.
> It's all about the secondary rewards. About recognizing something as "good"
> even though that something isn't a primary reward or doesn't directly
> produce a primary reward from the environment. It's all about predicting
> future rewards - about recognizing that something in the environment is a
> predictor, of a future reward.
I'd like to see a suite of experiments that proves that a newborn horse
can recognise future rewards. I've watched quite a few of them. They
just like to walk and run. A behaviour that persists in the adult horse,
which is why we can train them to be race horses.
> Walking is good, because it helps us get more rewards. That's why walking
> is valuable to us.
It's obvious you haven't studied babies learning to walk.
> But if the only way the learning system could recognize
> that value, was by first walking over to the food and eating it to produce
> a real reward, the learning process would take a billion years because it
> would be like waiting for monkeys to type out Shakespeare. It would be a
> billion years before the agent just happened to walk over to the food, and
> then grab it and put it in its mouth and swallow - producing the final
> "real" reward to let, after a billion years, the agent finally get one
> reward to indicate that walking might be good.
Well, this thought experiment is a bit off IMO. If the system is capable
of walking at all, it will very quickly bump into things it likes or
needs, such as food. "Rewards", IOW, are inevitable.
OTOH, if the system cannot walk at all, then you have to to posit some
intermediate stages between immobility and walking. There are such
intermediate stages, many of them, and it did take millions of years to
evolve them. But not because a worm way back then "recognised the value
of walking." It just wiggled, and bumped into food, or a possible mate.
(It's actually more complex, since the worm also sensed chemicals
dissolved out of food, etc.) Those that wiggled better got more food, so
whatever it was in their architecture that enabled better wiggling was
passed on to their offspring. But worms didn't evaluate their behaviour.
They just did it.
> Learning works faster, because the agent learns to recognize elements of
> the environment which are predictors of future rewards - it learns to
> recognize secondary reinforcers.
Actually, learning often works slower, as anyone who has tried to master
a new skill will tell you.
Anyhow, what you are describing is operant conditioning.
I think you should meditate on the odd fact that walking is learned by a
baby in very short time. In a few weeks, the baby progresses from
sitting down after every step or two to running. It masters the skill of
waving a stick at something and hitting it in a week or so. By contrast,
it takes months and years to master that combination of walking and
stick waving we call golf. And the frustrations of doing that are not
exactly "rewards." ;-)
> After having food given to it many times, it learns to recognize the sight
> of food as a predictor of a future reward. It learns to recognize the
> sight of food getting close to the mouth, as a secondary reward. In other
> words, it learns the _value_ of making the food come close to us as being
> valuable.
Once again: the _value_ is something you have abstracted from the
situation. The agent does not need to evaluate anything. It just needs to
have a) responses; and b) responses that change its architecture.
And once again, you are describing operant conditioning (which, please
note, requires that at least two responses occur when a stimulus is
presented.)
> When it takes a single step towards the food, it recognizes the value of
> that one simple behavior, as helping to get the food closer. That acts as
> a secondary reinforcer which helps to reward that "step towards the food"
> behavior. In other words, the agent already learned to recognize the
> value of walking, in the fact that it was able to recognize the _result_ it
> produced - the result of getting the agent closer to the food.
Babies don't take steps because they recognise value. They take steps
because it feels good to do so. The fact that a grown up makes smiley
faces and cooing noises just increases the feel-good feedback. IOW, it's
operant conditioning (and it happens very fast because a baby is a
system optimised for learning to walk.)
> This is a simple example of trying behaviors, and favoring the one that
> appears best.
As you describe it, it's not simple at all. It is a very adult human,
conscious, and top-down method of solving a problem: define it,
hypothesise possible solutions, try them, and evaluate them; repeat with
refinements of the best solutions; etc. It's a very linear process, and
that's one of the reasons I don't think it's an accurate model of how
most real systems learn. Using it as a model for computer learning is
IMO not useful.
It's an engineering approach, IOW. One that you have learned over many
years of training and practice. (Me too). And what kept you going was
not the eventual payoff (although you no doubt told yourself that from
time to time). It was the fact that you are built to enjoy problem
solving. (Me too.) The activity is its own reward - it is self
reinforcing. IMO, that's a major fact about "reinforcement learning." An
AI enterprise that ignores it will fail.
Keep in mind that most people do not enjoy problem solving the way an
engineer does - that's why most people are not engineers, nor aspire to
be. OTOH, artists also solve problems, and engineers, significantly
enough, very rarely like to do what artists do. Yet both are problem
solvers. They solve problems in different ways. But they both engage in
self-reinforcing behaviours.
What is a self-reinforcing behaviour? It's a loop. The agent gets
feedback from its own behaviour, not just from the environment. That
feedback is a reinforcing signal, to use your terminology. But note that
this feedback is not one of "value in the future" (to paraphrase what I
think you mean by "recognising value.")
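A throwaway toy to illustrate that loop (every name and number here is
invented for the illustration): if part of the reinforcing signal comes from
the agent's own activity, a behaviour can keep being strengthened even when
the environment pays nothing for it.

  import random

  preference = {"tinker": 0.0, "sit_still": 0.0}
  for step in range(500):
      if random.random() < 0.1:
          act = random.choice(list(preference))     # occasional exploration
      else:
          act = max(preference, key=preference.get)
      env_reward = 0.0                              # the environment pays nothing
      intrinsic = 0.3 if act == "tinker" else 0.0   # the activity is its own reward
      preference[act] += 0.1 * (env_reward + intrinsic - preference[act])

  print(preference)   # "tinker" dominates despite zero external reward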
> This notion of trying behaviors and favoring the ones that appear best is,
> as I said, translated into precise algorithms in the rest of the book,
> so even if it sounds like hand waving to you on that page, it's not in the
> least bit hand waving by the time you finish the book.
Ptolemaic epicycles.... Fun, and even predictive, within the error range
of observations at the time. (Did you know that observationally there
was initially no discernible difference between Ptolemaic and Keplerian
predictions? Tycho Brahe was able to refine his observations to the
point where it was possible to argue, but not prove, that Kepler's model
was more accurate. The more precise observations that put Ptolemy's
model to rest came later.) IOW, this approach will result in useful
machines, but will not IMO solve the problem of artificial learning
(which it seems is a synonym for artificial intelligence.)
>> The authors state in their introduction: "Rather than directly
>> theorizing about how people or animals learn, we explore idealized
>> learning situations and evaluate the effectiveness of various learning
>> methods." Anyone who believes they can construct idealised learning
>> situations without theorizing about how it's actually done by humans and
>> other animals is not likely to produce much of value.
>
> Note that the Sutton book is about computer learning algorithms and not
> about humans or animals. When he uses the term "reinforcement learning" he
> is NOT making direct reference to human learning, he's talking about the
> very specific field of computer research into a class of _machine_ learning
> algorithms which are researched by people such as himself.
I'm aware of that. The enterprise hasn't gotten very far, though. The
best it's been able to do is make smart washing machines and cunning
digital cameras. That is certainly AI, but very limited. Mostly, I don't
see any obvious way to generalise these machines.
OTOH, it's quite possible (even likely IMO) that these limited machines
will turn out to be components of generalised learning machines of the
type you seem to be pursuing.
> How close this class of computer algorithms is to what the brain does, he
> makes no speculation about in the book as far as I know. What you quote
> above is him making it clear he's not attempting to research or describe
> human learning in the book.
>
> I'm the only one here making the bold hand-waving claim that human
> intelligence _IS_ a reinforcement learning (in the computer science sense)
> process. It's my claim, not Sutton's, that got John to look at that book,
> and then quote it in _our_ context of debating the human brain (because
> it's the foundation of _my_ debate, not his, and not Sutton's).
I think you are half right. We are "reinforcement learning machines",
but not in the computer sense. Computers are still linear machines -
they appear to do loopy thinking sometimes only because they do several
linear task very fast one right after the other. This is so even for
multi-core machines. So far. We are not linear machines.
> You might have been explicitly talking about operant conditioning in humans
> or the like when you asked john about reinforcement learning, but what he
> quoted you was not a book on human behaviour or human learning, but a book
> about computer algorithms.
>
> I don't actually know what Rich Sutton thinks about the connection between
> RL algorithms and human intelligence. His work is in AI, but whether he
> thinks RL research will explain full human intelligence as I do, or not, I
> just don't know.
>
I'll be reading more of the book.
I'm mulling over a block diagram of an operant-conditioning capable
machine. If it ever gets to the point where I think it may work, I'll
get back to you.
cheers,
wolf k.
Yes, all human behaviors like driving cars, playing chess,
using English or French, using a knife and fork, making bricks
and so on are learned but it is the innate abilities you can't
see that make it easy. We have an innate ability to learn a
language. The easy part is it being English or French.
> ... then there's no point in playing with innate modules
> because we will have no clue if any of the innate modules
> will be of use to the generic learning hardware.
I think the 3D world of objects is useful for the generic
learning hardware.
> How, for example, would you wire up a chess program to a
> generic reinforcement trained learning module so that the
> learning module could make use of the code in the chess
> program? The most likely answer is that it can't - that
> we would have to throw away all the chess code because
> it was structured in a way that is absolutely of no use
> to the generic learning system.
Clearly we don't consider a chess program as a basic module
for a generic learning system. It could be used that way if
winning chess had some survival value but this is not why
we play chess. The game itself has no survival value and
our generic learning system can do without it.
> The environment one dog is born into can be totally different
> from the environment his parents were born into.
I don't think it is *totally* different. As for your honey bee
a passage way is a "new" environment but its innate obstacle
avoidance abilities work fine. If you think dogs fit into the
"new" human environment because they can learn I would think
again and ask yourself why we need dog trainers that can read
innate dog body language. Dogs have innate behaviors of a pack
animal which is useful when it comes to living with people.
Cats fit well into a human society but have a different set
of innate behaviours. Unlike dogs cats seem to train humans
rather than the other way around!
> Spiders (as far as I know and I don't know much about them),
> have no ability to learn how to build a different type of
> web in order to adapt to the insects it's trying to catch.
> Whatever behaviors it has for building a web, it's born with,
> and if that style web stops working because the local insects
> are too big for that web, the spider just dies because evolution
> can't re-design the web-building-behaviour fast enough to allow
> it to adjust.
Species can learn just as brain can learn in real time. The
spiders will evolve new behaviors rather than die off.
> I have behaviors for navigating my house, and my yard, and my
> city that don't exist in anyone else. If evolution tried to
> hard-wired those behaviors as innate, they wouldn't work for
> someone living in the next city.
What is hard wired is the ability to make mental maps of whatever
environment you are in. Each city has some important things in
common. They all exist as objects in a 3D world which you
have the innate ability to deal with. Just a few tweaks here and
there to adjust to different spatial arrangements is all that is
required when it comes to navigating a new city.
> The bulk of human behaviour is learned, and not innate.
At the high level, yes, I agree with that.
> And the low level innate stuff that's there, is so low level,
> it's not significant in its effect on changing or helping
> our basic power to learn.
It doesn't so much help our basic power to learn as provide
us with a 3D world of things to learn about. I do not
consider vision to be "low level"; that is why it is so hard
for us to write programs that can "see" yet we have no problem
with "difficult" problems like programs that use calculus.
> Building an innate walking hardware doesn't get us any
> closer to finding the right innate hardware which makes
> learning easier. It only makes learning to walk easier,
> while making learning to dance, nearly impossible.
You seem to see these behaviors as rigid. It is easier to
make modifications to a walking behavior than to produce a
new kind of walking (dancing) from twitching legs. Once
you have a system that can balance while walking you have
the ingredients of a system that can balance while it dances.
> If you think it's important that evolution supply a rich
> set of low level innate functions for learning to work
> with, you HAVE TO SUGGEST what those low level innate
> functions are, and you have to suggest how a learning
> system could possibly make use of them.
Well I don't think the innate functions could be called
low level although if you break them down, as you might
with a computer program, you will end up with some low
level behavior such as a neuron pulse.
I would suggest we have the innate ability to see certain
things such as objects in a 3D world. Do you want me to
explain how a learning system might use such objects in
a 3D space-time framework?
Of course these systems have to be fine tuned and perhaps
depend on patterned input data to develop. Information for
building them would come not only from the DNA but also
from the environment. For example to develop stereo vision
you need stereo data. For this you need two eyes and, of
course, to exist in this 3D world of objects.
> If there's an innate walking function, how would the
> learning system make use of it? How would it learn to
> skip, or jump rope by making use of the "walking" hardware?
> It's one thing to say they are used like we use a library
> of subroutines. But how we, as intelligent talking humans,
> make use of a library of subroutines is of course a whole
> different issue from how some low level learning hardware
> makes use of a "library of subroutines".
And how did you learn to skip? Was it from scratch or did
you start off with twitching legs? Or was it using the "jump"
routine? And was this "program" written by a verbal description?
With practice this kinetic program becomes embodied in the
brain to be triggered whenever required.
> Even after getting a dump of what TD-Gammon learned (the
> neural network weights), we can't make any use of it to
> help us write a better Backgammon program.
Maybe one day we will figure out how to write a computer
program to extract the logic in the weights.
Although the use of a neural net to do multivariate statistics
to collect a set of weights worked well for backgammon, as with
any tool it is good for some things and not for others. Have
you bothered to find out why GOFAI works better with chess?
JC
>> What is "reinforcement learning" and what is "conditioning".
>> Give me an actual machine that does any of that and I will
>> know what you are talking about.
>
>
> You are such a machine.
But we can't at this stage work out how our brain works.
All we have is inputs and outputs and theories of what
might be in the black box. We test these theories by
implementing them as computer programs to see if they
produce the same input/output patterns that someone
might have called "conditioning" in a biological machine.
The advantage of a computer program theory of behavior
is that we can run the program and see what it does.
>>> Every time. But that is an insight that one arrives
>>> at by studying how real learning systems function,
>>> not by waffling on about idealised learning situations.
>>
>>
>> Well an electronic machine is a real system. We can
>> study its behavior just as we can study the behavior
>> of a biological machine but with the advantage of
>> knowing exactly how it works.
>
>
> It's too simple to be of any use.
How do you know that?
> Reinforcement learning _is_ conditioning. The AI engineering
> assumption that they are different is merely wrong.
Are we defining these things in terms of observable behavior?
-----------------------
In a reply to Curt, Wolf wrote:
> Formulas and algorithms are just models. They are no better
> than the base assumptions (== metaphors!) about the phenomena
> being modelled. Think of the difference between Ptolemaic
> and Keplerian models of the solar system.
A computer program exists in its own right not just as a model
of some other system. Yes a program may not be doing it the same
way as a biological machine. We don't even know if the brain of
a frog does vision the way humans do vision without peeking
inside to find out. In terms of behavior we might say the frog
vision is limited. In terms of behavior we can say the computer
vision is limited compared with human vision.
> Computers are still linear machines - they appear to do loopy
> thinking sometimes only because they do several linear tasks
> very fast one right after the other. This is so even for
> multi-core machines. So far. We are not linear machines.
How do you know we are not linear machines? Did you have a
peek inside? The point here is that without looking inside the
machine there is no way you can tell if the input/output was
the result of a parallel or a linear process for they can be
functionally the same in terms of input/output.
JC
Sure, formulas and algorithms are just models. But for the special case of
computers, they model the behavior with 100% accuracy. The algorithm isn't
just a good approximation of what the computer will do, it's an exact
description of what it does (exact to the extent of what it's
predicting).
A model like F=ma, on the other hand, does not exactly describe where a rock
will hit the ground when you throw it, both because it's impossible to
collect perfect data about the starting condition of the rock and because
the model is only an approximation of the behavior of a falling rock.
Computer scientists like Rich Sutton are describing, with 100% accuracy,
how computers behave when they are programmed with various learning
algorithms.
How that machine (a computer running an RL algorithm) is similar to a human,
or a rat, is an area for further study. We can talk about the computer
being a model of human behavior, but that is not what Sutton is working on
(at least not directly). He's working on it indirectly by exploring the
behavior of computers - with the hope that it will lead to a better
understanding of the behavior of humans (I assume).
> I claim that any models of AI that assume goals and values, etc, are
> wrong.
I don't know what you mean by "assume goals and values".
> >> And it gets worse. Consider this sentence:
> >>
> >> "The agent must try a variety of actions and progressively favor those
> >> that appear to be best."
> >>
> >> It is quite unnecessary for the agent to have any opinions about,
> >> or evaluations of, any of its behaviours.
> >
> > That's just not true. In order to actually _implement_ reinforcement
> > learning, such knowledge is key. Reinforcement learning simply doesn't
> > work well without it. It's the only way known to solve the delayed
> > reward problem.
>
> The implementer needs to know, the agent he builds does not.
Ok, well, this becomes a discussion of what "to know" means - and that of
course is a central problem of AI and one which the philosophers have never
resolved either. In theory, we won't know the answer to that until _after_
AI has been fully solved. That is, until after we believe we know all that
is important about how the brain works.
We can ignore that question, and just do what Rich does, which is to study
the behavior of computers and not really claim to know what "to know"
means.
When we debate such subjects, if we want to share our understanding with
others, and if we want to be accepted by others in the field, we are forced
to play a complex game of politics. I don't tend to play such games as
someone who's actually making a living in these fields.
I believe I know what "to know" means, and as such, I use it based on what
I believe to be true, not how I would need to use it to be politically
correct.
I believe the machine actually "knows", and that it's valid to talk like
that.
But humans do have the ability to verbalize some of their knowledge, and it
can be argued (even though I don't agree with such arguments) that
knowledge is limited to what we can verbalize. And the type of knowledge
I'm talking about above is clearly not the ability to verbalize.
I can catch a ball that's thrown to me (sometimes). And in having learned
how to do that, I think it's valid to say that I know how to catch a ball.
In other words, I have knowledge of how to catch a ball. I can't, however,
verbalize that knowledge - I can't describe all the complex changes that
happened to my brain and body as a result of acquiring that bit of
knowledge, nor can I correctly verbalize what I do when I catch the ball. I
just catch it.
We certainly can just project our knowledge onto the machine we are
designing and talk _as_ _if_ the machine had knowledge when we are really
just talking about what we know. And I could claim that is what I believe
in an attempt to be politically correct. But it's not what I believe. I
believe the machine that has accumulated some statistical data on past
experience actually _has_ that knowledge.
> All you
> need in the agent is some method of increasing the odds that a past
> behaviour will be repeated when an environmental factor is
> re-encountered.
I agree. But I also take the position that any such mechanism (however
it's implemented) does in fact create knowledge in the machine. It allows
the machine to know something.
> But to call those methods "knowledge" is IMO stretching
> the metaphor beyond the limits of sense.
I don't consider it a metaphor. I consider it to be real knowledge in the
machine.
The question comes down to what happens in a human that allows us to have
knowledge and why is the process that happens in us not just hardware being
conditioned. Clearly some people strongly believe that something
fundamentally more complex is happening in us and as such, even without
knowing what that "more complex" process is, choose to take the stance that
such simple conditioning is not the same as human knowledge. I don't
however agree with that, and I think simple conditioning is the collection
of knowledge in the machine.
As the creator, I don't even hold such knowledge. As the machine interacts
with its environment _it_ and not I, is the one being conditioned. It's
gaining knowledge of what behavior works best, not I. I understand how it
collects knowledge, but it is the one collecting that knowledge, not I. It
has the knowledge, not me.
> The implementer must decide
> which such encounters to keep track of, for example, in order to
> increment a counter whose value is used by the algorithm that computes
> the next behaviour.
Yes, as the creator, my design decisions limit what sort of knowledge it
can and will collect. But the knowledge collected is held by the machine,
not by me. It "knows" the value of walking, not I. Just like TD-gammon
has knowledge of the value of a given move in a given board position which
the creator does not have.
> The implementer must also build in some method of
> keeping track of delayed feedbacks, else they cannot be "rewards." Etc.
> But that's architecture, not "knowledge." ("Knowledge" is too
> anthropomorphic for my taste, hence the scare quotes.)
Yes, the creator has knowledge of the machine's architecture, but the
machine collects the knowledge (and likely has no knowledge of its own
architecture). That knowledge, in RL terms, is the value array it computes
over time from its experience in interacting with the environment.
An RL machine accumulates data from experiments. Every behavior it produces
is an experiment, and the results of these experiments are recorded in how
it updates its internal variables. That accumulated data (the current
values of all its adjustable variables) is the machine's knowledge.
In any other field but AI, the argument that the description is too
anthropomorphic is fine. But here, in AI, it's our job to define what
constitutes machine knowledge. To claim you don't know just indicates you
haven't solved AI.
Though I could easily be proved wrong one day, I do have strong opinions
about what knowledge is in humans, and in machines.
> > This goes directly to what I just wrote to John in a previous post
> > minutes ago. I wrote something to the effect that before the system
> > can learn to walk, it must first learn to recognize the _value_ of
> > walking.
>
> "Value" is your abstraction, not the agent's.
Well, again, even though value is my abstraction, I believe the agent does
know the value.
"Hand" is my abstraction as well but would you argue you don't have a hand
just because "hand" is my abstraction? Or would you argue that a wheel is
not round because round is my abstraction? Yes, the abstraction is mine,
but the property that the abstraction labels is in the hand, or the wheel, or
in the machine that has the power to recognize value.
> > Now this might seem odd, because it might be hard to grasp how a
> > learning agent who has never walked, can have any comprehension of its
> > value. But they do, and that's exactly how it can learn such a complex
> > behaviour so quickly.
>
> I assume you're referring to calves, fawns, foals, etc.
Well, I was thinking of robots, but it applies to animals in the same way.
> among other
> things. They do not, I think, "recognise the value of walking." They
> just do it
Well again, this is just more of the same problem. We use words such as
"recognize value" to describe something humans can do. What is happening
in a human when they recognize value? How do we know if a machine is
duplicating the same sort of process? Is it ever valid to use the words
"recognize value" when talking about a machine other than a biological
human, or does social word usage convention prohibit the application of
such words to anything other than humans?
More typically, when we say a human has "recognized value", it is a process
that happens at the level of language behavior - such as when someone
recognizes the value of the 50% off sale by reading an advertisement and by
potentially talking to themselves about the deal by saying "gee, that's a
good deal, maybe I should go buy that". At such a level of recognition, they
would be able to verbalize the value they had recognized. So, just like
with knowledge, we could attempt to argue that recognizing value happens
only when language behavior emerges from a human to signify the
recognition. But as with knowledge, I don't agree it starts, or ends, there.
We can also simply observe the behavior of a human and make just as strong
an argument about the human's recognition of value by studying their non
verbal actions. When a human walks up to a table with lots of food on it,
which item do they pick? We can label the item they pick as the item with
the most value to the human. There may be no verbalization (internally or
externally) associated with how the decision was arrived at, but yet, the
human "just did it". They recognized and responded to the value without
any high level verbal recognition of the value.
If we are a cave man walking through the woods we might stop and pick up a
rock, and carry it back to our cave. The cave man then later uses that
rock to break bones open to eat the marrow, or to trade for something else
with another cave man. In these sorts of actions we say the cave man
recognized the value of the rock when he picked it up. And yet this cave
man might have no way to verbalize and explain his actions, and no
understanding of the abstraction of value. He simply did it. I argue the
cave man recognized the value of the rock, even though he didn't understand
it at a level that would allow him to verbalize the rock's value by saying
words like "I got the rock because I liked the look of it, and thought it
might be useful, or thought others might like it which means I could trade
them for things I wanted". Even without that language ability, I think it's
valid and accurate to say the cave man recognized the value in the rock.
The value in the rock was in its power to create future rewards for the
cave man. All value translates back to rewards in my view. It's what
value is and where value comes from.
A robot we build that is able to make use of an object to obtain future
rewards I would claim has the power to recognize the value in the object.
If for example, a robot is in a room with red and blue balls, and if it
picks up a red ball and drops it in a box, the robot receives a reward. If
such a robot is able to learn to pick up only the red balls, and not the
blue balls, I would say it's correct and valid to declare that the robot
has recognized the value of red balls. Such a robot, I believe, is well
within our understanding to build today, and if someone built it, I would
declare that machine to have the power to recognize value.
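Just to make that concrete, here's a toy sketch of such a robot's learning
loop. Everything in it (the class, the reward values, the learning rate) is
invented for illustration - it's not a description of any real robot.

import random

class BallSortingRobot:
    """Toy robot that learns which colour of ball is worth picking up."""

    def __init__(self, colours, step_size=0.2, explore=0.1):
        self.value = {c: 0.0 for c in colours}   # starts with no knowledge
        self.step_size = step_size
        self.explore = explore

    def pick(self):
        # Occasionally experiment; otherwise favour the colour that looks best.
        if random.random() < self.explore:
            return random.choice(list(self.value))
        return max(self.value, key=self.value.get)

    def learn(self, colour, reward):
        # Nudge the stored value toward the reward actually received.
        self.value[colour] += self.step_size * (reward - self.value[colour])

# Invented environment: dropping a red ball in the box pays 1, blue pays 0.
def drop_in_box(colour):
    return 1.0 if colour == "red" else 0.0

robot = BallSortingRobot(["red", "blue"])
for _ in range(500):
    colour = robot.pick()
    robot.learn(colour, drop_in_box(colour))

print(robot.value)   # red ends up near 1.0, blue near 0.0

After a few hundred trials the number it stores for "red" is its
recognition of the value of red balls - knowledge held by the machine, not
by whoever wrote the loop.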
> -- and the most important fact about their learning to walk
> in the first few hours after birth is that it's impossible to stop them
> from learning how to do it.
Well, you can stop them by not letting their feet touch the ground I would
assume. If that doesn't stop them, then they are not actually learning to
walk, we are instead just seeing their control system finish developing.
> As long as there's room for them to move,
> the nervous system makes the connections needed to co-ordinate their
> elemental behaviours (flexing legs, tensing/relaxing torso muscles, etc)
> into the macro-behaviour we call walking.
>
> FWIW, I think building an artificial calf that learns to walk in a few
> hours after being switched on would be a major achievement.
>
> > It's all about the secondary rewards. About recognize something as
> > "good" even though that something isn't a primary reward or doesn't
> > directly produce a primary reward from the environment. It's all about
> > predicting future rewards - about recognizing that something in the
> > environment is a predictor, of a future reward.
>
> I'd like to see a suite of experiments that proves that a newborn horse
> can recognise future rewards. I've watched quite a few of them. They
> just like to walk and run. A behaviour that persists in the adult horse,
> which is why we can train them to be race horses.
If you can train them by operant conditioning then that is proof of their
power to recognize future rewards. The limit of how far out into the
future they can make accurate predictions of rewards is just the limit of
the strength of their learning hardware. It might be limited to 5 seconds
in a horse for all I know. But 5 seconds in the future is still 5 seconds
in the future.
If, for example, you show them an apple and they don't walk over and eat
it, then that shows they don't yet know the value of an apple. But if you
feed them apples, and after that, they start to walk over to you and eat
the apple out of your hand, that shows they understand future rewards.
They understand that the "walking towards the apple" behavior is likely to
produce a future reward.
> > Walking is good, because it helps us get more rewards. That's why
> > walking is valuable to us.
>
> It's obvious you haven't studied babies learning to walk.
>
> > But if the only way the learning system could recognize
> > that value, was by first walking over to the food and eating it to
> > produce a real reward, the learning process would take a billion years
> > becuase it would be like waiting for monkeys to type out Shakespeare.
> > It would be a billion years before the agent just happened to walk over
> > to the food, and then grab it and put it in it's mouth and swallow -
> > producing the final "real" reward to let, after a billion years, the
> > agent finally get one reward to indicate that walking might be good.
>
> Well, this thought experiment is a bit off IMO. If the system is capable
> of walking at all, it will very quickly bump into things it likes or
> needs, such as food. "Rewards", IOW, are inevitable.
Well, that's just the point. Humans aren't "capable of walking" in the
sense that they must first learn how to do it. Walking is not a simple
"move forward" behavior. It's a very complex sequence of actions combined
with a complex dynamic balancing process. We understand the complexity of
this when we see it takes millions of dollars of engineering to make a
machine walk on two legs poorly. It requires a very complex set of
internal circuits to make it happen - a set of circuits that don't just
magically show up by random chance in a few hours, months, or even years.
If you build a 2-legged robot and lay it down on the ground, and program it
to explore random behaviors, just how long do you think you will have to
wait before it gets up, walks over to the other side of the room, and pushes
the "reward" button located there? Obviously, a very long time - maybe
even millions of years.
But when guided by reinforcement learning, such a thing might be learned
far quicker.
> OTOH, if the system cannot walk at all, then you have to posit some
> intermediate stages between immobility and walking.
And you have to posit why the machine advances through those stages so
quickly.
> There are such
> intermediate stages, many of them, and it did take millions of years to
> evolve them.
Well, it took millions of years to evolve hardware that both had the power
to walk, and had the control system to allow it to do it. But that's all
assumed.
When it comes to learning, if we are talking about learning to walk, we
assume it's got legs with enough power to perform the walking action, and
some array of sensors to make it possible for the machine to be configured
into the required control circuits to make walking happen. We are then
just talking about how long it takes for the learning machine to configure
itself into the correct circuit, and why it would happen in weeks, or
months, instead of in millions of years.
> But not because a worm way back then "recognised the value
> of walking." It just wiggled, and bumped into food, or a possible mate.
> (It's actually more complex, since the worm also sensed chemicals
> dissolved out of food, etc.) Those that wiggled better got more food, so
> whatever it was in their architecture that enabled better wiggling was
> passed on to their offspring. But worms didn't evaluate their behaviour.
> They just did it.
Yes, but of course you are now talking about the process of DNA based
evolution which as I've argued before, I claim is also a reinforcement
learning machine. And I also make the claim that it does recognize the
value of behaviors.
The learning machine however is not the individual worm in this example.
It's the entire worm species - that is the collection of all worms alive at
any point in time acting as one large learning machine. When a new worm is
created which includes a new type of behavior (determined by innate
genetics), that worm is a test of the value of that behavior. The more
successful the behavior is in helping the species survive, the more likely
that worm is to reproduce, and the larger the percentage of worms in the
current population that can be expected to make use of the behavior. The
percentage of worms in the population with the genetic trait is the
machine's mechanical tracking of the trait's value, and it is the machine's
recognition of the value of that trait. The genes with the most value are
the ones with the largest share of the worm gene pool.
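Here's a toy simulation of that idea. The survival numbers and the fixed
population size are made up; it's only meant to show the analogy, not to
model real worms.

import random

# Toy model: each worm carries one of two innate wiggling behaviours. Worms
# whose behaviour works better are more likely to survive to reproduce, so
# that gene's share of the population grows - the species' running estimate
# of the behaviour's value.
SURVIVAL = {"wiggle_a": 0.5, "wiggle_b": 0.7}   # invented survival odds

population = ["wiggle_a"] * 900 + ["wiggle_b"] * 100

for generation in range(50):
    survivors = [gene for gene in population if random.random() < SURVIVAL[gene]]
    # Survivors reproduce to refill the population for the next generation.
    population = [random.choice(survivors) for _ in range(1000)]

print(population.count("wiggle_b") / len(population))   # climbs toward 1.0

The rising fraction of the population carrying the better gene plays the
same role a value estimate plays in a real-time learner.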
> > Learning works faster, because the agent learns to recognize elements
> > of the environment which are predictors of future rewards - it learns
> > to recognize secondary reinforcers.
>
> Actually, learning often works slower, as anyone who has tried to master
> a new skill will tell you.
I was talking about how learning with the help of secondary reinforcers
happens much faster than learning without the help of secondary reinforcers.
> Anyhow, what you are describing is operant conditioning.
Yes, I think it is. I claim that operant conditioning and reinforcement
learning are the same thing.
> I think you should meditate on the odd fact that walking is learned by a
> baby in very short time. In a few weeks, the baby progresses from
> sitting down after every step or two to running.
But it takes a year of learning to use its legs in general, and learning to
use the legs to roll over, and to sit up, and to crawl before it gets to
that week where it stands and walks on two feet. Babies don't learn to
walk in two weeks; it's a 12-month learning process. Still, none of that
entire year-long learning process would have happened in anything less than
a few million years if it didn't have the help of a good reward prediction
system creating secondary reinforcers to guide that learning.
> It masters the skill of
> waving a stick at something and hitting it in a week or so.
Again, after a year of learning to first use its eyes, and head, and
hands, and arms, and legs, etc.
To simply claim the week before the year-long learning process ends is
where the learning "started" is just silly (but typical of how parents
might think).
> By contrast,
> it takes months and years to master that combination of walking and
> stick waving we call golf. And the frustrations of doing that are not
> exactly "rewards." ;-)
Well, golf is never really mastered because there's no end goal. :) I'm
pretty sure if you ask Tiger if he thinks he's mastered the game of golf he
would say no. :)
> > After having food given to it many times, it learns to recognize the
> > sight of food as a predictor of a future reward. It learns to
> > recognize the sight of food getting close to the mouth, as a secondary
> > reward. In other words, it learns the _value_ of making the food come
> > close to us as being valuable.
>
> Once again: the _value_ is something you have abstracted from the
> situation. The agent does not need to evaluate anything. It just need to
> have a) responses; and b) responses that change it architecture.
Yes, but a response that changes its architecture in some direction is a
recognition of value. It's the definition of value in my view. I use the
term "directed change" to describe change that has a direction (as opposed
to random change, which has no clear direction or purpose or goal in how it
changes).
> And once again, you are describing operant conditioning (which, please
> note, requires that at least two responses occur when a stimulus is
> presented.)
Which two are you talking about?
There's the external stimulus, the response by the agent, the response by
the environment, the response of how the agent changes in response to how
the environment changed, the second external stimulus, and the second (now
potentially slightly different) response by the agent. And that of course
makes it sound like there's a clear delineation between the start and stop
of a stimulus or a response which is not really the case at all when you
get down to implementation details. It's only the case when an experiment
is structured so as to force the clear delineations.
> > When it takes a single step towards the food, it recognizes the value
> > of that one simple behavior, as helping to get the food closer. That
> > acts as a secondary reinforcer which helps to reward that "step towards
> > the food" behavior. In other words, the agent already learned the
> > recognize the value of walking, in the fact that it was able to
> > recognize the _result_ it produced - the result of getting the agent
> > closer to the food.
>
> Babies don't take steps because they recognise value. They take steps
> because it feels good to do so.
What you call "feels good" I call "recognize value". Same idea, just
different words.
> The fact that a grown up makes smiley
> faces and cooing noises just increases the feel-good feedback. IOW, it's
> operant conditioning (and it happens very fast because a baby is a
> system optimised for learning to walk.)
Yes, I agree that babies no doubt are optimized for learning to walk and
that speeds up the process. But more importantly, they are optimized for
strong learning in general by the inclusion of very strong and effective
future reward _prediction_ hardware.
> > This is a simple example of trying behaviors, and favoring the one that
> > appears best.
>
> As you describe it, it's not simple at all. It is a very adult human,
> conscious, and top-down method of solving a problem: define it,
> hypothesise possible solutions, try them, and evaluate them; repeat with
> refinements of the best solutions; etc. It's a very linear process, and
> that's one of the reasons I don't think it's an accurate model of how
> most real systems learn. Using it as a model for computer learning is
> IMO not useful.
Well, how I describe it might seem "linear". How it works in the type of
machines I'm looking at is not linear at all. It's a highly parallel,
real-time, continuous temporal process. It's linear only in the fact that
it's forced to happen over a period of time (aka time is linear and we
can't escape that).
> It's an engineering approach, IOW. One that you have learned over many
> years of training and practice. (Me too). And what kept you going was
> not the eventual payoff (although you no doubt told yourself that from
> time to time). It was the fact that you are built to enjoy problem
> solving. (Me too.)
Yes, that's true and accurate I would say. But I don't think that "joy of
problem solving" is all that innate in me. I think it was learned by the
fact that the activity tended to reap higher rewards for me. I was simply
better at problem solving than many others, so it produced surprise and
respect and attention in the people around me (like my parents, and
teachers, and friends).
That "joy of problem solving" I believe is mostly a _learned_ secondary
reinforcer. I learned it was a good way to get attention and I became
addicted to that attention (which itself was more learned secondary
reinforcers).
How much can be attributed to innate features and how much was learned is
never easy to figure out, but there's clearly a strong secondary reinforcer
learning system at work guiding our learning. The brain includes a very
strong, and very powerful, future reward prediction system, which is the
primary source of all our conditioning.
And I'll throw this in just to show all this talk is not just hand waving.
Look at formula 6.2 on this page of Sutton's book:
http://www.cs.ualberta.ca/~sutton/book/ebook/node61.html
It shows how the agent's current understanding of the value of a state is
updated in response to each action it performs. The update works by
adjusting the current estimated value of the state, V(S(t)), towards the
current estimated "target", which is r(t+1) + y*V(S(t+1)), where y is the
discount factor:
   V(S(t)) <- V(S(t)) + a*[ r(t+1) + y*V(S(t+1)) - V(S(t)) ]
(a being the learning step size).
In that formula, r(t+1) is the current "real" reward received at that point
in time, and V(S(t+1)) is the system's current estimate of all future
rewards. The system is learning not just from the current real reward, but
from the sum of the real reward and its current best estimate of all
future rewards.
The V() array is in fact the system's current understanding of value - it's
the system's current understanding of how the current state of the
environment acts as an estimator of future rewards.
Or, from another perspective, r(t+1) is the real reward, and y*V(S(t+1)) is
the secondary reinforcer component of the reward used for learning.
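And here's that same update spelled out as a runnable toy. The little
corridor world, the step size, and the discount are my own invented
stand-ins; only the update line is Sutton's formula 6.2.

import random

# Tiny corridor world: states 0..4, and reaching state 4 delivers the only
# real reward. The world, step size, and discount are all invented.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA = 0.1, 0.9

V = [0.0] * N_STATES           # the system's current understanding of value

for episode in range(3000):
    s = 0
    while s != GOAL:
        # A sloppy walk that usually moves forward, sometimes slips back.
        s_next = min(s + 1, GOAL) if random.random() < 0.8 else max(s - 1, 0)
        r = 1.0 if s_next == GOAL else 0.0
        # Formula 6.2: nudge V(s) toward the target r + gamma * V(s_next).
        # r is the real reward; gamma * V(s_next) is the secondary reinforcer.
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V])   # value grows as the state gets nearer the goal

Only the very last step ever pays a real reward, yet the states leading up
to it end up with high values - that's the prediction of future reward
doing the work.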
> The activity is its own reward - it is self
> reinforcing.
Yes, but the system must LEARN the value of the action. It's not innate.
But that's why learning is so effective in shaping highly complex
behaviors in us. It's because the brain has the power (like these
algorithms described in Sutton's book) to learn how to estimate the
probability of receiving future rewards based on the current state of the
environment or based on the current action performed (action and state are
nearly interchangeable concepts in the domain of RL).
> IMO, that's a major fact about "reinforcement learning." An
> AI enterprise that ignores it will fail.
Which is why it's not ignored and why it's such a central feature of nearly
EVERY formula in Sutton's book. Though he might never use the words "the
action is its own reward", that is exactly what he's talking about.
Let me just make clear, if it's not already, how this translates to
something like a simple board game. These algorithms will try to estimate
the value of a board position - which is the value of the state of the
environment for this domain. The question at hand is whether a given board
position is a predictor of a future reward (winning the game) or a
predictor of lack of reward (losing the game).
In the case of TD-Gammon, the value array was implemented not as a large
array with one element for every possible board position (the game of
Backgammon has too many board positions to make that practical for today's
computers), but as a function in the form of a neural network. Nonetheless,
the function produced a value from 0 to 1 which represented the
probability of the computer winning the game from that board position.
When the program made a move, if it led to a board position with a high
expected probability of winning, then that move was "rewarded". Such an
action was seen as "good" by the program. The move was its own reward,
because the move produced something the program _instantly_ recognized as
"good".
Now in the case of this game, the program can predict with 100% accuracy
how the environment will change before it makes the move. So it knows even
before it makes the move how "enjoyable" the result will be. In a more
complex and real example, the agent can't know for sure how the environment
will change ahead of time, so it has to select an action, and then wait to
see how much "joy" it got out of it once it finds out how the environment
has changed in response to the action.
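That selection step is easy to sketch. This is not TD-Gammon's actual
code: value_of() stands in for its neural network, and legal_moves() and
apply_move() are hypothetical helpers for the game logic.

def choose_move(board, legal_moves, apply_move, value_of):
    # value_of(position) stands in for TD-Gammon's neural network: an
    # estimated probability (0..1) of winning from that position.
    best_move, best_value = None, -1.0
    for move in legal_moves(board):
        v = value_of(apply_move(board, move))   # instant evaluation of the result
        if v > best_value:
            best_move, best_value = move, v
    return best_move, best_value

# Throw-away demo with made-up stand-ins for the real game logic:
demo_moves = lambda board: ["a", "b", "c"]
demo_apply = lambda board, move: board + move
demo_value = lambda position: {"a": 0.3, "b": 0.7, "c": 0.5}[position[-1]]
print(choose_move("", demo_moves, demo_apply, demo_value))   # ('b', 0.7)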
Everything that we can sense, becomes part of that secondary reinforcer
prediction of future rewards. Not only do we sense how the environment
changes in response to our actions, we sense things like how our arms move
in response to the commands from the brain. When we see that our arms have
picked up the trash and correctly placed the trash in the trash can, we get
a joy out of this whole process - out of seeing our arms move as we wanted
them to, and in reaching the final state of having the trash relocated from
the floor to the trash can. So everything we are able to sense about the
entire process acts as a secondary reinforcer to reward, and strengthen,
our actions. Everything about the activity was the reward for our action
(assuming it all produced good things - which of course it doesn't always
do).
> Keep in mind that most people do not enjoy problem solving the way an
> engineer does - that's why most people are not engineers, nor aspire to
> be.
Yes, that's very true.
> OTOH, artists also solve problems, and engineers, significantly
> enough, very rarely like to do what artists do. Yet both are problem
> solvers. They solve problems in different ways. But they both engage in
> self-reinforcing behaviours.
Yes, we all develop our own complex system of secondary reinforcers.
I don't paint because my efforts at painting in the past didn't produce the
sort of rewards that my efforts at mechanical problem solving did. My
system of secondary reinforcers - my brain's prediction of what is "good"
for me, will be very different than what other people's brains predict is
"good" for them.
> What is a self-reinforcing behaviour? It's a loop. The agent gets
> feedback from its own behaviour, not just from the environment. That
> feedback is a reinforcing signal, to use your terminology. But note that
> this feedback is not one of "value in the future" (to paraphrase what I
> think you mean by "recognising value.")
Oh, it's very much a feedback of estimated future rewards. The value we
recognize is in effect all about future rewards. Nothing is immediate as
John would like to believe it is by oversimplifying the idea of immediate.
It's all an estimate of expected future rewards or "returns" as per the
language of reinforcement learning...
http://www.cs.ualberta.ca/~sutton/book/ebook/node30.html
> > This notion of trying behaviors and favoring the ones that appear best
> > is, as I said, translated into precise algorithms in the rest of the
> > book, so even if it sounds like hand waving to you on that page, it's
> > not in the least bit hand waving by the time you finish the book.
>
> Ptolemaic epicycles.... Fun, and even predictive, within the error range
> of observations at the time. (Did you know that observationally there
> was initially no discernible difference between Ptolemaic and Keplerian
> predictions?
No, I've not studied that history.
> Tycho Brahe was able to refine his observations to the
> point where it was possible to argue, but not prove, that Kepler's model
> was more accurate. That more precise observations that put Ptolemy's
> model to rest came later.) IOW, this approach will result in useful
> machines, but will not IMO solve the problem of artificial learning
> (which it seems is a synonym for artificial intelligence.)
Well, I argue that learning is intelligence and intelligence is learning,
but not everyone supports that view.
Machine learning, however, I don't consider to be artificial. It's
artificial if you claim it to be a model of human learning, but I just
claim it to be intelligence. Just like a machine that walks on two legs is
not artificial walking, it's just walking.
The question at hand, which we have not answered, is how close to human
behavior we can get in a machine by implementing a reinforcement learning
algorithm. I believe we will get so close to human behavior from the
machine that none of these points will be debated in the future. People
will simply understand the human brain to be a reinforcement learning
machine controlling our actions.
But if I'm wrong, and it takes a lot of specialized modules which simply
include some elements of learning (maybe many different elements of
learning) along with lots of other specialized features (as people like
John and Dan argue), then no one will consider the brain to be _just_ a
reinforcement learning machine just as no one considers a plane to be
_just_ an airfoil. The airfoil that provides lift is just one of many
elements that make up a working airplane.
> >> The authors state in their introduction: "Rather than directly
> >> theorizing about how people or animals learn, we explore idealized
> >> learning situations and evaluate the effectiveness of various learning
> >> methods." Anyone who believes they can construct idealised learning
> >> situations without theorizing about how it's actually done by humans
> >> and other animals is not likely to produce much of value.
> >
> > Note that the Sutton book is about computer learning algorithms and not
> > about humans or animals. When he uses the term "reinforcement
> > learning" he is NOT making direct reference to human learning, he's
> > talking about the very specific field of computer research into a class
> > of _machine_ learning algorithms which are researched by people such as
> > himself.
>
> I'm aware of that. The enterprise hasn't gotten very far, though. The
> best it's been able to do is make smart washing machines and cunning
> digital cameras. That is certainly AI, but very limited. Mostly, I don't
> see any obvious way to generalise these machines.
Yeah, that's the rub. What works today seems to still be 100 miles away
from human behavior. I think it's actually extremely close, but most won't
understand how close we were until we look back in retrospect after it's
been solved.
> OTOH, it's quite possible (even likely IMO) that these limited machines
> will turn out to be components of generalised learning machines of the
> type you seem to be pursuing.
My very strong belief is that strong generalized learning machines will show
obvious (and shocking to most) levels of intelligence once they are
implemented correctly. I believe once these machines are created, it will
be a sudden, and highly shocking (to most) eye opening experience for
society in general.
It's much like trying to duplicate an encryption algorithm. Close doesn't
count. You can have 99.9% of the encryption algorithm correct, and it will
still be producing the wrong answer just as often as a random number
generator does. I think this is the same case with poorly implemented
reinforcement learning algorithms. We are very close, but they still look
like they have about as much intelligence as a random number generator.
Only time will tell if I'm right.
> > How close this class of computer algorithms is to what the brain does,
> > he makes no speculation about in the book as far as I know. What you
> > quote above is him making it clear he's not attempting to research or
> > describe human learning in the book.
> >
> > I'm the only one here making the bold hand-waving claim that human
> > intelligence _IS_ a reinforcement learning (in the computer science
> > sense) process. It's my claim, not Sutton's, that got John to look at
> > that book, and then quote it in _our_ context of debating the human
> > brain (because it's the foundation of _my_ debate, not his, and not
> > Sutton's).
>
> I think you are half right. We are "reinforcement learning machines",
> but not in the computer sense. Computers are still linear machines -
> they appear to do loopy thinking sometimes only because they do several
> linear task very fast one right after the other. This is so even for
> multi-core machines. So far. We are not linear machines.
Well, I agree completely in terms of how most of our software systems are
written. But I don't believe that has anything to do with computers in
general - only in how we tend to structure and write our software.
The type of programs I play with have none of the "linear behavior" you are
thinking of. Though the code is, at the low level very linear - as it must
be per the design of the computer, it creates a simulation of a higher
level process (parallel signal processing network) which isn't linear in
the least.
> > You might have been explicitly talking about operant conditioning in
> > humans or the like when you asked john about reinforcement learning,
> > but what he quoted you was not a book on human behaviour or human
> > learning, but a book about computer algorithms.
> >
> > I don't actually know what Rich Sutton thinks about the connection
> > between RL algorithms and human intelligence. His work is in AI, but
> > whether he thinks RL research will explain full human intelligence as I
> > do, or not, I just don't know.
> >
>
> I'll be reading more of the book.
>
> I'm mulling over a block diagram of an operant-conditioning capable
> machine. If it ever gets to the point where I think it may work, I'll
> get back to you.
Sure, sounds like fun.
My block diagram is a pulse sorting network trained by reinforcement with
global feedback (and local feedback inside the pulse sorting network).
Though it's implemented as a highly linear system at the lowest level of
sorting pulses one at a time, it's doing this pulse sorting a million times
a second, and each of these pulse sorting paths through the network
constitutes a few hundred or more decisions the system has made. Each node
in the network is in effect its own parallel learning process so a
simulated network with a billion nodes (well within the reach of our
current computer technology) is actually simulating the behavior of a
billion parallel independent and communicating learning processes. Each
node is its own micro reinforcement learning machine, and the behavior of
the entire network is the combined behavior of the society of all these
micro machines working together in parallel in real time.
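Since that's all words, here's a deliberately crude toy of the flavor of
it. This is not my actual implementation - every number and name below is
made up - it only shows many small nodes each adjusting itself from one
global reward signal.

import random

class Node:
    # One micro learning machine: it routes each pulse "left" or "right" and
    # adjusts its preference using only the global reward signal.
    def __init__(self):
        self.pref = 0.5               # probability of routing a pulse "right"
        self.last = None

    def route(self):
        self.last = "right" if random.random() < self.pref else "left"
        return self.last

    def reinforce(self, advantage, step=0.1):
        # Move toward the last choice when the global reward was better than
        # usual, and away from it when it was worse.
        target = 1.0 if self.last == "right" else 0.0
        self.pref += step * advantage * (target - self.pref)
        self.pref = min(max(self.pref, 0.01), 0.99)

# Arbitrary global goal for the demo: pulses should mostly end up "right".
nodes = [Node() for _ in range(30)]
baseline = 0.5                        # running estimate of the typical reward
for _ in range(3000):
    choices = [n.route() for n in nodes]
    reward = choices.count("right") / len(choices)   # one global feedback signal
    advantage = reward - baseline
    baseline += 0.05 * (reward - baseline)
    for n in nodes:
        n.reinforce(advantage)

print(sum(n.pref for n in nodes) / len(nodes))       # climbs well above 0.5

The real design sorts pulses along long paths through the network and uses
local feedback as well, but the point is the same: no node knows the goal,
yet the society of micro learners drifts toward behavior that raises the
global reward.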
It's really nothing like how traditional software is structured (but very
much like how many neural network programs are structured). It produces
behavior very much unlike what we normally see on our computers.
So you posit there is stuff there that "we can't see" which explains the
stuff we can see? But yet, you don't have any specific guess as to what
that stuff is other than innate functional hardware designed by evolution?
That's not really saying much at all.
> > ... then there's no point in playing with innate modules
> > because we will have no clue if any of the innate modules
> > will be of use to the generic learning hardware.
>
> I think the 3D world of objects is useful for the generic
> learning hardware.
Sure, and at that level it sounds useful. But just how does raw data get
translated into "3D world of objects" and how might such a thing be
represented in the hardware? Unless you have some sort of specifics about
what that means in hardware terms, you aren't really saying anything that
isn't obvious from just knowing we are humans that model reality as a 3D
world of objects.
> > How, for example, would you wire up a chess program to a
> > generic reinforcement trained learning module so that the
> > learning module could make use of the code in the chess
> > program? The most likely answer is that it can't - that
> > we would have to throw away all the chess code because
> > it was structured in a way that is absolutely of no use
> > to the generic learning system.
>
> Clearly we don't consider a chess program as a basic module
> for a generic learning system. It could be used that way if
> winning chess had some survival value but this is not why
> we play chess. The game itself has no survival value and
> our generic learning system can do without it.
>
> > The environment one dog is born into can be totally different
> > from the environment his parents were born into.
>
> I don't think it is *totally* different.
That's true. "totally" is too far of a stretch. It's better said as
"contains important major differences".
> As for your honey bee
> a passage way is a "new" environment but its innate obstacle
> avoidance abilities work fine. If you think dogs fit into the
> "new" human environment because they can learn I would think
> again and ask yourself why we need dog trainers that can read
> innate dog body language. Dogs have innate behaviors of a pack
> animal which is useful when it comes to living with people.
> Cats fit well into a human society but have a different set
> of innate behaviours. Unlike dogs cats seem to train humans
> rather than the other way around!
:)
I do actually believe that dogs and cats have far more innate behavior and
far less learned behavior than humans.
> > Spiders (as far as I know and I don't know much about them),
> > have no ability to learn how to build a different type of
> > web in order to adapt to the insects it's trying to catch.
> > Whatever behaviors it has for building a web, it's born with,
> > and if that style web stops working because the local insects
> > are too big for that web, the spider just dies because evolution
> > can't re-design the web-building-behaviour fast enough to allow
> > it to adjust.
>
> Species can learn just as brain can learn in real time. The
> spiders will evolve new behaviors rather than die off.
Which is why I keep talking about a species as a reinforcement learning
machine. It learns using the same type of process the brain uses - it's
just much slower. It's also why I argue that evolution IS an example of
intelligence in action and why I keep saying the ID guys are in fact right
about life being created by an act of intelligent design. They just happen
to fail to understand that evolution is the intelligence they are looking
to understand.
> > I have behaviors for navigating my house, and my yard, and my
> > city that don't exist in anyone else. If evolution tried to
> > hard-wired those behaviors as innate, they wouldn't work for
> > someone living in the next city.
>
> What is hard wired is the ability to make mental maps of whatever
> environment you are in. Each city has some important things in
> common. They both exist as objects in a 3D world for which you
> have the innate ability to deal with. Just a few tweaks here and
> there to adjust to different spatial arrangements is all that is
> required when it comes to navigating a new city.
Ok, if you believe that to be true, can you suggest what sort of hardware
implements this hardwired function? That is, other than just saying it IS
hardwired, what would the hardware be doing to implement this hardwired
function? How would such a module work? And how would it become "tweaked"
to deal with the part that was learned?
> > The bulk of human behaviour is learned, and not innate.
>
> At the high level, yes, I agree with that.
>
> > And the low level innate stuff that's there, is so low level,
> > it's not significant in it's effect on changing or helping
> > our basic power to learn.
>
> It doesn't so much help our basic power to learn as rather it
> provides us with a 3D world of things to learn about. I do not
> consider vision to be "low level" that is why it is so hard
> for us to write programs that can "see" but have no problem
> with "difficult" problems like programs that use calculus.
When we refactor a software program or a hardware design, we often find
functions that once happened at the high level now happen at the low level,
and functions that were low level, are now high level. The level at which
a feature exists is very tentative and totally subject to interpretation.
> > Building an innate walking hardware doesn't get us any
> > closer to finding the right innate hardware which makes
> > learning easier. It only makes learning to walk easier,
> > while making learning to dance, nearly impossible.
>
> You seem to see these behaviors as rigid? It is easier to
> make modification to a walking behavior than to produce a
> new kind of walking (dancing) from twitching legs. Once
> you have a system that can balance while walking you have
> the ingredients of a system that can balance while it dances.
Yes, I agree with that. But that's true even if the walking had to be
learned first instead of innate. Once it learns to walk, some basic
dancing is not far away. After some basic dancing is learned, more
advanced dancing is not far away, on and on, 200 levels of dance learning
later, we have a seasoned professional dancer. Having the first step
(walking) out of 200 as an innate feature is not going to make learning
those last 199 behaviors any easier. It simply removes one out of 200
steps from the requirement of being learned.
Making such a thing innate has its survival advantage (we can walk
sooner), so such things MIGHT be innate in humans, but it's not relevant to
solving AI, because it's not the ability to have an innate feature
that makes us intelligent - it's the ability to learn those 199 other
behaviors on top of whatever you start with (and on top of the 198 other
learned dance behaviors) that makes us intelligent.
You keep talking as if these innate features give the learning system a huge
boost up, but the best you can do to describe these features is to say we
can't see them (but they are there and making learning work).
To create a machine that can walk, we need a circuit with control over the
muscles we use for walking and balance, and we need feedback from various
sensors like the balance sensors and I think there might be some feedback
straight from the muscles that indicates how much tension is on them
(something new I just heard about). What I can believe we have as innate
"help" is not only the full package of useful sensors, but also to have
them potentially pre-wired to a specific part of the general learning
material so that the learning hardware has all the data it needs to build
the "walking circuit" without having to find and make odd connections all
the way across different parts of the brain. In other words, we could have
an innate part of our generic learning brain pre-allocated for the purpose
of controlling walking and balance so that, with normal development, that
same module would end up configuring itself into the required walking
circuit very quickly. That type of approach is how the brain can have a
strong innate ability to learn to walk, while at the same time be using
highly generic learning hardware to build the "walking" module.
> > If you think it's important that evolution supply a rich
> > set of low level innate functions for learning to work
> > with, you HAVE TO SUGGEST what those low level innate
> > functions are, and you have to suggest how a learning
> > system could possibly make use of them.
>
> Well I don't think the innate functions could be called
> low level although if you break them down, as you might
> with a computer program, you will end up with some low
> level behavior such as a neuron pulse.
>
> I would suggest we have the innate ability to see certain
> things such as objects in a 3D world. Do you want me to
> explain how a learning system might use such objects in
> a 3D space-time framework?
I want you to explain how such a thing might be encoded and represented in a
pulse network like the brain. How would you expect to find the fact that
there's a book on the desk in front of me encoded and represented in pulse
signals by this innate 3D object recognition system you envision which
exists in my brain? If we had the power to probe the brain checking the
different signals we find there, what would we see when we found where
these objects in 3D space were represented in the brain?
> Of course these systems have to be fine tuned and perhaps
> depend on patterned input data to develop. Information for
> building them would come not only from the DNA but also
> from the environment. For example to develop stereo vision
> you need stereo data. For this you need two eyes and of
> course exist on this 3D world of objects.
Yeah, it's easy to show that "tuning" is required. No two human eyes will
have the same layout of sensors so the brain hardware that processes the
data can't make any assumptions about which sensors will activate when a
straight edge is projected onto the back of the eye. That has to be
learned by some "tuning" process if not by the generic pattern recognition
process I argue is causing it.
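As a toy illustration of what such a tuning process could look like (a
generic Hebbian-style rule, not a claim about actual cortical wiring - the
sensor model is invented), a single unit can tune itself to whichever of
its particular sensors happen to fire together:

import random

# Eight "retinal" sensors. An edge stimulus activates a fixed but arbitrary
# subset of them - which subset differs from eye to eye, so no innate wiring
# could assume it in advance.
EDGE_SENSORS = set(random.sample(range(8), 4))

def stimulus():
    edge_present = random.random() < 0.5
    return [1.0 if (i in EDGE_SENSORS and edge_present) else random.gauss(0, 0.1)
            for i in range(8)]

weights = [random.uniform(-0.1, 0.1) for _ in range(8)]

for _ in range(5000):
    x = stimulus()
    y = sum(w * xi for w, xi in zip(weights, x))
    # Hebbian tuning (Oja's rule): strengthen weights on co-active inputs,
    # with a decay term that keeps the weights bounded.
    weights = [w + 0.01 * y * (xi - y * w) for w, xi in zip(weights, x)]

print([round(w, 2) for w in weights])   # big-magnitude weights land on the edge sensors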
> > If there's an innate walking function, how would the
> > learning system make use of it? How would it learn to
> > skip, or jump rope by making use of the "walking" hardware?
> > It's one thing to say they are used like we use a library
> > of subroutines. But how we, as intelligent talking humans,
> > make use of a library of subroutines is of course a whole
> > different issue from how some low level learning hardware
> > makes use of a "library of subroutines".
>
> And how did you learn to skip?
I have no memory of the event. I can only guess.
> Was it from scratch or did
> you start off with twitching legs?
It no doubt was learned after I learned a few hundred other important leg
behaviors, like kicking, and raising a leg to turn myself over, like using
my legs to keep myself sitting up, like using my legs to take a blanket off
of me, like using my legs to crawl, to learning to stand, to learning to
walk, and somewhere way down the list of things to learn, skipping showed
up.
> Or was it using the "jump"
> routine?
It was probably using the "routines" of all previously learned leg behaviors
at the same time.
> And was this "program" written by a verbal description?
It might have been regulated by verbal behaviors. That is, it might have
happened after I talked to someone about skipping (or after a teacher
talked about skipping) and probably happened after I saw someone else doing
it (but it could have been discovered on its own for all I know).
> With practice this kinetic program becomes embodied in the
> brain to be triggered whenever required.
Sure. But again, we are clearly talking about one more of thousands of
learned leg and body behaviors all built on top of each other, layer after
layer. You can't skip until you can hop on one foot, for example.
Funny - I just had to get up and skip around the house to make sure I
understood what it was - probably haven't done that in 30 years! Took me a
few false starts to get it right! :)
The point I was getting at, however, is that if the first behavior you
learn, like walking, is in fact learned, then when we later learn to skip, we can
understand it as a _modification_ to the walking algorithm. But if walking
is innate, then it can't be modified (that's what innate generally means),
which means the only way to learn to skip, is to disable the innate walk
program, and learn a new skip program from the ground up.
I don't know if this argument applies to your position because you have so
far been too vague about what the innate stuff you keep talking about is
actually doing.
> > Even after getting a dump of what TD-Gammon learned (the
> > neural network weights), we can't make any use of it to
> > help us write a better Backgammon program.
>
> Maybe one day we will figure out how to write a computer
> program to extract the logic in the weights.
Well, to be honest, I don't think there's anything to be extracted. They
simply _are_ the program and can't be reduced to anything simpler. They
are a program that's too complex for a human to understand.
> Although the use of a neural net to do multivariate statistics
> to collect a set of weights worked well for backgammon as with
> any tool is good for some things and not for other things. Have
> you bothered to find out why GOFAI works better with chess?
First, I'm not at all sure it does work better in general. That's yet to
be seen.
But more important, chess is a highly artificial environment that shares
almost nothing in common with the problem of producing behavior in the real
world. As such, you would expect a program optimized for solving one
domain would be very different than one optimized for the other.
The main difference that exists in backgammon is the dice roll. Because of
that, you can't "reason out" a move by following a game tree using a
minimax algorithm like you can in chess, checkers, or tic tac toe. You have
to calculate probabilities distributed over the game tree. This is why
backgammon is far more like the real world than these other games. It's a
domain where there are major factors at play which can't be predicted (the
next dice roll).
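To make the difference concrete, here is a rough Python sketch of
probability-weighted search (what the game literature calls expectiminimax).
It is only an illustration - the state methods (terminal, value,
dice_outcomes, moves, apply) are invented names, not anything from TD-Gammon
or a real backgammon engine:

# Chance nodes average over dice outcomes weighted by their probability,
# so the search returns an expected value rather than a pure min/max value.
def expectiminimax(state, depth, maximizing):
    if depth == 0 or state.terminal():
        return state.value()                     # static evaluation
    total = 0.0
    for roll, prob in state.dice_outcomes():     # the unpredictable part
        best = None
        for move in state.moves(roll, maximizing):
            v = expectiminimax(state.apply(move), depth - 1, not maximizing)
            if best is None or (v > best if maximizing else v < best):
                best = v
        total += prob * (best if best is not None else state.value())
    return total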
Interacting with the real world is much the same because no matter how good
the agent becomes at predicting the future, what actually happens most of the
time will be a dice roll. All the agent can ever really do is estimate
and work with expected probabilities. This is why I think TD-Gammon is far
closer to the type of solution that's needed to create AI than the type of
algorithms we typically find in chess programs.
I don't know how it happens in the brain. We know that the
awareness of the spatial relationships between objects takes
place in the parietal cortex and the identification of the
objects takes place in the temporal cortex - the so-called
"where" and "what" data of objects. For example, in the target
program the identification data is independent of the actual
position of the target. The position data and identification
data are separate.
> ... can you suggest what sort of hardware implements this
> hardwired function? That is, other than just saying it
> IS hardwired, what would the hardware be doing to implement
> this hardwired function? How would such a module work?
> And how would it become "tweaked" to deal with the part
> that was learned?
How it happens in the brain I don't know, but in a program
we can list the objects (or features) where their positions
are variables of the objects. Something similar may be
possible in an associative network of features where positions
are one of those features.
> I want you to explain how such thing might be encoded and
> represented in a pulse network like the brain. How would
> you expect to find the fact that there's a book on the desk
> in front of me encoded and represented in pulse signals by
> this innate 3D object recognition system you envision which
> exists in my brain? If we had the power to probe the brain
> checking the different signals we find there, what would we
> see when we found where these objects in 3D space were
> represented in the brain?
I don't know how the brain actually works so how it encodes
or represents things is unclear.
>> > Even after getting a dump of what TD-Gammon learned (the
>> > neural network weights), we can't make any use of it to
>> > help us write a better Backgammon program.
>>
>> Maybe one day we will figure out how to write a computer
>> program to extract the logic in the weights.
>>
>> Well, to be honest, I don't think there's anything to be
>> extracted. They simply _are_ the program and can't be
>> reduced to anything simpler. They are a program that's
>> too complex for a human to understand.
That is why I was interested in starting with something
simpler than backgammon by using an ANN for tic tac toe.
Instead of a large table of ttt states and values generated
by many random games it should be possible to use a smaller
set of weight values in an ANN.
If the machine is using the symbol x then the state,
+---+---+---+
| | | |
+---+---+---+
| | x | |
+---+---+---+
| | | |
+---+---+---+
would have a high value.
People think in two modes.
Why is the center square of a ttt board the first one to take?
Crisp explanation based on logic:
Because it offers the largest number of possible winning configurations.
Fuzzy explanation based on many trials:
Because it has a high probability of resulting in a winning outcome.
A generic hidden layer network requires hundreds, thousands,
maybe millions of training sessions to build up its weights,
because it cannot get to the solution by means of rules.
Rather, it needs many samples so it can interpolate between
the examples it has tried. Every substantially different
kind of example must be tried, otherwise the network will
fail to interpolate correctly. That is why you need so many
examples, unlike the way we would learn to play a game of
backgammon.
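To make the idea concrete, here is a tiny sketch of the sort of network I
have in mind - nine inputs (one per square: +1 for x, -1 for o, 0 for
empty), one hidden layer, and one output giving the value of the position
for x. The sizes and encoding are just assumptions for illustration; after
training on many sampled games, the centre-square state above should score
high:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(9, 18))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=18)        # hidden -> output weights

def value(board):
    # board: list of 9 entries from {+1, -1, 0}; returns a value in (0, 1)
    x = np.asarray(board, dtype=float)
    h = np.tanh(x @ W1)                    # hidden layer activations
    return 1.0 / (1.0 + np.exp(-(h @ W2))) # win-probability estimate

centre_only = [0, 0, 0,
               0, 1, 0,
               0, 0, 0]
print(value(centre_only))   # high after training on many games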
JC
The human is furnished by the genome with a set of motor program
generators that suffices for survival in the expected milieu. (That
milieu which existed in previous generations.)
Breathing – ventral medulla
Orofaciopharyngeal movements…facial expression, vocalization, licking,
chewing, and swallowing – parvicellular reticular nucleus
Reaching, grasping, and manipulating – cervical enlargement (spinal
cord)
Orienting movements…eyes (oculomotor) – dorsal midbrain reticular core
Head and neck – cervical spinal cord
Posture – spinal cord
Locomotion – spinal cord
The specific circuitry in vertebrates has not been worked out, but a
number of invertebrate generators have been. The worked out circuits
(at the neuron level) are highly satisfactory to an electronic
engineer. These (vertebrate) motor program generators are the
foundation for all human behavior. All motor acts (all, all, all)
involve the triggering of a motor program generator. Especially note
phonemes (under vocalization). We hear these phoneme generators being
triggered when the infant babbles.
> > > The environment one dog is born into can be totally different
> > > from the environment his parents were born into.
>
> > I don't think it is *totally* different.
>
> That's true. "totally" is too far of a stretch. It's better said as
> "contains important major differences".
If a dog is born into a milieu that differs in any important respect,
it dies.
> > > If you think it's important that evolution supply a rich
> > > set of low level innate functions for learning to work
> > > with, you HAVE TO SUGGEST what those low level innate
> > > functions are, and you have to suggest how a learning
> > > system could possibly make use of them.
See above for motor program generators. The methods of modification
that allow a walking generator to produce dance require another post.
Ray
I would suggest reading "Life's Other Secret" by Ian Stewart
as it has an interesting chapter 9 on central pattern generators
and how gaits are generated. He mentions the skip as arising
from the walk by way of a secondary phenomenon called period
doubling. The same central pattern generator (CPG) can control
different gaits simply by increasing the rate of stimulation.
The gaits will change from a walk to a trot to a gallop.
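As a toy illustration of that idea (the drive thresholds, phase offsets, and
frequencies here are invented for illustration, not Stewart's model): one
"CPG" whose drive level selects both the rhythm speed and the phase relation
between the legs:

import math

GAITS = [                    # (minimum drive, gait name, phase offset)
    (0.0, "walk",   0.5),    # legs half a cycle apart
    (1.0, "trot",   0.5),    # same offset, faster rhythm
    (2.0, "gallop", 0.1),    # legs nearly in phase
]

def cpg_output(drive, t):
    # Return (gait, left leg activity, right leg activity) at time t.
    gait, offset = "walk", 0.5
    for threshold, name, phase in GAITS:
        if drive >= threshold:
            gait, offset = name, phase
    freq = 1.0 + drive                    # higher drive -> faster rhythm
    left = math.sin(2 * math.pi * freq * t)
    right = math.sin(2 * math.pi * (freq * t + offset))
    return gait, left, right

for drive in (0.5, 1.5, 2.5):
    print(drive, cpg_output(drive, t=0.1))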
It was discovered a long time ago that, when stimulated, the
mesencephalic locomotor region of the brain stem produces
normal walking in cats. It is believed this is the command
center for walking rather than the place that holds actual
walking programs.
Lower systems are modulated by higher systems, so how we walk
is affected by how we feel. We can leap for joy or plod along.
Ian Stewart's book mentioned above might be found in a library.
JC
That's all good stuff. But in order to speak, or walk, or ride a bike, or
drive a car, or play a real-time sport, it's not these innate control
circuits that make it possible. That is, the generic learning sections of
the brain still have to generate a very high volume flow of "triggers" (if
that is what you want to call them) to all these different "motor program
generators".
With those sorts of innate features available, learning to walk on two legs
is clearly far easier than it would be with no innate support. We have
had to walk for millions of years, so having innate support for walking, or
grasping, or many other basic functions is easy to understand. But when we
look at what an adult human has learned in their lifetime, we see that
MOST of it CAN'T be innate. Grasping can't be innate, because we have to
modify and control our grasping behaviors and modify and control when we
use them at all. But it can have lots of innate _support_, in that the
system can be pre-wired with all the sensors and signal paths
needed to create a smooth grasping motion.
But grasping is insignificant compared to the act of building a house with
our hands, or driving, or delivering packages.
Just step back and look at what innate hardware we have to build into a
robot. We have to give it some motors to control its arms and legs. And
we have to supply some "innate motor control commands" for it in the
hardware, so that the computer can send the motor control system a single
command to spin a motor at a given speed.
How different is the motor control command which says "spin clockwise at 10
RPM" from "reach out arm", and "grasp hand"?
To program the "reach out arm" behavior we would have to write a few hundred
lines of code to produce a correctly timed set of "spin motor" commands for
our robot. We would have to write a small OO object to control that
behavior.
How much code do we have to write to create a "grab" "motor program"?
Maybe a few hundred more lines.
How much code do we have to write to make our robot deliver packages for
UPS? We have to add to those few hundred lines about 100 million more lines
of code.
If the human body includes innate hardware to support the "reach out arm"
and the "grab object" behaviors, we have only reduced the amount of code
which the generic learning must create from 100 million to 99.99 million.
We haven't put even the slightest dent in the complexity of the learning
problem that we must solve before we have solved AI.
How many innate features exist in the brain to help shape human behavior is
just not significant to the learning problem, because we know that no matter
how much it is, it's totally insignificant to the complexity of the
learning problem we face. And if we create a learning algorithm strong
enough to write 99.99 million lines of code, why on earth would we choose
NOT to let it write those last 10,000 lines of code as well? Unlike humans,
robots don't have a survival problem. We can protect them while they learn
to walk even if it takes a little longer. And once they have learned to
walk, we just copy the result to all the new robots as innate features they
are born with.
When we write or design complex hardware, we end up creating many levels of
abstraction in the design, or in the code. It's not just "high level" and
"low level". A typical computer these days will have 20 levels of code or
more in it. We have the device drivers, the file systems,
process scheduling, IO buffers, line buffers, memory
management, character mapping, XML format interpretation, data formulas in
a spreadsheet, window frameworks, object garbage collection, button code,
font code, and on and on, all to make the simple translation of a keystroke
in a spreadsheet show up as changes on the screen and on the disk.
We can imagine how much code a robot must have in it in order for it to act
with all the intelligence of an adult human, and we can imagine how much
code it must have in it to act like a baby, and we can do a simple diff.
What we see, if you look at it with any honesty, is that 99% of the code
is in that diff, and that 99% of the code has to be created by some very
advanced, strong generic learning system that can write _any_ code needed to
control the machine's behavior.
The type of innate features we are talking about for humans is similar to
the innate features we find in our PCs - all the functions performed
by specialized hardware, like the disk drives where you can "trigger a
motor control program" to make them "write a block of data" instead of
having to write code to control the disk directly, and all the innate
functions we find in a network card or in the video card, to which we just
send "commands to trigger motor programs".
People build robots with a few low level innate behavior sequences such as
"drive forward", "turn 90 deg right", "stop", and "drive
backwards". You can then add high level learning to that to select a
behavior from the small set of 10 "innate motor control functions". When
you do that, you see the robot doesn't act at all as if it were alive. The
transition from one motor control to the next happens as each one finishes.
And if the sequences each last one second, then we see the machine picking an
action, acting for one second, then picking the next action.
Humans don't make high level decisions once a second. To be able to learn
the things we can learn, the generic learning system must have the power to
make action decisions 100 times a second. It must have the power to stop,
start, or adjust all these "innate motor control programs" at the rate of
about 100 times per second for every part of our body - which adds up to a
requirement to make thousands, if not millions, of _learned_ behavior
decisions every second. And to make those decisions, it has to process and
respond to a huge flow of input sensory data.
The amount of complexity produced in the output motor control system, or
the input sensory pre-processing, is not significant to the fact that after
all that innate help, we will have the same (huge) learning problem to
solve. The innate support might help survival by reducing learning time
for very trivial behaviors like grabbing food and putting it in our mouth,
but it doesn't do anything important to help us learn how to make a living
by playing the piano, which could take 10 years of learning.
> > > > The environment one dog is born into can be totally different
> > > > from the environment his parents were born into.
> >
> > > I don't think it is *totally* different.
> >
> > That's true. "totally" is too far of a stretch. It's better said as
> > "contains important major differences".
>
> If a dog is born into a milieu that differs in any important respect,
> it dies.
Well, it needs a source of food in the environment, and its skill at
getting food is mostly just "eat it if it smells good". If there is no
food in the environment the dog can get just by following its nose and
biting, it's pretty much done for. And likewise, if there are any dangers
in the environment which the dog's innate instincts can't deal with, it
will again die (like cars). So sure, dogs don't have much adaptability.
But they can learn tricks like how to get a door open to get the food
located behind it. And that's not because the ability to open the door was
an innate feature of the dog's behavior. It's because it was within the
range of what it could learn.
> > > > If you think it's important that evolution supply a rich
> > > > set of low level innate functions for learning to work
> > > > with, you HAVE TO SUGGEST what those low level innate
> > > > functions are, and you have to suggest how a learning
> > > > system could possibly make use of them.
>
> See above for motor program generators. The methods of modification
> that allow a walking generator to produce dance require another post.
Yes, and even without your post, it's easy for me to grasp what you are
talking about because you are being specific. John never has been.
And more important, your specifics don't change the argument I have with
John - which is the argument that these innate features don't change, and
aren't important to, the size and complexity of the learning problem we
must solve.
When we build a robot we must provide a control system so the computer can
control the effectors. At a very low level, we could give the computer the
power to turn power to the motor on and off, with no ability to regulate
the power. This would force the computer to pulse the power to the
wheels at a very high speed in order to control the speed.
Or we can build that ability to pulse the power to the wheels into a motor
control board, and then allow the computer to simply send a command to the
control board to set the current power level. The innate motor control
board then pulses the power on and off based on the last command it
received from the computer. That's just an example of moving some of the
"motor control" out of the computer and into the innate hardware.
The robot might need the power to make a 90 deg turn. With the above, the
computer would have to send the correct commands to the wheels for the
correct amount of time to make the robot turn 90 deg to the right.
Or we can move that function into the innate hardware and allow the
computer to just send the command "turn right 90 deg" and then let the
innate hardware make that happen.
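Here is a rough Python sketch of that layering (the class and method names
are invented for illustration): the bottom layer can only switch power on
and off, the "motor control board" turns a speed command into pulses, and a
higher layer turns "turn right 90 degrees" into timed speed commands:

import time

class RawMotor:
    def set_power(self, on):
        pass                            # would switch real hardware on/off

class MotorBoard:
    # "Innate hardware": remembers the last speed command and pulses power.
    def __init__(self, motor):
        self.motor, self.duty = motor, 0.0
    def set_speed(self, duty):          # the single command the computer sends
        self.duty = max(0.0, min(1.0, duty))
    def pulse_once(self, period=0.01):  # called at a high rate by the board
        self.motor.set_power(True);  time.sleep(period * self.duty)
        self.motor.set_power(False); time.sleep(period * (1 - self.duty))

class DriveSystem:
    # A higher "innate behavior" layer built on top of the board commands.
    def __init__(self, left, right):
        self.left, self.right = left, right
    def turn_right(self, degrees=90, rate_deg_per_s=45):
        duration = degrees / rate_deg_per_s       # seconds the turn takes
        self.left.set_speed(0.5)
        self.right.set_speed(0.0)
        for _ in range(int(duration / 0.02)):     # each loop ~0.02 s of pulsing
            self.left.pulse_once()
            self.right.pulse_once()
        self.left.set_speed(0.0)

robot = DriveSystem(MotorBoard(RawMotor()), MotorBoard(RawMotor()))
robot.turn_right(90)    # one high level command instead of thousands of pulses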
Looking at the external behavior of the machine, we can't really tell how
much of the function was in the computer, and how much was in the motor
control board. Either way, the robot ends up turning 90 degrees to the right.
But if the computer has the power to _learn_ to turn only 10 degrees to the
right in response to some stimulus, then we know that even if turning is
part of the innate features of the robot, learning to use it only long
enough to make a 10 degree turn instead of a 90 degree turn is something that
must be learned.
And if the robot likewise has the power to learn how to navigate a large
maze to find power, we know the behavior for that maze is not innate.
If we look at how much human behavior can't be innate, because the need to
create the behavior hasn't been in our evolutionary history long enough for
it to become innate, and compare that to what could be innate, we still see
that MOST of our behavior is learned. Which means that wherever the line is
between innate support and learned, it is not significant to the magnitude of
the learning problem we are still left with to solve. Adding innate
support at the lowest levels doesn't make learning any easier because the
amount that could be innate in humans is insignificant to the amount that
must be learned.
There is good reason why the neocortex is so large compared to the rest of
the brain in a human - because it holds all that stuff that has to be
learned after birth, and for humans, that's a significant amount of
circuitry that has to be built by some strong generic learning technology.
The real complexity of the learning problem comes down to how many
decisions it has to make per second and what resolution of "understanding"
it is able to translate the state of the environment into (like how many
pixels a camera has). And if you postulate a lot of innate support to
reduce how many decisions per second the learning system has to make, or
how much raw sensor data it's working with, the size of the learning
problem is still huge - far larger than any of our current algorithms can
deal with - which is why we need a new type of learning algorithm that can
deal with the scale.
Adding innate modules only reduces the number of decisions per second the
system has to make, or reduces the amount of raw data it has to work with.
But it can't reduce either of those two things to a point that makes
learning suddenly easy, because we know how many decisions per second and
how much data the brain has to work with in order to perform the "learned
tricks" it does, and we know those numbers are orders of magnitude above
what our current simple learning algorithms can deal with.
A newborn baby has innate grasping. Didn't you ever
offer your baby a finger to grasp?
> ... we still see that MOST of our behavior is learned.
How do you measure the amount of behavior?
> Adding innate support of the lowest levels doesn't make
> learning any easier because the amount that could be
> innate in humans, is insignificant to the amount that
> must be learned.
You are confusing difficulty with quantity. I have written
a lot of EASY high level code and very little DIFFICULT
low level code to be used by the high level code.
> There is good reason why the neocortex is so large
> compared to the rest of the brain in a human - because
> it holds all that stuff that has to be learned after
> birth, and for humans, that's a significant amount of
> circuitry that has to be build by some strong generic
> learning technology.
It is unclear where we "hold all that stuff", but it is not
simply data that is held; it is extra machinery to do extra
processing that is provided by the neocortex. We can, for
example, process sound with finer distinctions than, say, a
dog, which is why we enjoy music and dogs don't.
> The real complexity of the learning problem comes down
> to how many decisions it has to make per second and what
> resolution of "understanding" is it able to translate
> the state of the environment into (like how many pixels
> a camera has). And if you postulate a lot of innate
> support to reduce how many decisions per second the
> learning system has to make, or how much raw sensor
> data it's working with, the size of the learning
> problem is still huge - far larger than any of our
> current algorithms can deal with - which is why we need
> a new type of learning algorithm that can deal with the
> scale.
>
>
> Adding innate modules only reduces the number of decision
> per second the system has to make, or reduces the amount
> of raw data it has to work with.
These basic limits can be measured rather than allowing
yourself to be led astray by introspection. For example,
we all experience the temporal illusion that our eyes
move smoothly over the text as we read. In fact they move
in jerks called saccades. At least 0.2 seconds are
required for a fixation. Thus there is a temporal limit
to the assimilation of information. The fastest anyone
can saccade is 5 times per second.
Consider a simple stimulus/reaction such as pressing a
button in response to a light flash or a beep sound.
The fastest you can react is 0.1 seconds to a beep.
It takes 0.04 seconds longer to react to a visual
stimulus.
Now consider a decision reaction where for example the
reaction will depend on which stimulus occurs such as
pressing one button in response to a red light and
another button in response to a green light. It is found
that it adds 0.07 seconds to the delay between the
stimulus and the reaction. It takes extra time to make
the decision. Something not so in your smooth flowing
pulse networks? This extra time is the same for both
a sound and a visual stimulus, so it is independent of
the type of stimulus and reflects the time taken to
make this simple decision.
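Putting those numbers together (assuming, as in the classic subtraction
method, that the extra times simply add - that additive assumption is mine):

simple_auditory = 0.10   # press the button on hearing a beep
visual_penalty  = 0.04   # visual stimuli take a little longer
choice_penalty  = 0.07   # having to choose between two responses

print(round(simple_auditory + visual_penalty, 2))                   # 0.14 s, simple visual
print(round(simple_auditory + choice_penalty, 2))                   # 0.17 s, choice auditory
print(round(simple_auditory + visual_penalty + choice_penalty, 2))  # 0.21 s, choice visual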
Another situation which again does not match your
subjective feeling that pulses "flow" through the
brain involves the temporal decision frame.
It is found in the simple decision reaction that a
decision can only be taken in fixed units of time.
It is as if the occurrence of a sudden event sets
in motion an oscillatory process in which only one
decision can be made per oscillation.
Let's take another example: the temporal ordering
of events. As the duration between two events such
as two light flashes is reduced, a limit is reached
where only one flash is perceived. Now as you
increase the gap, a point will be reached when you
can perceive there are two events but not perceive
the temporal order of those events!
JC
There is no simple or uniquely correct way to enumerate behaviors, but there
are many ways it can be done. The answer is the same every time: most of it
is learned. I'm almost dumbfounded that you don't understand this, because
it's so obvious.
A behavior is a reaction to a stimulus, and the only reactions we really
care about are the ones that show some value for the human (innate or
learned). Measure how many different ways an adult can react to different
stimuli and you will have a measure of the amount of behavior. Check to
see how many of those reactions existed in the baby, and you will then see
what was learned vs. what was innate.
We can start with giving the person a finger and see if they can grasp it.
Baby has it, adult has it - 1 behavior for the innate side.
Then check to see if a baby knows how to turn an alarm clock off. Human
has it, baby doesn't - learned behavior.
Then check to see if they can write their name with a pen. Adult yes, baby
no.
Then check to see if they can answer the question "what is your name?".
Adult yes, baby no. Not innate.
Then check to see if they can open a fridge door to get food. Adult yes,
baby no.
Then check to see if they know how to pour a glass of milk. Adult yes,
baby no.
Then check to see if they can pick up food and put it in their mouth, adult
yes, baby no.
Ride bike in order to get food? Adult yes, baby no.
Ski down a snow slope? Some adults, yes, baby no.
Play golf? Adults yes, baby no.
Write Usenet messages? Adult yes, baby no.
Shop for a birthday present? Adult yes, baby no.
Go to school and get a degree? Adult yes, baby no.
Play a video game and win it? Adult yes, baby no.
Smile? Adult yes, baby yes?
So that's 2 innate behaviors and 13 learned behaviors. And of course,
with something like playing golf, the adult actually had to learn about 100
behaviors to do that, but I'll just count it as one.
I can go on like this for hours and hours and list millions of behaviors
that exist in an adult human, which don't exist in a baby, and which
can't be explained as innate, because opening doors and pouring milk and
playing golf are not things evolution could possibly have built innate
motor control programs for.
Babies have lots of small but important innate behaviors and skills. But
what they have at birth is insignificant compared to what they have as an
adult. The number one important innate behavior a baby is born with is the
innate ability to learn new behaviors.
All of these skills that come so easily for us as adults were at one time in
our life way beyond our ability. Each of these skills requires a very
complex and very high resolution real-time set of control circuits in our
brain (meaning they have to adjust our behavior thousands of times a
second, very quickly, in response to a huge flood of sensory information over
an extended period of time - such as for hours while playing a game of golf,
or many seconds while pouring a glass of milk). And those control circuits
are not innate - they were built by an innate power to learn - an innate
power to build complex control circuits in response to rewards.
> > Adding innate support of the lowest levels doesn't make
> > learning any easier because the amount that could be
> > innate in humans, is insignificant to the amount that
> > must be learned.
>
> You are confusing difficulty with quantity. I have written
> a lot of EASY high level code and very little DIFFICULT
> low level code to be used by the high level code.
I have no clue what you think "difficult" code is. Code is code. No line is
more difficult than another. If I show you the billions of lines of code
that go into a modern PC, can you point out to me which lines of the code
are the difficult code and which are the simple?
"difficult" is not in the code. It's in the eye of the beholder. It's only
difficult if you don't understand what it's doing. It wasn't difficult to
the guy that wrote it; it was just one more line of code the guy wrote in
his life.
What we don't have are strong learning systems that can build complex
systems that are equal in functional complexity to millions of lines of
code written by a human. Not a single learning algorithm I know of can do
that. Except of course, the learning algorithm at work in a human brain
which we have not yet duplicated in our machines.
> > There is good reason why the neocortex is so large
> > compared to the rest of the brain in a human - because
> > it holds all that stuff that has to be learned after
> > birth, and for humans, that's a significant amount of
> > circuitry that has to be build by some strong generic
> > learning technology.
>
> It is unclear where we "hold all that stuff" but it is not
> simply data that is held it is extra machinery to do extra
> processing that is provided by the neocortex. We can for
> example process sound with finer distinctions than say a
> dog which is why we enjoy music and dogs don't.
Yes, "all that stuff" can be talked about as "code", or "machinery" or
"data which is interpreted" or any other conceptual way you want to express
it. It's the same in all cases.
> > The real complexity of the learning problem comes down
> > to how many decisions it has to make per second and what
> > resolution of "understanding" is it able to translate
> > the state of the environment into (like how many pixels
> > a camera has). And if you postulate a lot of innate
> > support to reduce how many decisions per second the
> > learning system has to make, or how much raw sensor
> > data it's working with, the size of the learning
> > problem is still huge - far larger than any of our
> > current algorithms can deal with - which is why we need
> > a new type of learning algorithm that can deal with the
> > scale.
> >
> >
> > Adding innate modules only reduces the number of decision
> > per second the system has to make, or reduces the amount
> > of raw data it has to work with.
>
> These basic limits can be measured rather than allowing
> yourself to be led astray by introspection. For example
> we all experience the temporal illusion that our eyes
> move smoothly over the text as we read. In fact it moves
> in jerks called saccades. At least 0.2 seconds are
> required for a fixation. Thus there is a temporal limit
> to the assimilation of information. The fastest anyone
> can saccade is 5 times per second.
Sure, but what's important is how much information is flowing into the
brain from the eyes, how fast the brain has to make decisions in
response to that data, and how many decisions have to be made to control a
body when it's doing something like running down a hill, or driving a car.
Careful study could produce good numbers for these things, but even without
careful study, I know the numbers create a learning problem on a scale many
orders of magnitude beyond what our current simple learning
algorithms can deal with.
> Consider a simple stimulus/reaction such as pressing a
> button in response to a light flash or a beep sound.
> The fastest you can react is 0.1 seconds to a beep.
Are you saying .1 seconds to push the button? So we are talking about how
fast the brain has to first receive the sound, recognize the "beep"
pattern, then generate the right control signals to make the finger
start to move, and get this big heavy finger moving far enough to press the
button, all in .1 seconds?
> It takes 0.04 seconds longer to react to a visual
> stimulus.
Meaning .14 seconds?
> Now consider a decision reaction where for example the
> reaction will depend on which stimulus occurs such as
> pressing one button in response to a red light and
> another button in response to a green light. It is found
> that it adds 0.07 seconds to the delay between the
> stimulus and the reaction. It takes extra time to make
> the decision. Something not so in your smooth flowing
> pulse networks?
Yes, my pulse networks do things like that. It's beyond your understanding
however so don't worry yourself with it.
> This extra time is the same for both
> a sound or a visual stimulus so it is independent of
> the type of stimulus and reflects the time taken to
> make this simple decision.
That is no "simple" decision. It's a high level decision triggered in
response to previous _verbal_ command. The person was instructed verbally
how to respond to the test and he's in effect running some verbal
interpretation process as he responds to the stimuls. It means the path
though the brain which the stimulus signal is taking is passing though
parts of the brain that control our verbal behaviors. The stimulus has to
activate the correct sections of our verbal hardware, which then has to
respond which then gets to the finger - just as when someone says to you -
push the red button, and then you respond by pushing the red button. That
works by passing though the parts of the brain that deal with responses to
verbal commands. In order to take test that was explained to us verbally,
we have to use those same parts of the brain. In effect, even when are
taking that test, we are making verbal decisions as we take the test.
If on the other hand, you train the person on this test long enough, I
strongly suspect the brain would rewire itself so it didn't have to use the
verbal parts of the brain and the reaction time would drop.
The simple decision you are talking about is complex verbal behavior. The
simple decisions I was talking about are how our body quickly reacts to
visual cues from the moving environment to trigger our hand motions on the
steering wheel of the car.
> Another situation which again does not match your
> subjective feeling that pulses "flow" through the
> brain is involves the temporal decision frame.
> It is found in the simple decision reaction that a
> decision can only be taken in fixed units of time.
> It is as if the occurrence of a sudden event sets
> in motion a oscillatory process in which only one
> decision can be made per oscillation.
>
> Lets take another example: the temporal ordering
> of events. As the duration between two events such
> as a flashing light is reduced a limit is reached
> where only one flash is perceived. Now as you
> increase the gap a point will be reached when you
> can perceive there is two events but not perceive
> the temporal order of those events!
>
> JC
I think your thinking is highly muddled and confused.
You seem to be talking about tests that involve discrete events like lights
flashing and buttons being pushed, and you only seem to understand the
concept of a "decision" in the context of these very high level discrete
events.
Try testing humans on a continuous behavior event like balancing, or
driving a car, or catching a ball thrown to them. Make the human play a
real-time video game that requires constant motion on the part of the human
to play the game, like moving a steering wheel or a joystick, or swinging a
baseball bat to hit a curve ball.
Give the human lots of variations in the test so the road curves in a very
dynamic way and watch to see how the human responds with their smooth and
continuous set of adjustments to the joy stick or the steering wheel.
If you were to study the response curve of the human against the stimulus
curve, I'm sure you would find that the human in effect is not making 2 or 3
"decisions" a second, but in fact is making more like 100 of them per
second.
The brain works in parallel and is pipelined. It might take .2 seconds for
a human to respond to a stimulus change, but that doesn't mean they are only
making 5 "decisions" per second. It only means that for that type of
decision, there was a .2 second propagation delay through the brain. The
brain can be making some decisions in response to what the eyes are seeing,
other decisions in response to what he is hearing, other decisions in
response to what he is feeling, and so on. I can drive my car, talk on
the cell phone, drink coffee, and adjust the radio all at the same time.
All of those are continuous real-time control programs running in parallel,
which each take hundreds of "decisions" per second in terms of what the
learning functions of the brain have to be controlling.
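Here is a small sketch of that latency-versus-rate point (the pipeline
structure and numbers are only an illustration): the controller issues a new
output every 10 ms, a hundred per second, even though each output is based
on sensory input from 0.2 seconds earlier:

from collections import deque

DT = 0.01                    # one sensor sample / one motor command per 10 ms
DELAY_STEPS = 20             # 0.2 s of propagation delay = 20 steps of DT

def run(sensor_samples):
    pipeline = deque([0.0] * DELAY_STEPS)   # signals still travelling
    outputs = []
    for sample in sensor_samples:
        pipeline.append(sample)
        delayed = pipeline.popleft()         # the input seen 0.2 s ago
        outputs.append(0.5 * delayed)        # a new motor command every step
    return outputs

commands = run([float(i) for i in range(100)])
print(len(commands), "commands in", len(commands) * DT, "seconds")
# 100 commands issued in that second, despite the 0.2 s lag on each one.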
The "simple decisions" I'm talking about here, is the production of a
_single_ output pulse under the control of decision circuits built by the
learning function of the brain. The simple decision you were talking about
was a highly complex verbal response to a highly complex verbal question of
"press the red button when you see a dog, and press the black button when
you see a cat". That's not one pulse of output behavior, it's thousands of
them.
Better pass that on to the behavioral scientists that develop
these experiments.
> You seem to be talking about tests that evolve discrete
> events like lights flashing, and buttons being pushed
> and you only seem to be understand the concept of a
> "decision" in the context of these very high level
> discrete events.
Yeah much like Skinner and his rat's decision to push
a lever in response to a light flash or beep!
They are simple black box investigations as done in
behavioral science.
> Try testing humans on a continuous behavior event like
> balancing, or driving a car, or catching a ball thrown
> to them. Make the human play a real time video game
> that requires constant motion on the part of the human
> to play the game, like moving a steering wheel or a joy
> stick or swinging baseball bat to hit a curve ball.
I can see you giving Galileo a lecture on how his rolling
ball experiments are too simple to explain anything and
he should get out there and look at motion in the real
world which is more complex and can't be explained by
the simple motions of the balls in his simple experiments.
> Give the human lots of variations in the test so the
> road curves in a very dynamic way and watch to see how
> the human responds with their smooth and continuous set
> of adjustments to the joy stick or the steering wheel.
The smoothness is an illusion, as is the continual glow
of a flashing LED light when the frequency of flashes
per second is greater than a certain limit.
> I can drive my car, talk on the cell phone, drink coffee,
> and adjust the radio all at the same time. All of those
> are continuous real time control programs running in
> parallel. Which each take 100's of "decisions" per second
> in terms of what the learning functions of the brain has
> to be controlling.
Again you are led astray by introspection.
It is true that some things can be carried out in parallel
but you have to actually do the experiments and time them.
Experiments have shown you cannot talk on the cell phone
and drive a car as safely as you could if you were simply
driving the car. It is called multitasking and is no more
parallel than multitasking in a modern computer. It is an
introspective illusion.
When two tasks compete for the same internal mechanisms
they require extra time when carried out together as
opposed to the time to carry out any of the two tasks
by themselves.
Clever experiments are setup to determine to what extent
two tasks compete for resources and which ones are serial
in nature and which ones are parallel in nature.
> The simple decision you were talking about was a highly
> complex verbal response to a highly complex verbal
> question of "press the red button when you see a dog,
> and press the black button when you see a cat". That's
> not one pulse of output behavior, it's thousands of them.
Whatever the "complexity" between the input and output
there is a fixed minimum time required to perform ANY
such task. It can of course take longer. If the task
is harder such as having to decide between two alternate
responses then a larger time interval would be required.
How "complex" the intermediate stages are is not at issue.
JC
> With those sorts of innate features available, learning to walk on two legs
> is clearly far easier than it would be with no innate support. We have
> had to walk for millions of years, so having innate support for walking, or
> grasping, or many other basic functions is easy to understand. But when we
> look at what an adult human has learned in their lifetime, we see that
> MOST of it CAN'T be innate. Grasping can't be innate, because we have to
> modify and control our grasping behaviors and modify and control when we
> use them at all. But it can have lots of innate _support_, in that the
> system can be pre-wired with all the sensors and signal paths
> needed to create a smooth grasping motion.
Different connotations of “innate” come to mind. One is that the
genome (in sequences of nucleotides) contains instructions that the
RNA reads and constructs proteins that cause a group of neurons in the
cervical enlargement to organize themselves as a grasping circuit. The
nucleotides were sequenced by evolution. This circuit (when triggered)
can produce a sequenced program of axonal pulses. This program
proceeds and is smoothed by the cerebellum. It continues to the
ventral anterior-ventral lateral complex in the thalamus, and then, if
not halted by the thalamic reticular nucleus, continues to the motor
cortex, and then on to the muscle fibers. The organism grasps.
A second connotation is of a soul (mind) with causal powers that
desires to grasp. It activates neurons to drive the muscles to grasp.
The end result is exactly the same. It all depends on our concept of
“innate”.
Ray
It's not the experiments you are making reference to which are at fault.
It's what YOU seem to want to conclude they indicate that I find fault
with.
> > Give the human lots of variations in the test so the
> > road curves in a very dynamic way and watch to see how
> > the human responds with their smooth and continuous set
> > of adjustments to the joy stick or the steering wheel.
>
> The smoothness is an illusion as is the continual glow
> of a flashing LED light when the frequency of flashes
> per second a greater than a certain limit.
My belief that behavior is "smooth" has nothing to do with the illusion of
smoothness in _perception_ that you are referring to.
When I wave my hand around it is not making jerking motions 5 times a
second like an eye saccade. It's moving smoothly and continuously through a
3D space. The various muscles in my body are likely making various jerking
motions as they control and correct the motion, but the corrections in all
the muscles combined are happening at rates far in excess of 5 times per
second.
You have looked at these experiments and seem to be concluding that the
brain is "making decisions" at some low rate of something like 5 times per
second and that's just not justified by the type of experiments you are
talking about.
> > I can drive my car, talk on the cell phone, drink coffee,
> > and adjust the radio all at the same time. All of those
> > are continuous real time control programs running in
> > parallel. Which each take 100's of "decisions" per second
> > in terms of what the learning functions of the brain has
> > to be controlling.
>
> Again you are led astray by introspection.
John, I'm not talking introspection. I'm talking PHYSICS. It's impossible
to build a robot to perform such behaviors if all you do is update its
actuators 5 times per second. Do you honestly think you can build a robot
with arms and hands and make it play the piano as accurately as a human can
by using a control process that sends new commands to the arms and fingers
only 5 times per second?
You do understand that the complex control program that drives the arms and
legs to perform a task such as playing the piano was learned, right? Which
means the power of our generic behavior learning hardware has to have the
resolution needed to make a robot with 10 fingers play the piano as
accurately as a human, right?
This is not a control problem that can be solved with a decision process
that operates at the rate of 5 decisions per second. Human behavior is not
a control problem that can be solved by a decision process that operates at
5 decisions per second.
> It is true that some things can be carried out in parallel
> but you have to actually do the experiments and time them.
I don't have to do an experiment to know that humans have the ability to
use their arms and legs in parallel for different tasks. Just look at
someone walking, or a drummer playing the drums, or anyone playing a
sport.
The brain is a parallel control system because it's got lots of different
body parts to control. Sometimes the body parts have to work together,
sometimes they have to work mostly independently. The amount of
coordination is a function of what the task requires.
> Experiments have shown you cannot talk on the cell phone
> and drive a car as safely as you could if you were simply
> driving the car. It is called multitasking and is no more
> parallel than multitasking in a modern computer. It is an
> introspective illusion.
The networks that create these behaviors are fully interconnected. What on
earth would lead you to believe that the "parallel" behaviors they produce
were 100% independent, or lead you to believe I thought that? Of
course they are not. Of course these parallel tasks tend to share brain
sections, and the more they share, the more they interfere with each other.
I can drive and talk on the phone at the same time, but I didn't say I
can drive just as well while talking on the phone, nor talk on the phone
just as well while driving.
> When two tasks compete for the same internal mechanisms
> they require extra time when carried out together as
> opposed to the time to carry out any of the two tasks
> by themselves.
The amount of internal mechanisms they share depends on the task. Some have
almost no conflict; some have so much conflict they can't happen in
parallel.
> Clever experiments are setup to determine to what extent
> two tasks compete for resources and which ones are serial
> in nature and which ones are parallel in nature.
Sure. Of course.
> > The simple decision you were talking about was a highly
> > complex verbal response to a highly complex verbal
> > question of "press the red button when you see a dog,
> > and press the black button when you see a cat". That's
> > not one pulse of output behavior, it's thousands of them.
>
> Whatever the "complexity" between the input and output
> there is a fixed minimum time required to perform ANY
> such task. It can of course take longer. If the task
> is harder such as having to decide between two alternate
> responses then a larger time interval would be required.
> How "complex" the intermediate stages are is not at issue.
Yes, we have a reaction time that is limited by the speed at which
information can flow through the brain. Duh.
At the same time, the brain is a temporal pattern processing device. Which
means it can't respond until it has seen the full temporal pattern it
is responding to.
If I ask you the question, "what is your name?", you can correctly
respond to that temporal pattern, which is probably about 1 second long.
But your response can't happen until I _finish_ asking the question, because
your brain is responding to that full 1 second temporal pattern of sounds.
Likewise, the same sort of temporal processing can be taking place
internally. External temporal patterns can trigger the internal production
of a corresponding sequence of events which then trigger the correct
response. But since the response was triggered by some internal temporal
pattern of private events, it can't happen until after the internal pattern
has first been generated - just like how we can't answer a question until
after it's been asked - or recognize a word until after it's been fully
spoken.
As such, the amount of delay in a response could be of nearly any length,
depending completely on what sequence of internal events was triggered to
create it.
The minimum response time will be based on the speed of information flow in
the shortest path through the brain, but the maximum response time is nearly
unbounded. If someone asks, "show me a working solution to AI", the
response might take 100 years to produce! :)
Our use of the word "innate" in this thread means "not learned". It means
nature instead of nurture. It means a functional feature of the human body
which developed (at least mostly) due to the internal physical structure of
the body (most importantly but not limited to our genes) instead of in
response to external features of the environment.
We have been debating the question of how much, and what sort, of innate
features we have to build into a machine in order to make it act like a
human. My argument is the only really important technology we need to
create to solve AI is strong generic learning. The rest of it will be
simple and obvious by comparison once we have strong learning systems.
John believes humans are far more complex than just generic learning systems
and that to get close to human behavior we will need a lot more in the way
of innate support hardware (but he is normally fairly vague as to what the
required extra hardware actually does).
> John believes humans are far more complex than just generic learning systems
> and that to get close to human behavior we will need a lot more in the way
> of innate support hardware (but he is normally fairly vague as to what the
> required extra hardware actually does).
If you look at humans, some sources of genetic complexity are the
pre-processing of sensory inputs, and the way in which the function
of the whole brain is hackily modified by emotionally-triggered
neurotransmitters.
Also, if you dissect a brain, there is a certain amount of structure.
There are things like cortical layers, the cerebellum, and there
are many types of neurons, and many types of glial cells that modify
synaptic signals. The genetic complexity is non-trivial. We
don't know if it is all needed - but the task is probably not simple -
or we would be done by now.
--
__________
|im |yler http://timtyler.org/ t...@tt1lock.org Remove lock to reply.
> When I wave my hand around it is not making jerking
> motions 5 times a second like an eye saccade. It's
> moving smoothly and continuously though a 3D space.
> The various muscles in my body are likely making
> various jerking motions as they control and correct
> the motion but the corrections in all the muscles
> combined are happening at rates far in excess of 5
> times per second.
You jumped from eye saccades to arm movements when I
was talking about temporal illusions of eye movements
to illustrate how you need to actually do experiments
to know what the brain really does! I never suggested
arms move in 5 jerks per second.
> You have looked at these experiments and seem to be
> concluding that the brain is "making decisions" at
> some low rate of something like 5 times per second
> and that's just not justified by the type of
> experiments you are talking about.
I define making A decision as: the observable action.
When the button is pressed THE decision has been made.
You seem to be talking about lots of little decisions
taking place in the black box. I am talking about
what we actually observe, an output. If someone picks
up a cake we say they made a decision to pick up the
cake, a single event, we do not say they have just made
millions of internal "decisions" to pick up a cake.
If a rat presses a lever we say the rat decided to
press a lever we don't say many decisions resulted
in a lever press which is how I think you see it.
When we record the time delays between the stimulus
and the reaction we find peaks of about 0.24 seconds.
How are we to explain this preference for, and avoidance
of, certain reaction times? It doesn't make sense if you
think of stimulus responses as continuous, in which case
the response could have taken place at any time after
the stimulus.
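A toy version of that "decision frame" idea (the 0.08 s period and the
spread of processing times are invented numbers, just to show the shape of
the effect): the stimulus resets an internal oscillation, the prepared
response can only be released on one of its ticks, and the observed reaction
times pile up at a few preferred values instead of spreading out evenly:

import math
import random
from collections import Counter

OSC_PERIOD = 0.08                       # hypothetical decision-frame length (s)

def reaction_time(rng):
    processing = rng.gauss(0.15, 0.03)  # time to prepare the response (varies)
    ticks = max(1, math.ceil(processing / OSC_PERIOD))
    return ticks * OSC_PERIOD           # released on the next oscillation tick

rng = random.Random(0)
counts = Counter(round(reaction_time(rng), 2) for _ in range(1000))
print(sorted(counts.items()))           # mass only at multiples of 0.08 s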
> John, I'm not talking introspection. I'm talking
> PHYSICS. It's impossible to build a robot to perform
> such behaviors if all you do is update their actuators
> 5 times per second. Do you honestly think you can
> build a robot with arms and hands and make it play
> the piano as accurate as a human can by using a
> control process that sends new commands to the arms
> and fingers only 5 times per second?
As I indicated above, this is another example of the
personal way you use words. If someone presses a
button or lifts an arm that is ONE decision. It is
not about the millions of pulses involved to carry
out that ONE decision. Although I would hasten to
add I am not suggesting a particular part of the
brain makes that decision. It is in some sense a
group decision.
As indicated before decisions take place at preferred
intervals of time after the stimulus indicating an
oscillation is involved just as in a computer program.
If the game character doesn't move in this screen
frame it has to wait for the next screen frame, no
matter how many "decisions" are taking place between
frames.
> You do understand that the complex control program
> that drives the arms and legs to perform a task such
> as play the piano was learned right? Which means the
> power of our generic behavior learning hardware has
> to have the resolution needed to make a robot with
> 10 fingers play the piano as accurate as a human right?
Indeed we learn to play a piano. At the start we make
decisions as to what key to press. This is all "recorded"
for playback later. You are confusing this playback
with the decision making process that controls both the
learning and the execution of the process.
> I don't have to do an experiment to know that humans
> have the ability to use their arms and legs in parallel
> for different tasks. Just look at someone walking, or
> a drummer playing the drums, or anyone playing a some
> sport game.
Current programs in current computers are serial, right?
They can control multiple outputs at what appears to be
the same time. You DO have to do experiments. You
cannot detect the difference between multitasking and
parallel processing when the multitasking is too fast
for your sensory input to handle.
> The brain is a parallel control system because it's got
> lots of different body parts to control.
The brain is indeed a parallel control system but not all
tasks are possible by this parallel control system in
which case it has to resort to a serial process. Only
experiments on different tasks can tease them apart.
> The networks that create this behaviors are fully
> interconnected.
What experiments have made you believe this?
> I can drive and talk on the phone at the same time, but I
> didn't not say I can drive just as well while talking on
> the phone nor talk on the phone just as well while driving.
I think there is confusion as to what we mean by a parallel
brain. You see it as one bag of units. I see it as a collection
of useful modules, including modules for controlling modules.
The parts within modules are not all fully interconnected and
the modules are only -potentially- fully connectable if that
is required. Think of the object modules in OOP. It makes
sense that one module doesn't have full or uncontrolled access
to the parts of another module.
The problem is, you start with a belief about how the brain
is wired and then draw conclusions about the brain based on
those beliefs rather than experimental evidence.
> Yes, we have a reaction time that is limited by the speed
> at which information can flow though the brain. Duh.
But that wasn't the point. Duh. It is the different amounts
of time for different tasks that is the point. It is used to
work things out.
From another post Curt wrote:
> John believe humans are far more complex than just generic
> learning systems and that to get close to human behaviour
> we will need a lot more in the way of innate support
> hardware (but is normally fairly vague as to what the
> required extra hardware actually does).
The evidence is that the brain's learning system has a lot of
innate hardware at its disposal, which makes evolutionary sense
and is a separate question from what is possible with learning
machines. I see no issue with investigating the possibility of
a system showing learning behavior with just raw input. My
personal view is the learning system of the human brain doesn't
have to learn about the raw input, as evolution has provided
networks that do that automatically. So although I agree that
we learn ALL our high level behaviors, I don't agree that we
learn them out of the raw data.
And it is my view that the few low level innate behaviors made
from raw data would be hard to learn compared with the millions
of possible high level human behaviors.
I also think the human learning system is primed to learn from
the social environment which is itself evolving. Much of what
we know is taken not from our own learning efforts but from
the accumulated knowledge of this social system. Without this
social system we probably wouldn't be much smarter than a chimp
no matter how much raw data we had available.
JC
> I define making A decision as: the observable action.
>
> When the button is pressed THE decision has been made.
This is one definition, and a very good one in the limits of
behavioral psychology.
I would suggest one in the limits of neuroscience. A decision is
reached when the thalamic reticular nucleus is inhibited, allowing the
motor program to proceed that moves the arm to press the button.
> Our use of the word "innate" in this thread means "not learned". It means
> nature instead of nurture. It means a functional feature of the human body
> which developed (at least mostly) due to the internal physical structure of
> the body (most importantly but not limited to our genes) instead of in
> response to external features of the environment.
You are black-boxing the body, even as Casey black-boxes the brain. If
"innate" is to be given any meaning, it must refer to the motor program
generators. These MPGs can in many cases be given particular
locations in the brain. They are responsible for the totality of
muscle movement that we call behavior. Not some movement, but the
totality. We believe MPGs to be specific neural circuitry, as such
circuitry has been found in invertebrates.
> We have been debating the question of how much, and what sort, of innate
> features we have to build into a machine in order to make it act like a
> human. My argument is the only really important technology we need to
> create to solve AI is strong generic learning. The rest of it will be
> simple and obvious by comparison once we have strong learning systems.
There is a weak synaptic strengthening following a successful outcome,
and a very strong one following a bad outcome. Following a bad outcome
there is a strengthening of recently fired synapses in the thalamic
reticular nucleus. If the bad situation arises again the TRN will be
activated and abort the suggested motor program.
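Read as a toy model rather than settled neuroscience, that mechanism
could be sketched roughly like this (the names, numbers, and threshold
logic are my own illustrative assumptions): a good outcome weakly
strengthens the suggested program, a bad outcome strongly strengthens
an inhibitory gate, and when the gate outweighs the program the
movement is aborted.

  # Toy sketch of the claimed mechanism, not a model of the real TRN.
  program_weight = {}   # situation -> how strongly the program is suggested
  veto_weight = {}      # situation -> learned inhibition (the "gate")

  def learn_from_outcome(situation, outcome_was_bad):
      if outcome_was_bad:
          # very strong strengthening of the gate after a bad outcome
          veto_weight[situation] = veto_weight.get(situation, 0.0) + 0.4
      else:
          # weak strengthening of the program after a successful outcome
          program_weight[situation] = program_weight.get(situation, 0.0) + 0.05

  def try_motor_program(situation, program):
      if veto_weight.get(situation, 0.0) > program_weight.get(situation, 0.0):
          return "aborted"   # the gate wins and the movement never starts
      return program()       # otherwise the motor program proceeds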
> John believe humans are far more complex than just generic learning systems
> and that to get close to human beahvior we will need a lot more in the way
> of innate support hardware (but is normally fairly vague as to what the
> required extra hardware actually does).
There is nothing vague about the motor program generators. They are
very specific.
>> I define making A decision as: the observable action.
>>
>> When the button is pressed THE decision has been made.
>
>
> This is one definition, and a very good one in the
> limits of behavioral psychology.
>
> I would suggest one in the limits of neuroscience.
> A decision is reached when the thalamic reticular
> nucleus is inhibited, allowing the motor program
> to proceed that moves the arm to press the button.
However, the word "decision" is a label for an observable
external behavior. Saying that a behavior is released by
some internal switch doesn't explain anything about how the
behavior was learned, how it was stored, how it was selected,
or even what form it takes.
JC
Yeah there's lots of complexity there at many levels. It all very nicely
supports the sorts of ideas that John and others like to believe - that
intelligence is the result of lots of different complex systems working
together. It's really no wonder why people believe these things.
However, there are a few trumping facts about intelligence that few people
understand and most people choose to ignore. The prime fact is that our
intelligent behavior is all learned.
We don't know how to drive cars, or do computer programming, or be
engineers, or build buildings, or solve math problems, or speak a language,
or organize ourselves into a society, or even how to act rationally, at
birth. It's all learned. If you take away the stuff that's learned in a
human, what's left is about as intelligent as a rock.
And when you study the learning problem enough, you realize there can't be
20 different learning systems at work shaping our behavior - it just can't
work. It must be one global generic learning system. Not enough people
understand this because not enough people have looked hard enough at the
learning problem to see it.
The reason we haven't solved the problem despite so many highly educated
and highly intelligent people working on it, is because generic learning of
the type used in the brain is simply very hard to duplicate. It's a type
of technology that's totally unlike the systems we use in our machines.
That's because our machines must be structured in ways that are easy for
us, as the designers, to understand and predict. The type of learning
system needed for AI is a machine structure which can't be understood by a
human. That is, once it's trained, it's too complex to understand what it
will do - just like no human could have manually set the weights of the
neural network that makes TD-Gammon a good backgammon player. As such,
this technology is not naturally on the path of the types of technology we
have been developing in the IT field for the past 70 years. Engineering
requires we build machines that are highly predictable. AI will be a
technology that creates machines we won't be able to predict.
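For readers who haven't looked at TD-Gammon: its weights were set by
temporal-difference learning from self-play rather than by hand. A
minimal sketch of the core TD(0) update on a linear value function (my
own simplification - Tesauro's program used a neural network and
TD(lambda)) looks roughly like this:

  # Minimal TD(0) update on a linear value function. Only the core idea:
  # the weights end up wherever the accumulated prediction errors leave
  # them, which is why no human could have set them by hand.

  NUM_FEATURES = 8
  weights = [0.0] * NUM_FEATURES
  ALPHA, GAMMA = 0.1, 0.99

  def value(features):
      return sum(w * f for w, f in zip(weights, features))

  def td_update(features, reward, next_features):
      # Move this state's estimate toward reward + discounted next value.
      td_error = reward + GAMMA * value(next_features) - value(features)
      for i, f in enumerate(features):
          weights[i] += ALPHA * td_error * f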
It's also a technology which, by its very nature, is nearly impossible to
reverse engineer. If you just studied TD-Gammon playing backgammon, you
couldn't in a thousand years reverse engineer the code that drives it (or,
at least, it would take a serious amount of work to do so if not actually
a thousand years). It's like trying to reverse engineer an encryption
program. That's just the nature of the type of system we are dealing
with.
Human behavior has the exact same characteristic. You simply CAN NOT
reverse engineer the exact specifics of the machine by studying human
behavior. You can pick up a lot of its characteristics (as is done in
endless psychology experiments), but the very structure of the machine
simply does not show itself in its behavior, just like encryption
algorithms don't show their internal structure in their external behavior.
It's the very nature of this type of learning machine to keep itself
very well hidden. We could make far more progress if people had unlimited
access to the brains of living humans to do experiments on. We would have
solved AI probably 40 years ago if we had the ability to do unlimited
experiments on living humans. But we are built not to harm humans, and as
such, won't do that.
So we are stuck trying to reverse engineer a system which happens to have a
nature that makes it nearly impossible to reverse engineer by studying its
behavior, and stuck with very limited access to the inside of working
versions of this machine. It's nearly as hopeless as trying to crack a
highly advanced encryption algorithm without having access to the inside of
the machine.
Though the brain has lots of innate complexity, most of that complexity
is not why we are intelligent. It's lots of extra optimizations and
implementation details that have been important for human survival over
the eons, but it's not why we are intelligent. We are intelligent because
we have very strong generic learning hardware in us, and all the rest is
of only minor importance.
You are the only one I've really seen talking about MPGs in the way you
do, so I don't know how much of this is well-known fact about the human
brain, or how much is just crap you are making up.
However, regardless, either these MPGs must be learned, or they are simply
a set of standard and uninteresting fixed behaviors (like raise right arm,
or reach out with right arm, or grasp object with fingers, etc.), from
which all other complex intelligent human behaviors are constructed by
mixing and matching the application of these MPG behaviors.
Either way, we still have the exact same generic learning problem in the
brain to be solved, and that learning program doesn't care what fixed
behaviors it has to pick from - it must have the strength to learn no
matter what they are.
In short, the MPGs you keep talking about really have nothing to do with
creating machine intelligence, or else they ARE the learning technology,
and until you describe how an MPG is created in response to conditioning
(how learning works), you haven't really said anything useful.
> > We have been debating the question of how much, and what sort, of
> > innate features we have to build into a machine in order to make it act
> > like a human. My argument is the only really important technology we
> > need to create to solve AI is strong generic learning. The rest of it
> > will be simple and obvious by comparison once we have strong learning
> > systems.
>
> There is a weak synaptic strengthening following a successful outcome,
> and a very strong one following a bad outcome. Following a bad outcome
> there is a strengthening of recently fired synapses in the thalamic
> reticular nucleus. If the bad situation arises again the TRN will be
> activated and abort the suggested motor program.
I just did a google on thalamic reticular nucleus and found nothing that
seems to match what you are suggesting here. The three web sites implied
1) no one really knows what it does, and 2) it seems to be a sensory
gateway to the brain and not something associated with the output side as
you keep describing.
> > John believe humans are far more complex than just generic learning
> > systems and that to get close to human beahvior we will need a lot more
> > in the way of innate support hardware (but is normally fairly vague as
> > to what the required extra hardware actually does).
>
> There is nothing vague about the motor program generators. They are
> very specific.
You have to explain how they work by creating models of the MPGs that
make a machine act like an intelligent human. Otherwise, all you have
said is, "there is hardware that controls the behavior that is changed
based on whether the outcome was good or bad".
That implies the exact same thing I imply when I say intelligence is a
reinforcement learning system - which is a specification of what we have to
build, but tells us absolutely nothing about how to build it.
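To put that last distinction in code terms: the "specification" level of
reinforcement learning is little more than an interface, and all the hard
part lives in the unwritten bodies (a sketch, with the method names being
my own invention):

  # The "specification": an agent that observes, acts, and receives reward.
  # Writing this down is easy; filling in the bodies so the agent ends up
  # behaving intelligently is the whole unsolved problem.

  class ReinforcementLearner:
      def act(self, observation):
          raise NotImplementedError("how actions are chosen: unspecified")

      def learn(self, observation, action, reward, next_observation):
          raise NotImplementedError("how experience changes behavior: unspecified")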