2009/12/3 "José Antonio Martín H." <jama...@fdi.ucm.es>:
> Hi all.
>
> Does anybody knows of any research about solving "non renewable reward"
> problems?
Sorry no. I don't keep up with RL research so I can't help you there.
But I have a few comments about your thinking here.
The standard reinforcement learning abstraction of reward maximizing
can be made to fit a wide range of problems. I would say every
problem because I don't know of any it doesn't fit, but that would be
too broad a claim. You don't have to define a new type of
reinforcement learning in order to fit the type of problem you are
talking about. It's just wrong to think about it that way. You
simply define the reward signal to match your problem and then the
agent's goal remains the same as it always is in reinforcement
learning - which is reward maximizing.
> I mean that there is a limited total reward in the environment and every
> time the agent gets some reward the global level decrease.
That alone, as you specify it, doesn't change anything. The general
goal of the agent is still to get all the rewards as quickly as
possible. If there are 100 rewards in the environment and one RL
agent only gets 10 after interacting with the environment for hours,
and another agent gets all 100 in 10 seconds, the second agent is
clearly "better" than the first for that environment. This is just
the standard RL abstraction.
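To put rough numbers on that comparison, here is a small Python sketch. The discount factor, the step counts, and the name discounted_return are my own assumptions for illustration; nothing above fixes them. Discounting is just one common way to express "as quickly as possible" inside the usual reward-maximizing objective.

```python
def discounted_return(rewards, gamma=0.999):
    """Sum of rewards, each weighted by gamma**t for the step t it arrived on."""
    total = 0.0
    weight = 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total

# Fast agent: collects all 100 rewards in its first 100 steps (assumed timing).
fast = [1.0] * 100 + [0.0] * 9900

# Slow agent: collects only 10 rewards, one every 1000 steps (assumed timing).
slow = [1.0 if t % 1000 == 999 else 0.0 for t in range(10000)]

print(discounted_return(fast) > discounted_return(slow))
```

Both agents are scored by the same objective; the second simply does better on it.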
> One key point in this "sustainable" dynamic is that the Agent's
> objective is not just to get all the future cumulative reward but
> instead that the Agent only needs just some level of reward for which
> the agent can feel satisfied. So if the agent get higher reward that its
> level of satisfaction then it is squandering resources (reward).
If you want to implement a "feel satisfied" effect you do it by
changing the reward generator in the environment and not by changing
the RL abstraction itself to some new type of learning abstraction.
You can for example create an environment that has some type of token
that you want the agent to collect. We can call it food to parallel
what it seems like you are talking about here. We don't give the
agent a constant reward for every bit of food it collects (eats). If
we do, then the RL maximizing agent will eat as much as it can as fast
as it can without stopping. We instead could create something that
roughly parallels what life forms have to deal with - which is energy
collection and storage. Life forms store energy for later use, but
there's a limit to how much they can store. Actions consume energy
and when the agent runs low on energy it needs more.
Let me switch here from talking about life to talking about a robot
that we want to build, in which we will include an RL system so it can
adapt to its environment.
We as the designers of the robot decide we want the robot to survive,
so we design it in a way that we think will maximize its odds of
survival. We create a reward signal that we think will make the odds
of survival easier for the reinforcement learning algorithm. Our
robot has a battery to store energy and the robot will need to
recharge its battery to survive. We want the robot to learn to keep
its batteries charged so it will survive.
There are a lot of ways to approach this problem in our design. We can
for example give it a reward signal that's a measure of the charge in
the battery. With a full charge, it will get a constant stream of
rewards. With a half-charge it will get half the rewards over time,
etc. With such a reward to deal with, the RL algorithm will in theory
learn to keep itself connected to the charger and to never move. If
it moves, it uses energy and gets less reward. If it disconnects from
the charger, its charge level will start to drop, and it will get less
reward. The optimum (simple) behavior for this problem is for the
robot to learn to connect itself to the charger and never do anything
else.
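A minimal Python sketch of that first design, under made-up numbers: the battery capacity, the charge and drain amounts per step, and all the function names here are hypothetical, chosen only to show the shape of the incentive.

```python
def charge_level_reward(charge, capacity=100):
    """Reward in [0, 1]: the fraction of the battery that is charged."""
    return charge / capacity

def next_charge(charge, docked, capacity=100, charge_step=5, move_cost=2):
    """Docked steps add charge (up to capacity); moving steps drain it."""
    if docked:
        return min(capacity, charge + charge_step)
    return max(0, charge - move_cost)

def episode_return(docked_schedule, charge=100):
    """Total reward for a fixed schedule of docked (True) / moving (False) steps."""
    total = 0.0
    for docked in docked_schedule:
        charge = next_charge(charge, docked)
        total += charge_level_reward(charge)
    return total

always_docked = episode_return([True] * 100)
always_moving = episode_return([False] * 100)
print(always_docked, always_moving)
```

Under this reward generator, sitting on the charger collects the maximum reward on every step, and any movement can only lower the total.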
Now, this robot won't be very "smart" about surviving a problem such
as a power failure because it's got no inherent motivation to ever
leave the home charger and go exploring and learning about the
environment - and learning for example that there are many other
places to plug in its charger. A robot that had gone exploring while
its battery was charged might have learned more about the environment,
so when that first power failure happened, it would already have
learned behaviors for driving next door and trying their power outlet.
A robot that is motivated to explore while it's got plenty of energy
might have a better chance of surviving. So we as the designers of the
robot have the option to change our design to motivate such behavior.
But we do this not by changing the RL abstraction, and not by changing
our learning algorithm in the robot, but instead, by changing the
reward signal we send the learning module. So from the perspective of
the learning module, we are changing the environment to make the robot
have different motivations.
We do it by adding your idea of "satisfied" to the design. Instead of
generating a reward signal based on charge level, we change the
hardware to create a reward signal based on rate of charge. Once the
battery is charged, the rate of charge goes to zero - and the rewards
for charging stop. If the robot sits there and does nothing, the
battery stays well charged, and the robot has to just wait for the
battery to drain by leakage before it can get more rewards for
charging. However, if it disconnects from the charger and starts
running in circles, it will drain the battery and then be able to get
more rewards, when it connects back to the charger. So now by
changing the reward signal, we have changed the robot's motivation from
"keep the battery charged" to "use as much energy as you can get".
And we have added the idea of "satisfied" to the design so it won't
just keep itself connected to the charger. If it gets a full battery,
it will stop consuming the limited resource.
However, that design is actually worse about dealing with a limited
power resource because that second design will consume as much
electricity as fast as it can. If the power suddenly runs out on the
robot, it will "die" shortly after that. But since the second design
is motivated to actually use its energy instead of conserve it, it's
more likely to develop exploring behaviors that could help it to
survive.
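Here is the same kind of rough Python sketch for the rate-of-charge design. Again the units, step sizes, and function names are hypothetical, and battery leakage is ignored to keep it short.

```python
def rate_of_charge_reward(charge_before, charge_after):
    """Reward only for charge added on this step; a full battery earns nothing."""
    return max(0, charge_after - charge_before)

def next_charge(charge, docked, capacity=100, charge_step=5, move_cost=2):
    """Docked steps add charge (up to capacity); moving steps drain it."""
    if docked:
        return min(capacity, charge + charge_step)
    return max(0, charge - move_cost)

def episode_return(docked_schedule, charge=100):
    """Total reward for a fixed schedule of docked (True) / moving (False) steps."""
    total = 0
    for docked in docked_schedule:
        new_charge = next_charge(charge, docked)
        total += rate_of_charge_reward(charge, new_charge)
        charge = new_charge
    return total

always_docked = episode_return([True] * 100)

# Drive for 25 steps (burning 50 units), then dock for 10 steps (regaining 50),
# and repeat that cycle three times.
drive_then_recharge = episode_return(([False] * 25 + [True] * 10) * 3)
print(always_docked, drive_then_recharge)
```

With this generator the robot that never leaves the charger earns nothing, and the only way to earn more is to burn energy first. That is both the "satisfied" effect and the reason this design consumes power as fast as it can.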
By making more adjustments to the reward signal generator (not to the
reinforcement learning algorithm we use or to the basic RL
abstraction), we can trade off the tendency of our robot to consume vs
conserve power to any level we like. We can keep the design for
giving it rewards for charging, but we can balance that with negative
rewards for using power in such a way as to regulate its typical usage
to some average level of power consumption that we feel is optimal for
the environment the robot will be trying to survive in. So if the
robot is on Mars and living off a limited solar cell power source, we
can regulate rewards so the robot's typical behavior is to use only 70%
of the power we expect it to be able to get each day from the sun.
Building it that way motivates the robot to explore but regulates its
use of power to fit what we expect to be available. We are
controlling in this way the robot's tendency to trade off exploration
vs exploitation not by adjusting the learning algorithm, but instead
by adjusting the reward signal we are sending to the learning
algorithm.
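One way such a balanced reward generator could look, as a Python sketch. The penalty weight, the daily energy figure, and the assumption that the robot can recharge whatever it used (up to what the sun provides) are all mine; only the 70% target comes from the example above.

```python
def regulated_reward(energy_used, expected_daily_energy=100.0,
                     target_fraction=0.7, overuse_penalty=2.0):
    """Daily reward: credit for energy recharged, minus a penalty for overuse.

    Assumes the robot can recharge what it used, up to the expected daily supply.
    """
    recharged = min(energy_used, expected_daily_energy)
    budget = target_fraction * expected_daily_energy
    overuse = max(0.0, energy_used - budget)
    return recharged - overuse_penalty * overuse

# Score a few candidate daily usage levels (hypothetical energy units).
candidate_usage = list(range(0, 101, 10))
best = max(candidate_usage, key=regulated_reward)
print(best)
```

Below the budget, using more energy earns more charging reward; above it, the penalty outweighs the gain. A plain reward maximizer therefore settles at the budget without being told anything about conservation.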
If you have a very very good learning algorithm, it will on its own,
given enough time, learn how to optimize its behavior to the
environment without us (acting as the designer) giving it "hints" by
designing a reward signal into the system that makes it naturally do
"what is right" for the environment. We can use a very simplistic
reward signal which gives it huge negative rewards for having its
battery run down and let the RL algorithm figure everything else out.
But in this case, the robot would have to "die" (run out of power)
many, many times on its own before it would be able to learn how to
prevent that from happening. If we were trying to make our robot
survive on its own on Mars, that wouldn't work, because we would have
to keep going up there and recharging its battery for years before it
learned how to do the right thing and conserve its energy and
maximize its charge.
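A small Python sketch of why that simplistic signal is so slow to learn from. The penalty size, the battery units, and the drain per step are hypothetical values picked for illustration.

```python
def survival_only_reward(charge, death_penalty=-1000.0):
    """Zero reward while the battery has charge, one large penalty when it is empty."""
    return death_penalty if charge <= 0 else 0.0

charge = 100    # hypothetical battery units
move_cost = 2   # units drained per step while wandering
rewards = []
while True:
    charge = max(0, charge - move_cost)
    rewards.append(survival_only_reward(charge))
    if charge == 0:
        break

# Steps on which the learner received no information at all.
steps_without_feedback = sum(1 for r in rewards if r == 0.0)
print(steps_without_feedback, rewards[-1])
```

Every step before the last one returns exactly zero, so the learner gets no hint that anything is going wrong until the battery is already empty.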
The stronger the learning algorithm, the less important the reward
signal becomes in terms of making it easy for the agent to learn to do
the right thing. But a weaker learning algorithm can be put to good
use in an agent if you build a lot of the "smarts" into the reward
signal instead of into the raw strength of the learning algorithm
itself.
All this happens however, without any need to define a new type of RL
abstraction, or a new type of reinforcement learning algorithm. The
normal one always works. You don't need to re-define the interface
between the learning agent, and the reward generator to do any of
this. The interface is a reward signal which the agent tries to
maximize - end of story. You don't add "conserve your reward level to
x units of reward per time" into the interface definition and then
push that part of the requirement off on to the learning algorithm.
You keep that requirement implemented in the reward generator instead
and keep the work of the learning algorithm the same as always -
maximize the reward signal you are given.
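To illustrate that separation, here is a Python sketch in which the maximizer is identical in both runs and only the reward generator is swapped. Everything in it is an assumption for illustration: the five-level battery, the two actions, the discount factor, and the use of plain value iteration as a deterministic stand-in for a learning algorithm (a real RL learner would have to estimate these values from experience).

```python
CAPACITY = 4
ACTIONS = ("charge", "move")

def next_charge(charge, action):
    """Charging adds one unit (up to CAPACITY); moving burns one unit (down to 0)."""
    if action == "charge":
        return min(CAPACITY, charge + 1)
    return max(0, charge - 1)

def level_reward(before, after):
    """Reward generator 1: proportional to the charge level."""
    return after / CAPACITY

def rate_reward(before, after):
    """Reward generator 2: only for charge added on this step."""
    return float(max(0, after - before))

def greedy_policy(reward_fn, gamma=0.9, sweeps=200):
    """Reward maximizer. It never looks inside reward_fn; it only sees its numbers."""
    values = [0.0] * (CAPACITY + 1)

    def action_value(charge, action):
        after = next_charge(charge, action)
        return reward_fn(charge, after) + gamma * values[after]

    # Value iteration: repeatedly back up the best achievable value per state.
    for _ in range(sweeps):
        values = [max(action_value(c, a) for a in ACTIONS)
                  for c in range(CAPACITY + 1)]
    return {c: max(ACTIONS, key=lambda a: action_value(c, a))
            for c in range(CAPACITY + 1)}

print(greedy_policy(level_reward)[CAPACITY])
print(greedy_policy(rate_reward)[CAPACITY])
```

With the level-based generator the best action at full charge is to keep charging; with the rate-based generator it is to move. The "satisfied" behavior comes entirely from the reward generator, and greedy_policy was not touched.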
All problems can be translated into RL problems like that. The
environment defines the problem, and the reward signal (from the
perspective of the learning algorithm) is part of the environment.
The reward signal is what defines the motivation of the learning agent
by translating environmental conditions into the reward signal which
the learning agent is trying to maximize.
Because all these problems can be translated into pure RL problems, we
have fractured the domain of learning in half by using the RL
abstraction. The first half of the problem is creating strong generic
RL learning algorithms with no a priori knowledge of what reward
signal or environment the algorithm will be used with. That's the
problem generally studied in RL research. The second half of the
problem is designing good reward signals for solving practical
problems using RL learning systems. This second half is where your
problem seems to lie, in my view, and it's one that is not typically
explored in RL research because it's really more of a practical
engineering problem than a theoretical machine learning problem.
By creating better RL algorithms, we are solving _all_ problems that
can be defined with the right reward generator.
> Of course there are some possible variants such as that the non
> renewable resources "reward" are renewed after some period of time and
> so on.
>
> Also the same situation with two competing agents seems to be very pretty.
You seem to be wandering off into what tends to be more of a game
theory area of study than RL.
When two RL agents compete against each other in the same environment
at the same time, then the first RL agent becomes part of the
environment the second RL agent is trying to understand and manipulate
and vice versa. The two agents typically each have their own reward
signal generators however and as such, are solving different problems.
They each for example might be trying to maximize the charge in their
own battery. Or they each might be trying to consume a maximum number
of food pellets. When their goals conflict, their learned behaviors
will naturally conflict (you get some limited form of what we could
call war between the agents).
If they both get the same reward signal, then they stop acting like
two competing agents, and start working together as if they were just
one agent.
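A toy Python sketch of that difference. The pellet counts in the table are invented for illustration, as are the action names and helper functions; the only point is how agent A's best choice changes with the reward generator it is given.

```python
# Hypothetical pellet counts: PELLETS[(a, b)] = (pellets for agent A, pellets for agent B).
PELLETS = {
    ("polite", "polite"): (5, 5),
    ("polite", "greedy"): (1, 7),
    ("greedy", "polite"): (7, 1),
    ("greedy", "greedy"): (3, 3),
}
ACTIONS = ("polite", "greedy")

def own_reward(mine, theirs):
    """Each agent has its own reward generator: only its own pellets count."""
    return mine

def shared_reward(mine, theirs):
    """Both agents receive the same signal: the total pellets eaten."""
    return mine + theirs

def best_response_of_a(b_action, reward_fn):
    """Agent A's reward-maximizing action, given what agent B does."""
    def reward(a_action):
        a_pellets, b_pellets = PELLETS[(a_action, b_action)]
        return reward_fn(a_pellets, b_pellets)
    return max(ACTIONS, key=reward)

for b_action in ACTIONS:
    print(b_action,
          best_response_of_a(b_action, own_reward),
          best_response_of_a(b_action, shared_reward))
```

With separate reward signals, A's best move is to grab pellets whatever B does. With the shared signal, A's best move is to cooperate whatever B does, which is the "one agent" behavior described above.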
When one RL system is trying to deal with an environment that has one
or more competing RL agents of the same power as part of the
environment, the problem becomes very difficult simply because the
environment becomes far more complex than the agent can possibly
model. One agent of a given power to model an environment can't hope
to model a complex environment, and another agent with the same sized
model at the same time. So you are looking at what happens when none
of the agents can fully "understand" what the other agents will do.
They have to use simplifying assumptions about the environment and
simply do the best they can. What sort of behaviors emerge as a whole
from such systems of competing agents is a complex function of the
ability of the agents, and the motivations they have each been given,
and the nature of the rest of the environment they are interacting
with. It's generally so complex it's hard at times to find emergent
behaviors that even can be studied.
But my point in all this, is that pure RL research normally limits
itself to solving only the problem that is defined by the standard RL
abstraction of reward maximizing and it uses test environments that
are roughly suited to the learning power of the algorithm(s) being
tested. There is no need to redefine what RL is if you want to
explore the behavior and power of RL algorithms in a specific problem
domain created by one specific reward signal paired with one specific
environment. Defining a specific new problem domain doesn't change in
any way, what the RL algorithm is trying to do. It's still just
trying to discover the behaviors that work best for maximizing future
rewards.
> bests,
> -José