delayed rewards in RL

Eduard

Jan 12, 2009, 1:17:12 PM
to Reinforcement Learning Mailing List
I am working with RL, and I have a big concern about whether my
algorithm can overcome a delayed reward. I am using the Mountain Car
code to develop my own application.

http://www.cs.ualberta.ca/~sutton/MountainCar/MountainCar.html.

In this example there is no delayed reward. In my work I get a reward
only after 100 steps. My question is: using this algorithm, will my
program be able to learn anything, or should I change the framework?
If I am moving from state to state, and only at state number 100 do I
get all the punishment or reward, without the values of the previous
states changing, I have the feeling that it is going to be very hard
for the algorithm to learn a good policy.

If the Mountain Car is not suitable for developing my work, can anyone
please recommend another example or another algorithm that fits my
idea?

Thank you.

Carlos Diuk

Jan 12, 2009, 1:26:03 PM
to rl-...@googlegroups.com
Hi Eduard,

Here's a recent paper that deals with this problem:
http://paul.rutgers.edu/~thomaswa/ecml07Delayed.pdf

-Carlos

Brian Tanner

Jan 12, 2009, 2:08:35 PM
to rl-...@googlegroups.com
I haven't read the document that Carlos has linked to... but I'm not
sure I agree that there is no delayed reward in Mountain Car. Because
all actions in all states EXCEPT the one that escapes the valley give
the same reward, Mountain Car very much has the issue of delayed
reward.

Maybe I understand it a different way though.

I will point out that there is an updated implementation of Mountain
Car; details here:
http://library.rl-community.org/environments/mountaincar

It works with RL-Glue, the cross-language and cross-platform protocol
for reinforcement learning:
http://glue.rl-community.org



--
Brian Tanner
Ph.D Student, University of Alberta
br...@tannerpages.com

"José Antonio Martín H."

Jan 12, 2009, 2:34:29 PM
to rl-...@googlegroups.com
Hi Eduard,

In order to give you an answer, could you please explain your
question a little bit more?

At least for me, the concept of delayed reward you are using is not
clear, since in the standard MountainCar there is indeed a delayed
reward.

best wishes,
jamh.

--

====================================================
Jose Antonio Martin H.
Departamento de Sistemas Informáticos y Computación
Facultad de Informática
Universidad Complutense de Madrid
Ciudad universitaria, 28040 Madrid
====================================================

Eduard

Jan 12, 2009, 4:40:13 PM
to Reinforcement Learning Mailing List
Hello, thanks for all the fast answers. Sorry, I do not know what I
was thinking; Mountain Car is indeed a very good example of delayed
learning. But my concern is still there. In Mountain Car we treat all
the states the same, always giving the same reward of -1. In my case I
do not treat all of them the same: I am going to punish heavily at
state number 100, and afterwards at state number 200, because those
are the states where I get the punishment. The other states receive no
feedback at all. My application is very similar to the Job Scheduling
Problem: only after 100 schedulings do I find out how good the policy
was, and that is why it is in that state that I get the
punishment/reward.

What I am saying is that I have the feeling that, as presented, my
application is going to be impossible to learn. Is my intuition right?
If so, what other solutions can I apply? I saw a post about there not
being many real applications of RL; this is for a very real
application, and when my work is over I will be glad to share it.

Thanks for all the answers,

Michael Littman

Jan 13, 2009, 12:16:35 PM
to rl-...@googlegroups.com
> But my concern is still there, in Mountain Car we treat all
> the states the same, giving always the same reward -1.

Except the goal states... so you can think of it as zero reward
everywhere except for a positive reward at the goal state.

I think I'm not understanding what you mean by your query.


Curt Welch

Jan 13, 2009, 2:00:55 PM
to rl-...@googlegroups.com
If you know for a fact that the action which created the reward
happened _exactly_ 100 steps before the reward, then you need to apply
the reward (when it shows up) to the actions that were selected in the
state 100 steps back.

If you don't know which step was the cause of the reward, and it could
be any of the past 100 steps, then you have to use some strategy (such
as eligibility traces) to apply some of the reward to all the steps
that could have caused it. Even though you are applying the reward to
actions that didn't cause the reward, this can be made to average out
over time so that the "good" actions will statistically show their
true value in time. Delayed rewards don't stop the system from
learning (as long as you are applying some of the reward to the
actions that caused the reward), it simply adds more noise to the
system and slows down learning. However, the states and actions must
be statistically uncorrelated for the noise to be averaged out over
time.

I've not studied the mountain car code, so I can't comment on how well
it will fit what you seem to be describing, but it seemed (from the
comments) to be using eligibility traces, which is the basic technique
for dealing with delayed rewards.

In your application, the trick is to take advantage of whatever you
know about how the environment works to simplify the learning problem.
If rewards always show up exactly 100 steps later and the 99 actions
you performed in between have no effect on the reward, then you don't
want to apply the reward to those 99 actions. You use an eligibility
trace that matches what you know about how the environment works - one
that applies none of the reward to the last 99 actions and the full
reward to the action performed 100 steps back - or whatever type of
profile is required based on how your environment works.
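
Just to make that concrete, here is a rough Python sketch of that kind
of "delta" credit profile. This is a toy I'm making up for
illustration, not the Mountain Car code; the names, the tabular Q
dictionary, and the lack of bootstrapping are all my own simplifying
assumptions.

from collections import deque

# Hypothetical tabular setup: Q maps (state, action) -> value.
# N_DELAY, ALPHA and the update rule are illustrative assumptions.
N_DELAY = 100
ALPHA = 0.1

Q = {}                                # Q-values, default 0.0
history = deque(maxlen=N_DELAY + 1)   # current pair plus the N_DELAY before it

def on_step(state, action, reward):
    """Credit the reward only to the pair visited N_DELAY steps ago,
    giving zero credit to the steps in between."""
    history.append((state, action))
    if reward != 0 and len(history) > N_DELAY:
        old_s, old_a = history[0]     # the pair exactly N_DELAY steps back
        old_q = Q.get((old_s, old_a), 0.0)
        Q[(old_s, old_a)] = old_q + ALPHA * (reward - old_q)

In a full TD setup you would also bootstrap from the value of the next
state, but the point is just that the credit window is something you
get to choose based on what you know about the environment.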

If you can share more specifics about your application, people here
can probably give you more specific answers on whether the mountain
car code applies, and what type of changes you would have to make for
it to work.

Curt

"José Antonio Martín H."

Jan 13, 2009, 2:42:44 PM
to rl-...@googlegroups.com
Hi,

I am sorry, but I think we are not helping Eduard very much.

Learning from Delayed Rewards was the title of Watkins' PhD thesis,
and indeed this is the classical situation in Reinforcement Learning
(RL). That is exactly what the classical RL algorithms do: learn from
a reward signal that is not immediate to an action.

Algorithms such as Q-learning and SARSA converge with probability 1 to
an optimal policy in delayed-reward problems.

Given that, why must Eduard use eligibility traces or think of some
intricate strategy to solve a delayed reward problem?
(Of course, eligibility traces help to improve convergence and may
speed up learning, but they can also disturb it when not used
properly.)

So why not simply use SARSA or Q-learning?

For the first 99 steps without reward you can use an ad hoc "artificial
reward" signal, and likewise for the rest of the states where you do
not receive a "real reward".

You have to choose which artificial reward to use, for instance -1 or
0. This will depend on the initialization of your Q-table. If your
Q-table is initialized with all Q(s,a)=0, then you should avoid using
an "artificial reward" of zero, since that will kill exploration in
the initial episodes and may retard learning.

The fact is that the algorithm, be it Q-learning or SARSA, will tend
to select the 99 actions preceding the reward time in such a way that
the total "true reward" is maximized. This indeed means that the "true
reward" R is a function of the selected actions.
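
To make this concrete, here is a minimal tabular Q-learning sketch
using such an artificial per-step reward of -1. The environment
interface, action set and hyperparameters are only illustrative
assumptions of mine, not Eduard's actual setup.

import random
from collections import defaultdict

# Assumed episodic interface: env.reset() -> state,
# env.step(action) -> (next_state, true_reward, done).
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
ACTIONS = [0, 1, 2]
ARTIFICIAL_REWARD = -1.0          # per-step reward for the "silent" states

Q = defaultdict(float)            # Q[(state, action)], initialized to 0

def choose_action(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_episode(env):
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state)
        next_state, true_reward, done = env.step(action)
        # Use the true reward when it appears (e.g. at step 100),
        # otherwise the constant artificial reward of -1.
        r = true_reward if true_reward != 0 else ARTIFICIAL_REWARD
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
        state = next_state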

Now Eduard, think about this:
isn't the "artificial reward" mentioned above precisely the "true
reward" of the MountainCar problem for all the non-goal states?

This is the concept of delayed reward in RL: a signal that tells your
algorithm at a time step (t) how well it has performed prior to that
time step, and in general you cannot know which state-action
combination originated such a reward until you have learned to
estimate the Q-values of the state-action pairs.

Hope this helps..
jamh.




Curt Welch

Jan 13, 2009, 11:11:37 PM
to rl-...@googlegroups.com
On Tue, Jan 13, 2009 at 2:42 PM, "José Antonio Martín H."
<jama...@fdi.ucm.es> wrote:
>
> Hi,
>
> I am sorry but I think we are not helping too much to Eduard.
>
> Learning from delayed rewards was the title of Watkins PhD work. and
> indeed this is the classical situation in Reinforcement Learning (RL).
> That is exactly what do the classical RL algorithms: to learn from a
> reward signal that is not immediate to an action.
>
> Such algorithms as Q-learning and SARSA converge with probability 1 to
> an optimal policy in delayed rewards problems.
>
> Given that: Why must Eduard use eligibility traces or think in some
> intricate strategy to solve a delayed reward problem ?
> (of course, eligibility traces help to improve convergence and may help
> to speed up learning but also they can disturb when not used properly)
>
> So why not to use simply SARSA or Q-Learning ?

Ah, good point. My thinking is a bit flawed (not unusual for me).

I spend a lot of time thinking about high-dimensional RL problems
where the state signal doesn't have the Markov property. It tends to
bias my thinking about RL in general. I don't claim to have a good
understanding of the domain even though it's the domain I'm interested
in. Your point has made me do a bit more thinking (and learning).

If you have a state signal with the Markov property, and the state
space is small enough that you can implement any of the standard
algorithms (which require a value array as large as the state space,
or as large as the state/action pair space), then yes, they do all
solve the delayed reward problem. It's only a question of how much
training is required, based on how large the state space is. Multistep
backups are not required to make it converge (as you point out), but
they can speed up the convergence.

If the state signal has the Markov property, then by definition there
is no delay between the state signal and the reward. There is only a
delay between the actions and the reward. That is, the RL algorithm
has to learn the correct path through the state space to get the
reward, but the reward is associated directly (with no delay) with the
state signal. If the state signal doesn't have the Markov property,
and there is a delay between the perceived (non-Markov) state and the
reward, then the standard TD algorithms won't always be able to solve
the problem.

For example, let's just assume we have a binary sensory input and a
binary action decision for each input. The environment will give us a
reward of 1 if we produce the output action which matches the input
sensory signal. But the reward is delayed by 10 time steps. So the
reward we receive after each action is a measure of what we did 10
steps in the past, and not of what we just did.

The "state" input we have is only binary - it gives the impression of
a two-state environment. But the environment is actually more complex
than that. To create an internal state signal that has the Markov
property, we have to remember the last 10 inputs and the last 10
actions we selected, and use all that information to define the true
state of the environment. So that gives us not a 1 bit environment
with 2 states, but a 20 bit environment with 2^20 (1048576) states.
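
In code, building that 20-bit state might look something like this (a
quick sketch; the names and the bit-packing convention are just my own
choices):

from collections import deque

DELAY = 10
inputs = deque([0] * DELAY, maxlen=DELAY)    # last 10 binary inputs
actions = deque([0] * DELAY, maxlen=DELAY)   # last 10 binary actions

def markov_state(new_input, last_action):
    """Fold the newest input and the previously chosen action into a
    single integer in [0, 2**20) representing the 20-bit history state."""
    inputs.append(new_input)
    actions.append(last_action)
    bits = list(inputs) + list(actions)
    return sum(bit << i for i, bit in enumerate(bits))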

Using this improved 20-bit state signal and a state space with 1048576
states, we can now use any of the TD algorithms to solve this learning
problem. But the state space is so large that it will take a very long
time for the value array to converge on the correct answer (and it
will use up a few megabytes of memory for that value array).

However, the same problem can be solved with RL using the 1-bit
sensory data and a simple 2-state environment model by combining it
with an eligibility trace that goes back 10 time steps. Each reward
gets partially applied to the last 10 state/action pairs visited.
Though the last 9 state/action pairs have nothing to do with why the
reward showed up, the one state/action pair 10 steps back in time
correlates perfectly with the reward value. Statistically, over time,
this correlation would surface, allowing the system to learn what
action to take for each of the two states. But I believe there's a
requirement about the correlations between actions and states for this
to be solvable. I think I need to write some code and test this as an
education project for myself....
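
If I were to try it, the sketch would probably look something like
this (completely untested; the flat 10-step credit window and the
hyperparameters are just made up for illustration):

import random
from collections import deque

DELAY, ALPHA, EPSILON = 10, 0.05, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]        # Q[state][action] for the 2-state model
trace = deque(maxlen=DELAY)         # last DELAY (state, action) pairs

def act(state):
    """Epsilon-greedy action for the 1-bit state."""
    if random.random() < EPSILON:
        return random.randint(0, 1)
    return 0 if Q[state][0] >= Q[state][1] else 1

def step(state, action, reward):
    trace.append((state, action))
    # Spread the delayed reward evenly over the last DELAY pairs; only
    # one of them actually caused it (the exact offset depends on when
    # the environment delivers the reward), and the credit handed to
    # the others should average out as noise over many steps.
    for s, a in trace:
        Q[s][a] += (ALPHA / DELAY) * (reward - Q[s][a])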

So the real question for Eduard is whether he's able to create a state
signal that has the Markov property for his application and whether,
in doing that, the state space is small enough to be practical for
whatever he is trying to do. The reward can be delayed from the
actions used to reach the reward state, and the standard TD methods
will solve it. But if the state signal doesn't predict the reward
without delay, then the state signal doesn't have the Markov property,
and standard TD methods might not (or might) be able to solve it.

At least I think that's the correct issue to look at here. I could
easily be wrong. :)

I assume the mountain car problem has been specified so that the state
signal does have the Markov property (I'll have to look at that) - but
does the problem Eduard is looking at have a state signal with the
Markov property or does the reward delay he's thinking about mean that
his sensory signals don't have the Markov property on their own?

Curt

Eduard

Jan 14, 2009, 11:35:49 AM
to Reinforcement Learning Mailing List
Hello, thanks for all the answers. I have not forgotten this; I am
still analyzing all the answers and their relation to my problem. So
it may take me a couple of days until I can say something meaningful
again.

Thanks again.
