If you don't know which step was the cause of the reward, and it could be any of the past 100 steps, then you have to use some strategy (such as eligibility traces) to apply some of the reward to all the steps that could have caused it. Even though you are applying the reward to actions that didn't cause it, this can be made to average out so that the "good" actions will statistically show their true value over time. Delayed rewards don't stop the system from learning (as long as some of the reward is applied to the actions that actually caused it); they simply add more noise to the system and slow down learning. However, the states and actions must be statistically uncorrelated for the noise to average out over time.
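As a rough illustration of what that credit spreading looks like, here's a toy sketch in Python (my own made-up names and constants, not code from any particular package):

    # Minimal sketch: spread each reward over recently visited (state, action)
    # pairs using a decaying eligibility trace.  ALPHA and LAMBDA are just
    # illustrative values.
    ALPHA = 0.1     # learning rate
    LAMBDA = 0.9    # how fast eligibility fades with each step

    q = {}          # value estimates, keyed by (state, action)
    trace = {}      # eligibility of each recently visited (state, action)

    def learn_step(state, action, reward):
        """Credit the incoming reward to every recently visited pair."""
        # fade the existing traces, then mark the current pair fully eligible
        for key in list(trace):
            trace[key] *= LAMBDA
            if trace[key] < 1e-4:
                del trace[key]
        trace[(state, action)] = 1.0

        # every eligible pair gets a share of the reward; the credit given
        # to unrelated pairs is the "noise" that has to average out over time
        for key, eligibility in trace.items():
            q[key] = q.get(key, 0.0) + ALPHA * reward * eligibility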
I've not studied the mountain car code, so I can't comment on how well it will fit what you seem to be describing, but it seemed (from the comments) to be using eligibility traces, which is the basic technique for dealing with delayed rewards.
In your application, the trick is to take advantage of whatever you know about how the environment works to simplify the learning problem. If rewards always show up exactly 100 steps later and the 99 actions you performed in between have no effect on the reward, then you don't want to apply the reward to those 99 actions. You use an eligibility trace that matches what you know about how the environment works - one that applies no reward to the last 99 actions and the full reward to the action performed 100 steps back - or whatever type of profile is required based on how your environment works.
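For instance, if you really did know the delay was always exactly 100 steps, the "trace" degenerates into a simple delay line. A toy sketch (names and constants made up):

    from collections import deque

    DELAY = 100    # assumed: the reward always arrives exactly this many steps late
    ALPHA = 0.1

    q = {}                           # value estimates keyed by (state, action)
    recent = deque(maxlen=DELAY)     # the last DELAY (state, action) pairs

    def learn_step(state, action, reward):
        """Give the whole reward to the pair taken exactly DELAY steps ago."""
        if len(recent) == DELAY:
            old_pair = recent[0]     # the pair from 100 steps back
            q[old_pair] = q.get(old_pair, 0.0) + ALPHA * reward
        recent.append((state, action))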
If you can share more specifics about your application, people here
can probably give you more specific answers on whether the mountain
car code applies, and what type of changes you would have to make for
it to work.
Curt
Ah, good point. My thinking is a bit flawed (not unusual for me). I spend a lot of time thinking about high-dimensional RL problems where the state signal doesn't have the Markov property. It tends to bias my thinking about RL in general. I don't claim to have a good understanding of the domain even though it's the domain I'm interested in. Your point has made me do a bit more thinking (and learning).
If you have a state signal with the Markov property, and the state space is small enough that you can implement any of the standard algorithms (which require a value array as large as the state space, or as large as the state/action pair space), then yes, they do all solve the delayed reward problem. It's only a question of how much training is required, based on how large the state space is. Multistep backups are not required to make it converge (as you point out), but they can speed up the convergence.
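In tabular terms, that value array and a plain one-step backup look something like the sketch below (sizes and constants made up for illustration):

    import numpy as np

    N_STATES = 1024      # whatever the size of the (Markov) state space is
    N_ACTIONS = 2
    ALPHA, GAMMA = 0.1, 0.95

    # the value array "as large as the state/action pair space"
    Q = np.zeros((N_STATES, N_ACTIONS))

    def one_step_backup(s, a, r, s_next):
        """Plain one-step Q-learning backup; multistep backups aren't needed
        for convergence, they just tend to speed it up."""
        Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])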
If the state signal has the Markov property, then by definition there is no delay between the state signal and the reward. There is only a delay between the actions and the reward. That is, the RL algorithm has to learn the correct path through the state space to get the reward, but the reward is associated directly (with no delay) with the state signal. If the state signal doesn't have the Markov property, and there is a delay between the perceived (non-Markov) state and the reward, then the standard TD algorithms won't always be able to solve the problem.
For example, let's assume we have a binary sensory input and a binary action decision for each input. The environment will give us a reward of 1 if we produce the output action that matches the input sensory signal. But the reward is delayed by 10 time steps. So the reward we receive after each action is a measure of what we did 10 steps in the past, not of what we just did.
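Here's one way that toy environment might be coded up (just my own sketch of the example):

    import random

    class DelayedMatchEnv:
        """The toy example above: binary observation, binary action, and a
        reward of 1 that arrives 10 steps after the action that earned it."""
        DELAY = 10

        def __init__(self, seed=0):
            self.rng = random.Random(seed)
            self.pending = [0] * self.DELAY    # rewards still "in flight"
            self.obs = self.rng.randint(0, 1)

        def step(self, action):
            # queue the reward for this action; it pays out DELAY steps from now
            self.pending.append(1 if action == self.obs else 0)
            reward = self.pending.pop(0)       # reward for the action 10 steps back
            self.obs = self.rng.randint(0, 1)
            return self.obs, reward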
The "state" input we have is only binary - it gives the impression of
a two-state environment. But the environment is actually more complex
than that. To create an internal state signal that has the Markov
property, we have to remember the last 10 inputs and the last 10
actions we selected, and use all that information to define the true
state of the environment. So that gives us not a 1 bit environment
with 2 states, but a 20 bit environment with 2^20 (1048576) states.
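Packing that history into a single table index is straightforward; something like this sketch:

    from collections import deque

    class HistoryState:
        """Pack the last 10 observations and the last 10 actions into one
        20-bit integer: the 2**20-state Markov signal described above."""
        def __init__(self, depth=10):
            self.obs_bits = deque([0] * depth, maxlen=depth)
            self.act_bits = deque([0] * depth, maxlen=depth)

        def update(self, observation, action):
            self.obs_bits.append(observation)
            self.act_bits.append(action)
            state = 0
            for bit in list(self.obs_bits) + list(self.act_bits):
                state = (state << 1) | bit
            return state    # an index into a value table with 2**20 entries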
Using this improved 20-bit state signal and a state space with 1,048,576 states, we can now use any of the TD algorithms to solve this learning problem. But the state space is so large that it will take a very long time for the value array to converge on the correct answer (and it will use up a few megabytes of memory for that array).
However, the same problem can be solved with RL using the 1-bit sensory data and a simple 2-state environment model by combining it with an eligibility trace that goes back 10 time steps. Each reward gets partially applied to the last 10 state/action pairs visited. Though the last 9 state/action pairs have nothing to do with why the reward showed up, the one state/action pair from 10 steps back in time correlates perfectly with the reward value. Statistically, over time, this correlation would surface, allowing the system to learn which action to produce for each of the two states. But I believe there's a requirement about the correlations between actions and states for this to be solvable. I think I need to write some code and test this as an education project for myself....
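Roughly the experiment I have in mind - a uniform 10-step trace over the raw 1-bit observation, with the toy environment inlined so it runs on its own. Everything in it (constants, step count, the simple running-average update) is a guess at what would demonstrate the idea, not a worked-out result:

    import random

    rng = random.Random(0)
    ALPHA, DELAY, EPSILON = 0.05, 10, 0.1

    Q = [[0.0, 0.0], [0.0, 0.0]]    # indexed by the raw 1-bit observation
    history = []                     # the last DELAY (observation, action) pairs
    pending = [0] * DELAY            # rewards still in flight (the environment)
    obs = rng.randint(0, 1)

    for t in range(200000):
        # epsilon-greedy choice from the tiny 2 x 2 table
        if rng.random() < EPSILON or Q[obs][0] == Q[obs][1]:
            action = rng.randint(0, 1)
        else:
            action = 0 if Q[obs][0] > Q[obs][1] else 1

        # environment: reward 1 if the action matched the observation,
        # paid out DELAY steps later
        pending.append(1 if action == obs else 0)
        reward = pending.pop(0)

        # uniform 10-step trace: nudge every pair in the window toward the
        # reward; only the oldest pair (10 steps back) actually caused it,
        # and that correlation is what should separate the two actions
        for (s, a) in history:
            Q[s][a] += ALPHA * (reward - Q[s][a])

        history.append((obs, action))
        history = history[-DELAY:]
        obs = rng.randint(0, 1)

    print(Q)    # if the idea holds, Q[0][0] and Q[1][1] should end up larger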
So the real question for Eduard is whether he's able to create a state signal that has the Markov property for his application, and whether, in doing that, the state space is small enough to be practical for whatever he is trying to do. The reward can be delayed from the actions used to reach the reward state, and the standard TD methods will solve it. But if the state signal doesn't predict the reward without delay, then the state signal doesn't have the Markov property, and standard TD methods might or might not be able to solve it.
At least I think that's the correct issue to look at here. I could
easily be wrong. :)
I assume the mountain car problem has been specified so that the state signal does have the Markov property (I'll have to look at that) - but does the problem Eduard is looking at have a state signal with the Markov property, or does the reward delay he's thinking about mean that his sensory signals don't have the Markov property on their own?
Curt