Yes, the same issues arise in connection with SARSA and Q-learning; see, e.g.,
G. Gordon. Chattering in Sarsa(λ). CMU Learning Lab Internal Report.
Available at www.cs.cmu.edu/~ggordon, 1996.
Positive results about SARSA with function approximation are few,
and they come under stringent conditions; see, e.g.,
Perkins, T. J. and Precup, D. (2003). A convergent form of approximate
policy iteration. In Becker, S., Thrun, S., and Obermayer, K., editors,
Advances in Neural Information Processing
Systems 15, pages 1595--1602, Cambridge, MA, USA. MIT Press.
Van Roy, B. (2006). Performance loss bounds for approximate value
iteration with state aggregation. Mathematics of Operations Research,
31(2):234--244.
Cheers,
- Csaba
Thomas
--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en
On 10-11-04 9:38 AM, Yaxin Liu wrote:
> Are you not discounting?
It does not matter:)
The divergence is coming from the dynamics of the learning process.
- Csaba
>
> On Thu, Nov 4, 2010 at 8:07 AM, twan...@googlemail.com wrote:
>
>
>
> On 4 Nov., 15:48, Csaba Szepesvari <szepe...@ualberta.ca> wrote:
> > Hi Thomas,
> >
> > Yes, the same issues arise in connection to SARSA and Q-learning,
> see, e.g.
> > G. Gordon. Chattering in Sarsa(λ). CMU Learning Lab Internal Report.
> > Available at www.cs.cmu.edu/~ggordon, 1996.
> >
> > Positive results about SARSA and function approximation are not many,
> > and come under stringent conditions, see e.g.
> >
>
> Thanks for the quick answer. After scanning through the paper I see
> that the Q-functions can oscillate. Is it possible that the
> Q-functions go to infinity, as in my case, too?
>
> Thanks again,
> Thomas
>
Yes, very much.
- Csaba
> Thanks again,
> Thomas
>
> On 10-11-04 9:38 AM, Yaxin Liu wrote:
>> Are you not discounting?
>
> It does not matter:)
> The divergence is coming from the dynamics of the learning process.
Does this only happen in an online setting where one uses every single
observation to update the Q-function (as in classical Q-learning or Sarsa),
or can the divergence and chattering also occur in a batch-mode setting
where the set of observations does not change?
Cheers,
Alex
Now I'm confused. I just thought I understood the chattering thing, but
apparently Sarsa with linear function approximation (that is what we are
talking about, right?) can also completely diverge? So when does it
chatter and when does it converge?
Sorry to be a nitwit (working on that...).
Best,
Frans
The connection between chattering and divergence can be seen in the fact that the chattering region may be extremely large. In other words, the result says that SARSA+LFA may chatter within some region, but it does not specify the size of that region. Perhaps some bounds could be derived in certain cases; I think one major factor is how rapidly the behavior policy changes: the more slowly it changes, the more convergent the behavior.
Cheers,
Hamid
Apologies in case this is obvious to you already, but are you
adjusting your learning rate to accommodate the number of tilings? A
standard trick is to take your desired step size and divide it by the
number of tilings. In my personal experience "bad things" can happen
if the step size is too large, so if you're not accounting for the
multiple tilings it could just be a case of that problem.
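The trick can be sketched in a few lines of illustrative Python; the function name and the specific numbers below are made up for the example, not taken from any library:

```python
# Sketch of the step-size trick described above: with N overlapping
# tilings each state activates N tiles, so divide the desired step size
# by N before updating. All names and values here are illustrative.

def td_update(weights, active_tiles, target, alpha, num_tilings):
    """One TD-style update on a tile-coded value function."""
    step = alpha / num_tilings              # the trick: scale by N
    value = sum(weights[t] for t in active_tiles)
    delta = target - value
    for t in active_tiles:
        weights[t] += step * delta
    return weights

# With alpha = 0.5 and 8 tilings, one update moves the value estimate
# halfway toward the target instead of four times past it.
w = td_update([0.0] * 16, active_tiles=[0, 2, 5, 7, 9, 11, 13, 15],
              target=1.0, alpha=0.5, num_tilings=8)
print(sum(w[t] for t in [0, 2, 5, 7, 9, 11, 13, 15]))  # 0.5
```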
--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com
No, he is saying we do not know: there is no specific example published
in the literature that would show the divergence of SARSA with linear
function approximation.
- Csaba
Like Thomas, I am a bit puzzled. To me it seems that the statements:
A) "For sarsa [with linear function approximation] there is not yet a
complete useful convergence proof, but there is a proof on non-divergence."
and
B) "Are you saying divergence isn't possible for SARSA in combination
with tile coding?"
C) "No, he is saying we do not know: there is no specific example
published in the literature that would show the divergence of SARSA with
linear function approximation."
are contradictory? To be completely clear: I am interpreting (A) as
saying "there is a proof that SARSA with linear function
approximation does not diverge", which would imply the answer "Yes"
to question (B)?
Am I missing something?
Best,
Frans
The update for tile coding is basically:
p'_{t+1} = p'_t + a*(T_t - p'_t f(s_t)) f'(s_t)
Here p_t is the parameter vector, f(s_t) is the feature vector at time
t, a is the learning rate and x' is the transpose of vector x. I'm
using T_t to denote the target, which may be a Monte Carlo sample, a
Sarsa temporal difference, a Q-learning temporal difference or
whatever. Then, the value at t+1 is
V_{t+1}(s_t) =
p'_{t+1} f(s_t) =
( p'_t + a*(T_t - p'_t f(s_t)) f'(s_t) ) f(s_t) =
V_t(s_t) + a*f'(s_t)*f(s_t)*(T_t - V_t(s_t))
It would seem to me that the effective learning rate (on the value,
not the parameters) is then a*f'(s_t)*f(s_t). With binary tile-coded
features this inner product equals N, the number of tilings, so the
effective rate is N*a. In other words, if you use multiple tilings,
you might effectively be using a learning rate larger than one. With
such a learning rate it is no wonder that you diverge, and this does
not have anything to do with chattering per se.
This seems to confirm Brian's intuition that dividing by the number of
tilings helps you out: if you divide the learning rate by the number
of tilings and therefore you keep it below 1/N, the effective learning
rate will be lower than one. It also might explain why indeed the
probability of divergence goes up if you use more tilings and why no
divergence occurs when you normalize.
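To make the arithmetic concrete, here is a small numerical sketch of this argument; the values (N = 12 binary features, a = 0.25, so an effective rate of 3) are chosen for illustration only:

```python
# Numerical sketch of the effective-learning-rate argument: with binary
# tile-coded features, f'(s) f(s) = N (the number of active tiles), so
# the value estimate moves by N*a*(T - V) per update.

N = 12            # number of tilings (active binary features)
a = 0.25          # per-parameter learning rate, so N * a = 3
p = [0.0] * N     # one weight per active tile
T = 1.0           # fixed target

def value(p):
    # with binary features, the value is just the sum of active weights
    return sum(p)

history = []
for _ in range(5):
    delta = T - value(p)
    for i in range(N):
        p[i] += a * delta      # standard linear TD-style update
    history.append(value(p))

print(history)  # [3.0, -3.0, 9.0, -15.0, 33.0]
```

Dividing a by N (i.e., a = 0.25/12) would make the effective rate 0.25 and the sequence would instead approach T = 1.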
I apologize for any important oversights, and I'm curious what you all think.
Regards,
Hado
As you pointed out, my understanding is also that SARSA with linear function approximation under on-policy training cannot diverge. This does not mean that it converges to a fixed point; it may chatter between various policies, with the learning parameters bounded in some region, and that region could be huge. As Csaba mentioned, Gordon has a proof of "non-divergence", but his proof needs some more refinement and is not thorough (Csaba mentioned this to me quite a while ago).
But maybe Csaba can elaborate on the current status more.
Best,
Hamid
Thomas, can I ask what learning rate and how many tilings you are
using when you observe divergence?
One last remark: tile coding is one way to do linear function
approximation. However, it has the unfortunate side-effect of turning
almost any MDP into a POMDP. Usually this is not so harmful, but this
is a property of any discretization and can cause problems in some
settings. In general, at least it invalidates any convergence proof
that explicitly assumes that we are operating on an MDP. This does not
hold in general for linear function approximation: if you use a linear
function of real valued state space parameters without discretization,
you are still operating on an MDP (assuming that the original state
space parameters have the Markov property, of course).
Hope this makes sense. Otherwise, my apologies :)
Regards,
Hado
Good point!
> that explicitly assumes that we are operating on an MDP. This does not
> hold in general for linear function approximation: if you use a linear
> function of real valued state space parameters without discretization,
> you are still operating on an MDP (assuming that the original state
> space parameters have the Markov property, of course).
I assume this would also be true if we didn't deal with a linear function
of the state space parameters but of some (higher dimensional) feature
space, think basis functions (as used in least-squares policy iteration by
Lagoudakis & Parr) or kernel methods.
Unfortunately, so far no one said anything about my question: Does
everything that we're discussing here also hold in a batch-mode setting
where the set of observations doesn't change?
Thanks,
Alex
>> that explicitly assumes that we are operating on an MDP. This does not
>> hold in general for linear function approximation: if you use a linear
>> function of real valued state space parameters without discretization,
>> you are still operating on an MDP (assuming that the original state
>> space parameters have the Markov property, of course).
>
> I assume this would also be true if we didn't deal with a linear function
> of the state space parameters but of some (higher dimensional) feature
> space, think basis functions (as used in least-squares policy iteration by
> Lagoudakis & Parr) or kernel methods.
Yes, as long as then the feature space has the Markov property, for
instance, because the features are injective functions of the state
parameters and these have the Markov property. The problem with tile
coding is that the tiling function is not injective.
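As a tiny illustration of the non-injectivity point: a single coarse tiling maps distinct states to the same feature, so state information is discarded. The discretization below is hypothetical:

```python
# Two distinct states falling into the same tile: the tile-coding map
# is not injective, so information about the state is lost.
# One hypothetical 1-D tiling with tile width 0.5.

def tile_index(x, width=0.5):
    return int(x // width)

print(tile_index(0.10))  # 0
print(tile_index(0.40))  # 0  <- same feature as 0.10: states aliased
print(tile_index(0.60))  # 1
```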
> Unfortunately, so far no one said anything about my question: Does
> everything that we're discussing here also hold in a batch-mode setting
> where the set of observations doesn't change?
The problem with non-decreasing learning rates that are effectively
larger than 1 does hold in a batch setting. For instance, suppose you
have a problem with an effective learning rate of a*f'(s_t)*f(s_t) =
3, a state with initial value v_0 = 0, and an (average) observed target
of T = 1 (for instance, this is an average of Monte Carlo samples of
the return from this state, or a deterministic reward leading to a
terminal state). Then:
v_0 = 0
v_1 = v_0 + 3*(T - v_0) = 3
v_2 = 3 + 3*(1-3) = -3
v_3 = -3 + 3*(1+3) = 9
v_4 = 9 + 3*(1-9) = -15
etc.
In general,
v_{t+1} = v_t + 3*(1 - v_t) = -2v_t + 3
and therefore
| v_{t+1} | >= 2|v_t| - 3
Incidentally, when v_0 = 0, you get v_t = 1 - (-2)^t. Divergence
occurs for any effective learning rate larger than 2 (as long as v_0
is not equal to T).
As before, you can prevent this divergence by making sure the effective
learning rate is lower than 1, or by decreasing the learning rate over
time so that this eventually holds.
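A quick script (illustrative only) confirms the recurrence and its closed form:

```python
# Check of the recurrence v_{t+1} = v_t + 3*(T - v_t) with T = 1 and
# v_0 = 0, against the closed form v_t = 1 - (-2)**t.

T = 1.0
v = 0.0
vals = [v]
for _ in range(6):
    v = v + 3.0 * (T - v)
    vals.append(v)

print(vals)  # [0.0, 3.0, -3.0, 9.0, -15.0, 33.0, -63.0]
print([1 - (-2) ** t for t in range(7)] == [int(x) for x in vals])  # True
```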
Best,
Hado
Of course that's a requirement.
>> Unfortunately, so far no one said anything about my question: Does
>> everything that we're discussing here also hold in a batch-mode setting
>> where the set of observations doesn't change?
> The problem with non-decreasing learning rates that effectively are
> larger than 1 does hold in a batch setting. For instance, suppose you
> have a problem with an effective learning rate of a*f'(s_t)*f(s_t) =
> 3, a state with initial value v_0 = 0, and an (average) observed target
> of T = 1 (for instance, this is an average of Monte Carlo samples of
[...]
> As before, you can prevent this diverge by making sure the effective
> learning rate is lower than 1, or by decreasing the learning rate such
> that this inevitably holds.
Thanks for the explanation, but I think this was already clear to me.
My question was more about chattering, which as we've learned can
look like divergence if the chattering region is large. I think this is
something other than the divergence introduced by large learning rates.
Gordon reported chattering for Sarsa, which obviously is an online method
where each new observation potentially changes the policy. Now my question
is whether this can also occur in a batch or offline setting, where for
instance the policy changes between iterations of a fitted policy
iteration method.
Cheers,
Alex
> Yes, as long as then the feature space has the Markov property, for
> instance, because the features are injective functions of the state
> parameters and these have the Markov property. The problem with tile
> coding is that the tiling function is not injective.
Any useful function approximation would transform the MDP into a POMDP,
since interesting features are never injective. Imagine Tetris with an
injective feature mapping: the feature space would consist of an
astronomical number of states, which would render feature extraction
useless.
> There is a big difference between sarsa and q-learning for linear function approximation. For the latter there are known examples where it diverges, whereas for sarsa there are none. For sarsa there is not yet a complete useful convergence proof, but there is a proof on non-divergence.
Is there really a proof of non-divergence? We know that a necessary
condition for convergence of TD/Sarsa(lambda) with a linear architecture
is that the state visitation frequency match the stationary
distribution of the model. The chattering effect in Sarsa is due to such
distributions changing abruptly under greedy policy updates, but does
this prove non-divergence? It could be that some MDP and
some actor architecture "emulate" the conditions that cause
divergence of TD, thereby establishing divergence of Sarsa too. So,
either I am not aware of the paper proving non-divergence of Sarsa, or
we don't have such a proof yet, or Thomas has found a counter-example.
Nikos
I don't agree. For Tetris, maybe you are correct. But injective
mappings can be useful, especially in problems with continuous or
large ordinal state spaces rather than large nominal spaces. For
instance, in the cart pole or the mountain car, good (linear and
non-linear) controllers can be learned with the original state space
as input (i.e., the trivially injective identity mapping is used). See
for instance [1] (and many other papers).
Anyway, I'm not claiming one should always try to use an injective
feature mapping, just that one should be aware that otherwise you are
operating on a POMDP and you lose the MDP-specific convergence
guarantees.
Best,
Hado
[1] Hado van Hasselt and Marco Wiering (2009). "Using Continuous
Action Spaces to Solve Discrete Problems". Proceedings of the
International Joint Conference on Neural Networks (IJCNN 2009),
Atlanta, GA, USA, 2009.
--Tom
Thomas G. Dietterich Voice: 541-737-5559
School of EECS FAX: 541-737-1300
1148 Kelley Engineering Center URL: http://eecs.oregonstate.edu/~tgd
Oregon State University, Corvallis, OR 97331-5501