Divergence with SARSA and tile coding

twan...@googlemail.com

Nov 4, 2010, 8:02:23 AM
to Reinforcement Learning Mailing List
Hi,

I am wondering whether anybody knows about divergence issues with SARSA in
combination with function approximation, especially tile coding.

I found myself in the following situation:

1) Using a non-normalized net with only one tiling results in no
divergence issues.
2) Using a non-normalized net with multiple tilings results in
divergence issues.
3) Using a normalized net with one or multiple tilings results in no
divergence issues at all.

I have read that off-policy learning algorithms like Q-learning can lead
to divergence, but nothing about divergence in the case of on-policy
learning algorithms like SARSA.
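
To make the setup concrete, here is a minimal Python sketch of what I mean
by normalized versus non-normalized tile-coding features (illustrative
only, not my actual code; I take "normalized" to mean that the active
features are scaled to sum to one):

import numpy as np

def tile_features(state, n_tilings, bins, lows, highs, normalize=False):
    # Binary tile-coding features: one uniform grid per tiling, each
    # tiling shifted by a fraction of a cell, one active tile per tiling.
    state = np.asarray(state, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    dims = len(state)
    tiles = bins ** dims
    phi = np.zeros(n_tilings * tiles)
    for k in range(n_tilings):
        offset = (k / n_tilings) * (highs - lows) / bins
        idx = np.floor((state - lows + offset) / (highs - lows) * bins)
        idx = np.clip(idx, 0, bins - 1).astype(int)
        flat = int(np.ravel_multi_index(tuple(idx), (bins,) * dims))
        phi[k * tiles + flat] = 1.0
    if normalize:
        phi /= n_tilings   # active features now sum to 1 instead of n_tilings
    return phi

# Non-normalized: the features sum to the number of tilings.
print(tile_features([0.3, -0.7], 8, 10, [-1, -1], [1, 1]).sum())        # 8.0
# Normalized: the features sum to 1 regardless of the number of tilings.
print(tile_features([0.3, -0.7], 8, 10, [-1, -1], [1, 1], True).sum())  # 1.0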

Any help is appreciated.

Bye,
Thomas

Csaba Szepesvari

Nov 4, 2010, 10:48:15 AM
to rl-...@googlegroups.com, twan...@googlemail.com
Hi Thomas,

Yes, the same issues arise in connection with SARSA and Q-learning; see, e.g.,
G. Gordon (1996). Chattering in SARSA(λ). CMU Learning Lab Internal Report.
Available at www.cs.cmu.edu/~ggordon.

Positive results about SARSA and function approximation are not many,
and come under stringent conditions, see e.g.

Perkins, T. and Precup, D. (2003). A convergent form of approximate
policy iteration. In Becker, S., Thrun, S., and Obermayer, K., editors,
Advances in Neural Information Processing Systems 15, pages 1595--1602,
Cambridge, MA, USA. MIT Press.

Van Roy, B. (2006). Performance loss bounds for approximate value
iteration with state aggregation. Mathematics of Operations Research,
31(2):234--244.

Cheers,
- Csaba

twan...@googlemail.com

Nov 4, 2010, 11:07:02 AM
to Reinforcement Learning Mailing List


On 4 Nov., 15:48, Csaba Szepesvari <szepe...@ualberta.ca> wrote:
> Hi Thomas,
>
> Yes, the same issues arise in connection with SARSA and Q-learning; see, e.g.,
> G. Gordon (1996). Chattering in SARSA(λ). CMU Learning Lab Internal Report.
> Available at www.cs.cmu.edu/~ggordon.
>
> Positive results about SARSA and function approximation are not many,
> and come under stringent conditions, see e.g.
>

Thanks for the quick answer. After scanning through the paper I see
that the Q-functions can oscillate. Is it possible that the Q-function
goes to infinity, as in my case, too?

Thanks again,
Thomas

Yaxin Liu

Nov 4, 2010, 11:38:44 AM
to rl-...@googlegroups.com
Are you not discounting?

Thomas


Csaba Szepesvari

Nov 4, 2010, 11:39:59 AM
to rl-...@googlegroups.com, Yaxin Liu
Hi,

On 10-11-04 9:38 AM, Yaxin Liu wrote:
> Are you not discounting?

It does not matter :)
The divergence comes from the dynamics of the learning process.

- Csaba


Csaba Szepesvari

Nov 4, 2010, 11:39:23 AM
to rl-...@googlegroups.com, twan...@googlemail.com
Hi Thomas,

Yes, very much.
- Csaba


> Is it possible that the Q-function goes to infinity, as in my case, too?

Alexander Hans

Nov 4, 2010, 11:58:48 AM
to rl-...@googlegroups.com
Hi,

> On 10-11-04 9:38 AM, Yaxin Liu wrote:
>> Are you not discounting?
>
> It does not matter:)
> The divergence is coming from the dynamics of the learning process.

Does this only happen in an online setting, where one uses every single
observation to update the Q-function (as in classical Q-learning or SARSA),
or can the divergence and chattering also occur in a batch-mode setting
where the set of observations does not change?


Cheers,

Alex


Frans Oliehoek

Nov 4, 2010, 11:59:34 AM
to rl-...@googlegroups.com
Hi all,

Now I'm confused. I just thought I understood the chattering thing, but
apparently Sarsa with linear function approximation (that is what we are
talking about, right?) can also completely diverge? So when does it
chatter and when does it converge?

Sorry to be a nitwit (working on that...).

Best,
Frans

Hamid Reza Maei

Nov 4, 2010, 12:20:16 PM
to rl-...@googlegroups.com
Hello Frans,

The connection between chattering and divergence can be seen in the fact that the chattering region can be extremely large. In other words, the result says that SARSA+LFA may chatter within some region, but it does not specify the size of that region. Maybe some bounds could be derived in some cases; I think one major factor is how rapidly the behavior policy changes: the slower it changes, the closer the process is to convergent.

Cheers,
Hamid

twan...@googlemail.com

Nov 4, 2010, 12:28:22 PM
to Reinforcement Learning Mailing List


On 4 Nov., 16:39, Csaba Szepesvari <szepe...@ualberta.ca> wrote:

> yes, very much.

Thanks again for the quick response. I spent a lot of time on this
issue and thought it was a coding bug. Now I am convinced that is
most likely not the case (especially because the very same code
works with the PoleSwingup / MountainCar / ... environments).

I observed that the divergence issue gets worse with the number of
overlapping (active) tiles for a given state, i.e., the number of
tilings used. Do you know why?

Additionally, do you have any idea why the divergence issue doesn't
arise in a normalized network setup?

In the non-divergent case, I suppose it is possible that the Q-function
converges to a value bigger than the biggest reward received, since the
Q-function can show transient divergence, oscillations, and so on.
Is this assumption correct?


Again thanks for your time and help,
Thomas

Rich Sutton

Nov 4, 2010, 12:12:59 PM
to rl-...@googlegroups.com
With respect to Csaba, I would say that Thomas is right. There is a big difference between SARSA and Q-learning with linear function approximation. For the latter there are known examples where it diverges, whereas for SARSA there are none. For SARSA there is not yet a complete, useful convergence proof, but there is a proof of non-divergence. So any apparent example of divergence would be inconsistent with known theory. Csaba knows all this, but his remarks might have suggested otherwise.
R

Frans Oliehoek

Nov 4, 2010, 12:23:38 PM
to rl-...@googlegroups.com
Aha, so you say it really doesn't diverge; it is just 'chattering the
wrong way for a long time'? I guess that makes sense. Thanks!
-Frans

twan...@googlemail.com

Nov 4, 2010, 1:09:50 PM
to Reinforcement Learning Mailing List


On 4 Nov., 17:12, Rich Sutton <rsut...@ualberta.ca> wrote:
> With respect to Csaba, I would say that Thomas is right. There is a big
> difference between SARSA and Q-learning with linear function approximation.
> For the latter there are known examples where it diverges, whereas for
> SARSA there are none. [...] So any apparent example of divergence would
> be inconsistent with known theory.

I am a little bit confused now. Are you saying divergence isn't
possible for SARSA in combination with tile coding?

Thomas

Brian Tanner

Nov 4, 2010, 1:26:15 PM
to rl-...@googlegroups.com
> I observed that the divergence issue gets worse with the number of
> overlapping (active) tiles for a given state, i.e., the number of
> tilings used. Do you know why?

Apologies in case this is obvious to you already, but are you
adjusting your learning rate to accommodate the number of tilings? A
standard trick is to take your desired step size and divide it by the
number of tilings. In my personal experience "bad things" can happen
if the step size is too large, so if you're not compensating for the
multiple tilings it could just be a case of that problem.
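
A minimal sketch of what I mean, assuming binary tile features with exactly
one active tile per tiling (all names and numbers here are made up for
illustration):

# With N tilings, N features are active per state, so a single update
# moves the value estimate by N times the per-weight step. Dividing the
# desired step size by N keeps the effective step on the value intact.
desired_alpha = 0.2                   # step size you would use with one tiling
num_tilings = 8
alpha = desired_alpha / num_tilings   # per-weight step size

def sarsa_update(weights, active_tiles, target, alpha):
    # One linear SARSA update with binary tile features.
    # active_tiles: indices of the active features (one per tiling);
    # target: r + gamma * Q(s', a'), computed by the caller.
    q = sum(weights[i] for i in active_tiles)   # current Q(s, a)
    delta = target - q
    for i in active_tiles:
        weights[i] += alpha * delta
    return weights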


--
Brian Tanner
Ph.D Student
University of Alberta
br...@tannerpages.com

Csaba Szepesvari

Nov 4, 2010, 4:11:23 PM
to rl-...@googlegroups.com, twan...@googlemail.com

No, he is saying we do not know: there is no specific example published
in the literature that would show the divergence of SARSA with linear
function approximation.

- Csaba

Frans Oliehoek

Nov 4, 2010, 4:22:15 PM
to rl-...@googlegroups.com
Hi Csaba, Rich,

Like Thomas, I am a bit puzzled. To me it seems that the statements:

A) "For SARSA [with linear function approximation] there is not yet a
complete, useful convergence proof, but there is a proof of non-divergence."

and

B) "Are you saying divergence isn't possible for SARSA in combination
with tile coding?"

C) "No, he is saying we do not know: there is no specific example
published in the literature that would show the divergence of SARSA with
linear function approximation."

are contradictory? To be completely clear: I am interpreting (A) as
saying "there is a proof that SARSA with linear function approximation
does not diverge", which would imply the answer "Yes" to question (B)?

Am I missing something?

Best,
Frans

Hado van Hasselt

Nov 4, 2010, 5:59:46 PM
to rl-...@googlegroups.com
Okay, I'll give it a try. Please forgive me for blatant errors.

The update for tile coding is basically:

p_{t+1} = p_t + a * (T_t - p_t' f(s_t)) * f(s_t)

Here p_t is the parameter vector, f(s_t) is the feature vector at time
t, a is the learning rate, and x' denotes the transpose of a vector x.
I'm using T_t to denote the target, which may be a Monte Carlo sample,
a SARSA temporal-difference target, a Q-learning temporal-difference
target, or whatever. Then, the value at time t+1 is

V_{t+1}(s_t) = p_{t+1}' f(s_t)
             = ( p_t + a * (T_t - p_t' f(s_t)) * f(s_t) )' f(s_t)
             = V_t(s_t) + a * (f(s_t)' f(s_t)) * (T_t - V_t(s_t))

It would seem to me that the effective learning rate (on the value, not
on the parameters) is then a * f(s_t)' f(s_t). With binary tile coding,
exactly one tile per tiling is active, so f(s_t)' f(s_t) = N and the
effective learning rate equals N*a, where N is the number of tilings.
In other words, if you use multiple tilings, you might effectively be
using a learning rate larger than one. With such a learning rate it is
no wonder that you diverge, and this does not have anything to do with
chattering per se.

This seems to confirm Brian's intuition that dividing by the number of
tilings helps you out: if you divide the learning rate by the number of
tilings, keeping it below 1/N, the effective learning rate stays below
one. It also might explain why the probability of divergence goes up as
you use more tilings, and why no divergence occurs when you normalize.
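
A tiny numerical sketch of this effect (made-up numbers; binary features
with one active tile per tiling, and a fixed target for simplicity):

import numpy as np

N, a = 8, 0.2                  # number of tilings and raw learning rate
f = np.zeros(80); f[:N] = 1.0  # one active tile per tiling
p = np.zeros_like(f)           # parameter vector
T = 1.0                        # fixed target for this state

for t in range(4):
    v = p @ f                  # current value estimate
    p += a * (T - v) * f       # the update above
    print(p @ f)               # ~1.6, 0.64, 1.216, ...: overshoots T each step

The value overshoots the target at every step because the effective rate
a*N = 1.6 exceeds one; with a bootstrapped target the overshoot can feed
back on itself and blow up.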

I apologize for any important oversights, and I'm curious what you all think.

Regards,
Hado

Hamid Reza Maei

Nov 4, 2010, 6:02:10 PM
to rl-...@googlegroups.com
Hello Frans,

As you pointed out, my understanding is also that SARSA with linear function approximation under on-policy training cannot diverge. This does not mean that it converges to a fixed point; rather, it may chatter between various policies while the learning parameters stay bounded in some region, and that region could be huge. Gordon, as Csaba mentioned, has a proof of "non-divergence", but his proof needs some more refinement and does not fully go through (Csaba mentioned this to me quite a while ago).

But maybe Csaba can elaborate on the current status more.

Best,
Hamid

Hado van Hasselt

Nov 5, 2010, 5:03:25 AM
to rl-...@googlegroups.com
By the way, my former argument only holds when you use non-decreasing
learning rates. Otherwise, the fact that a -> 0 ensures that the
effective learning rate exceeds one only on a finite number of learning
steps at the beginning. All convergence proofs that I know of assume
the Robbins-Monro properties of the learning rates, but in practice
other (e.g., constant) learning rates are often used. With constant
learning rates, convergence to a region can sometimes be proven, but in
other cases actual divergence of the parameters can occur.

Thomas, can I ask what learning rate and how many tilings you are
using when you observe divergence?

One last remark: tile coding is one way to do linear function
approximation. However, it has the unfortunate side-effect of turning
almost any MDP into a POMDP. Usually this is not so harmful, but this
is a property of any discretization and can cause problems in some
settings. In general, at least it invalidates any convergence proof
that explicitly assumes that we are operating on an MDP. This does not
hold in general for linear function approximation: if you use a linear
function of real valued state space parameters without discretization,
you are still operating on an MDP (assuming that the original state
space parameters have the Markov property, of course).

Hope this makes sense. Otherwise, my apologies :)

Regards,
Hado

Alexander Hans

Nov 5, 2010, 5:17:58 AM
to rl-...@googlegroups.com
> One last remark: tile coding is one way to do linear function
> approximation. However, it has the unfortunate side-effect of turning
> almost any MDP into a POMDP.

Good point!


> that explicitly assumes that we are operating on an MDP. This does not
> hold in general for linear function approximation: if you use a linear
> function of real valued state space parameters without discretization,
> you are still operating on an MDP (assuming that the original state
> space parameters have the Markov property, of course).

I assume this would also be true if we didn't deal with a linear function
of the state-space parameters but of some (higher-dimensional) feature
space; think basis functions (as used in least-squares policy iteration by
Lagoudakis & Parr) or kernel methods.


Unfortunately, so far no one has said anything about my question: does
everything that we're discussing here also hold in a batch-mode setting
where the set of observations doesn't change?


Thanks,

Alex


twan...@googlemail.com

Nov 5, 2010, 5:54:49 AM
to Reinforcement Learning Mailing List


On 5 Nov., 10:03, Hado van Hasselt <hado.vanhass...@gmail.com> wrote:
> By the way, my former argument only holds when you use non-decreasing
> learning rates. Otherwise, the fact that a -> 0 ensures that the
> effective learning rate exceeds one only on a finite number of learning
> steps at the beginning. All convergence proofs that I know of assume
> the Robbins-Monro properties of the learning rates, but in practice
> other (e.g., constant) learning rates are often used. With constant
> learning rates, convergence to a region can sometimes be proven, but in
> other cases actual divergence of the parameters can occur.

Thanks, that's another way to deal with the divergence issue.

> Thomas, can I ask what learning rate and how many tilings you are
> using when you observe divergence?

My learning rate is constant, a = 0.2, and I observed that using at
least 8 tilings results in divergence; by your argument the effective
learning rate is then N*a = 8 * 0.2 = 1.6 > 1. It seems that your
argument is correct. I assume that any discretization can lead to the
same problem if the learning rate times the number of overlapping
(active) discretization cells is bigger than 1.

Thanks again.

Hado van Hasselt

Nov 5, 2010, 5:55:29 AM
to rl-...@googlegroups.com
Hi Alex,

>> that explicitly assumes that we are operating on an MDP. This does not
>> hold in general for linear function approximation: if you use a linear
>> function of real valued state space parameters without discretization,
>> you are still operating on an MDP (assuming that the original state
>> space parameters have the Markov property, of course).
>
> I assume this would also be true if we didn't deal with a linear function
> of the state space parameters but of some (higher dimensional) feature
> space, think basis functions (as used in least-squares policy iteration by
> Lagoudakis & Parr) or kernel methods.

Yes, as long as the feature space then has the Markov property, for
instance because the features are injective functions of the state
parameters and those have the Markov property. The problem with tile
coding is that the tiling function is not injective.
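
A toy illustration of the non-injectivity point (my own made-up example):
two distinct states land in the same tile, so the features cannot
distinguish them, and any difference in their dynamics becomes hidden state.

def tile_index(x, low=0.0, high=1.0, bins=10):
    # Index of the tile containing scalar state x in one uniform tiling.
    i = int((x - low) / (high - low) * bins)
    return min(max(i, 0), bins - 1)

print(tile_index(0.41), tile_index(0.49))   # 4 4: same feature, different states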

> Unfortunately, so far no one said anything about my question: Does
> everything that we're discussing here also hold in a batch-mode setting
> where the set of observations doesn't change?

The problem with non-decreasing learning rates that are effectively
larger than 1 does hold in a batch setting. For instance, suppose you
have a problem with an effective learning rate of a * f(s_t)' f(s_t) = 3,
a state with initial value v_0 = 0, and an (average) observed target of
T = 1 (for instance, an average of Monte Carlo samples of the return
from this state, or a deterministic reward leading to a terminal state).
Then:

v_0 = 0
v_1 = 0 + 3*(1 - 0) = 3
v_2 = 3 + 3*(1 - 3) = -3
v_3 = -3 + 3*(1 - (-3)) = 9
v_4 = 9 + 3*(1 - 9) = -15
etc.

In general,

v_{t+1} = v_t + 3*(1 - v_t) = -2*v_t + 3

and therefore

|v_{t+1}| >= 2*|v_t| - 3

Incidentally, when v_0 = 0, this gives v_t = 1 - (-2)^t. Divergence
occurs for any effective learning rate larger than 2 (as long as v_0
is not equal to T).
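
For concreteness, the recursion in a few lines of Python (illustrative):

# v_{t+1} = v_t + a_eff*(T - v_t); with a_eff = 3 and T = 1 this is
# v_{t+1} = -2*v_t + 3, so the distance to T doubles every step.
T, a_eff, v = 1.0, 3.0, 0.0
for t in range(6):
    v += a_eff * (T - v)
    print(v)               # 3, -3, 9, -15, 33, -63, i.e., v_t = 1 - (-2)**t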

As before, you can prevent this divergence by making sure the effective
learning rate is lower than one, or by decreasing the learning rate over
time so that this eventually holds.

Best,
Hado

Alexander Hans

Nov 5, 2010, 6:14:02 AM
to rl-...@googlegroups.com
>>> that explicitly assumes that we are operating on an MDP. This does not
>>> hold in general for linear function approximation: if you use a linear
>>> function of real valued state space parameters without discretization,
>>> you are still operating on an MDP (assuming that the original state
>>> space parameters have the Markov property, of course).
>> I assume this would also be true if we didn't deal with a linear
>> function of the state space parameters but of some (higher dimensional)
>> feature space, think basis functions (as used in least-squares policy
>> iteration by Lagoudakis & Parr) or kernel methods.
> Yes, as long as then the feature space has the Markov property, for
> instance,

Of course that's a requirement.


>> Unfortunately, so far no one said anything about my question: Does
>> everything that we're discussing here also hold in a batch-mode setting
>> where the set of observations doesn't change?
> The problem with non-decreasing learning rates that are effectively
> larger than 1 does hold in a batch setting. For instance, suppose you
> have a problem with an effective learning rate of a * f(s_t)' f(s_t) = 3,
> a state with initial value v_0 = 0, and an (average) observed target
> of T = 1 (for instance, an average of Monte Carlo samples of

[...]


> As before, you can prevent this diverge by making sure the effective
> learning rate is lower than 1, or by decreasing the learning rate such
> that this inevitably holds.

Thanks for the explanation, but I think that this was already clear to me.
My question was more in regard to chattering, which, as we've learned, can
look like divergence if the region within which the process chatters is
large. I think this is something other than the divergence introduced by
large learning rates.

Gordon reported chattering for Sarsa, which obviously is an online method
where each new observation potentially changes the policy. Now my question
is whether this can also occur in a batch or offline setting, where for
instance the policy changes between iterations of a fitted policy
iteration method.


Cheers,

Alex


Nikos Vlassis

Nov 5, 2010, 8:12:54 AM
to rl-...@googlegroups.com
Hi,

> Yes, as long as then the feature space has the Markov property, for
> instance, because the features are injective functions of the state
> parameters and these have the Markov property. The problem with tile
> coding is that the tiling function is not injective.

Any useful function approximation would transform the MDP to a POMDP,
since interesting features are never injective. Imagine Tetris with an
injective feature mapping: the feature space would consist of an
astronomical number of states, which would render feature extraction
useless.

> There is a big difference between SARSA and Q-learning with linear function approximation. For the latter there are known examples where it diverges, whereas for SARSA there are none. For SARSA there is not yet a complete, useful convergence proof, but there is a proof of non-divergence.

Is there really a proof of non-divergence? We know that a necessary
condition for convergence of TD/Sarsa(lambda) with a linear architecture
is that the state-visitation frequencies match the stationary
distribution of the model. The chattering effect in Sarsa is due to such
distributions changing abruptly under greedy policy updates, but does
this prove non-divergence? It could be that there is some MDP and some
actor architecture that "emulate" the conditions that cause divergence
of TD, hence establishing divergence of Sarsa too. So either I am not
aware of the paper proving non-divergence of Sarsa, or we don't have
such a proof yet, or Thomas has found a counter-example.

Nikos

Hado van Hasselt

Nov 5, 2010, 9:14:47 AM
to rl-...@googlegroups.com
> Any useful function approximation would transform the MDP to a POMDP,
> since interesting features are never injective. Imagine Tetris with an
> injective feature mapping: the feature space would consist of an
> astronomical number of states, which would render feature extraction
> useless.

I don't agree. For Tetris, maybe you are correct. But injective
mappings can be useful, especially in problems with continuous or
large ordinal state spaces rather than large nominal spaces. For
instance, in cart pole or mountain car, good (linear and non-linear)
controllers can be learned with the original state space as input
(i.e., with the trivially injective identity mapping). See for
instance [1] (and many other papers).

Anyway, I'm not claiming one should always try to use an injective
feature mapping, just that one should be aware that otherwise you are
operating on a POMDP and you lose the MDP-specific convergence
guarantees.

Best,
Hado

[1] Hado van Hasselt and Marco Wiering (2009). "Using Continuous
Action Spaces to Solve Discrete Problems". Proceedings of the
International Joint Conference on Neural Networks (IJCNN 2009),
Atlanta, GA, USA, 2009.

Thomas Dietterich

Nov 6, 2010, 7:08:54 AM
to rl-...@googlegroups.com
In my experience, chattering can result from having the learning rate too
large. The more stochastic the actions, the smaller the learning rate
needs to be. I've worked on problems where, even if SARSA was initialized
with the correct value function, a learning rate > 10^-5 would lead to a
random walk away from the correct value function. A lot of RL toy problems
have very little *relevant* stochasticity (i.e., stochastic outcomes that
lead to very different rewards).

--Tom

Thomas G. Dietterich Voice: 541-737-5559
School of EECS FAX: 541-737-1300
1148 Kelley Engineering Center URL: http://eecs.oregonstate.edu/~tgd
Oregon State University, Corvallis, OR 97331-5501

mt80

Dec 17, 2010, 4:49:17 AM
to Reinforcement Learning Mailing List
I also agree with Thomas's statement. In my experiments with the
mountain-car problem, I had to set alpha = 0.05 when using full
exploration with epsilon-greedy (epsilon = 1.0) in order to avoid
divergence. For learning I used off-policy Q-learning (without
eligibility traces) in combination with tile-coding approximation of
the value function (5 tilings, each 10x10).

Michel


