Question about Q-Learning in large state space


Shashank Bhatia

Jun 13, 2011, 7:45:53 PM
to rl-...@googlegroups.com
Hi,

I am trying to solve a problem which has a huge state space, using Q-Learning.

Let me give you an idea of the large state-space:

There are 5 variables in the state space; their ranges and the discretization I am trying to use are summarized here:

State variable                     S1         S2          S3         S4       S5
Range of values                    [0,50]     [-25,25]    [0,360]    [0,1]    [0,1]
Resolution used                    5          5           2          0.1      0.1
No. of discretised values          10         10          180        10       10

The target state is defined by specific values of these state variables.

The problem thus has 5 continuous state variables, which I have discretized at what I hope are appropriate resolutions.

The action space is simple: it contains 8 actions, which increase/decrease three action variables A1, A2, A3.
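For concreteness, a quick back-of-the-envelope count of the discretised space implied by the table above (a sketch, using the ranges and resolutions as given):

```python
# Count the discretised state-action space from the table above.
ranges = {"S1": (0, 50, 5), "S2": (-25, 25, 5), "S3": (0, 360, 2),
          "S4": (0, 1, 0.1), "S5": (0, 1, 0.1)}

n_states = 1
for lo, hi, res in ranges.values():
    n_states *= round((hi - lo) / res)   # number of bins per dimension

n_actions = 8
print(n_states)              # 1,800,000 discrete states
print(n_states * n_actions)  # 14,400,000 state-action pairs
```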


With this huge state space, I am wondering whether Q-learning will be successful at all. Will it be able to learn a policy?


I am also wondering how I can store Q-values for such a huge state-action space.

I thought radial basis functions would help, but when I try to implement them, I cannot see how variables with different resolutions can be captured in a radial basis function.


Could you please direct me to appropriate techniques/examples/code?

Thanks and regards,

Shashank

  

Hado van Hasselt

Jun 14, 2011, 1:51:18 AM
to rl-...@googlegroups.com
Hi Shashank,

In principle, the Q-learning algorithm itself can handle large state
spaces. However, in your case the discretized state-action space is so
large (roughly 14 million possible state-action combinations) that you
will likely need far too many samples to learn anything useful.

I would suggest that you do not discretize the state space. You will
then need to choose a suitable function approximation technique. In
reinforcement learning, a commonly used technique is to evenly place a
number of radial basis functions or triangular functions throughout
the state space. These functions can then be interpreted as (local)
features, and the Q-value of each action can be approximated by a
linear combination of these features. For more in-depth information
about this, you could check out the book by Busoniu et al. [1]. If you
have trouble specifying RBFs for dimensions with different
resolutions, does it help to first scale all the dimensions (e.g.,
scale all of them to [0,1])?
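As a rough sketch of this idea (all names, grid sizes, and parameters below are illustrative assumptions, not from any particular library): scale the state into the unit hypercube, place Gaussian RBFs on an even grid, and approximate each action's Q-value linearly in those features.

```python
import numpy as np

# Linear Q-value approximation with evenly placed Gaussian RBFs,
# after scaling every state dimension to [0, 1].
state_low  = np.array([0.0, -25.0, 0.0, 0.0, 0.0])
state_high = np.array([50.0, 25.0, 360.0, 1.0, 1.0])

def scale(s):
    """Map a raw state onto the unit hypercube [0, 1]^5."""
    return (np.asarray(s) - state_low) / (state_high - state_low)

# RBF centres on a coarse grid in the scaled space (3 per dimension
# here to keep the example small; 3^5 = 243 features).
grid = np.linspace(0.0, 1.0, 3)
centers = np.array(np.meshgrid(*[grid] * 5)).reshape(5, -1).T
width = 0.5  # shared RBF width, in scaled units

def features(s):
    d2 = np.sum((scale(s) - centers) ** 2, axis=1)
    return np.exp(-d2 / (2 * width ** 2))

n_actions = 8
theta = np.zeros((n_actions, len(centers)))  # one weight vector per action

def q(s, a):
    return theta[a] @ features(s)

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step on the linear approximator."""
    td_error = r + gamma * max(q(s_next, b) for b in range(n_actions)) - q(s, a)
    theta[a] += alpha * td_error * features(s)
```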

Alternatively, you could use a non-linear function approximation
technique directly on the state variables. This comes with fewer
convergence guarantees than the former method, but it may work well in
practice. You can get some C++ code from my homepage [2] that
implements Q-learning (amongst other algorithms) with tables (for
discrete problems) and with neural networks (for continuous problems);
the neural network code is included.

I am a bit curious about your action space, by the way. If there are
three 'action variables' that can be increased or decreased, should
their current values not be included in the state space? And do I
understand correctly that you cannot set the action variables
directly, but can only increase or decrease them?

[1] http://www.dcsc.tudelft.nl/rlbook/
[2] http://homepages.cwi.nl/~hasselt/code.html

Hope this helps!

Best,
Hado

> --
> You received this message because you are subscribed to the "Reinforcement
> Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en

brunocs

Jun 14, 2011, 8:17:41 AM
to Reinforcement Learning Mailing List
Dear Shashank,

There are several methods for doing value function approximation in
RL, and the main differences between them lie in which basis functions
you use. You could try, for instance, the Fourier Basis: a fairly
simple method that is very easy to understand and to code. The most
recent paper on it is "Value Function Approximation in Reinforcement
Learning using the Fourier Basis" by Konidaris et al.
(http://people.csail.mit.edu/gdk/pubs/fourier.pdf). If I'm not
mistaken, they also provide Java code for it.
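A minimal sketch of the idea from that paper (illustrative code, not the authors' implementation): each feature is cos(pi * c . x) for an integer coefficient vector c, with the state x scaled into [0, 1]^d.

```python
import numpy as np
from itertools import product

def fourier_basis(order, dim):
    """All coefficient vectors with entries in {0, ..., order}."""
    coeffs = np.array(list(product(range(order + 1), repeat=dim)))
    def phi(x):
        # One cosine feature per coefficient vector.
        return np.cos(np.pi * coeffs @ np.asarray(x))
    return phi, len(coeffs)

phi, n_features = fourier_basis(order=3, dim=2)
print(n_features)        # (3+1)^2 = 16 features
print(phi([0.0, 0.0]))   # all ones, since cos(0) = 1
```

The only knob is `order`, the maximum frequency, which matches the point made later in the thread that you only choose how "spiky" a value function you expect.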

Hope that helps,
Bruno

Shashank Bhatia

Jun 14, 2011, 2:08:47 AM
to rl-...@googlegroups.com
Hi Hado,

Thanks for the detailed response and pointers! They are definitely helpful.



To give you some more details about the problem:

The actions A1, A2, A3 control the agent's movement (by changing translation speed, direction of motion, and rotational speed).

The states S1, S2, S3 describe the current position of the agent relative to a fixed goal position.

States S4 and S5 are simply the current values of A1 and A3 (translation speed and rotation speed).

The goal of the agent is to reach specific values of S1, S2, S3, with A1 = S4 = 0 and A3 = S5 = 0.

Currently I am considering using CMAC for the function approximation.



To clarify more about the action space:

I am able to set the values of the 3 parameters directly; however, since they are also continuous, that would only increase the complexity of the learning process.

Therefore I chose to use increase/decrease-type actions for each action variable, leaving only 8 actions.

You are correct that the current values of the action variables are included in the state space.

Yes, they are included in the form of S4 and S5. I might also consider including a combination S4*S5 to further reduce the space.



Thanks again,

Cheers, 

Shashank

Hado van Hasselt

Jun 17, 2011, 5:36:21 AM
to rl-...@googlegroups.com
Hi Shashank,

CMAC also discretizes the state space, but it does so more efficiently
than a naive discretization. If you can generate enough samples, this
may work fine. Discretizing the state and action space as you propose
(for instance by using CMAC as function approximation), is in any case
a good start and may yield good enough results, depending on your
problem.

If you would like to look into this, you might get better results if
you choose an action directly. This would simplify your state space,
because dimensions S4 and S5 are no longer needed, but as you already
noted, it would complicate your action space. However, there are some
good algorithms for problems with continuous action spaces. I've
gotten some good results with an actor-critic algorithm called Cacla
[1] myself. This algorithm allows you to train a function approximator
to output a continuous action for each state. I've used it with neural
networks and with linear controllers, with promising results. Of
course, your mileage may differ. Other algorithms that can be used for
such continuous-action settings include other actor-critic algorithms,
such as the class of natural actor-critic algorithms [2,3].
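A rough sketch of the Cacla-style update with linear function approximators may make the idea concrete. The defining rule (per the Cacla paper referenced below) is that the actor only learns from an explored action when the TD error is positive; the feature map and all parameters here are illustrative assumptions.

```python
import numpy as np

n_features, n_action_dims = 10, 3
v_weights = np.zeros(n_features)                 # critic: V(s) = w . phi(s)
actor_w = np.zeros((n_action_dims, n_features))  # actor:  a(s) = A . phi(s)

def cacla_update(phi_s, a_explored, r, phi_next,
                 alpha=0.1, beta=0.1, gamma=0.99):
    """One step; a_explored is the actor's output plus exploration noise."""
    td_error = r + gamma * v_weights @ phi_next - v_weights @ phi_s
    v_weights[:] = v_weights + alpha * td_error * phi_s  # critic update
    if td_error > 0:                  # Cacla's rule: only move the actor
        a_pred = actor_w @ phi_s      # towards actions that did better
        for i in range(n_action_dims):
            actor_w[i] += beta * (a_explored[i] - a_pred[i]) * phi_s
```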

I am currently writing a survey chapter on reinforcement-learning
methods in continuous spaces, which includes a discussion of methods
for problems with continuous actions. It also includes a short
discussion of CMAC (or tile coding, as it is also called). If you
want, I could send you a draft copy.

If you have any other questions, feel free to ask. Hopefully, we'll be
able to help :)

[1] Hado van Hasselt and Marco Wiering (2009). "Using Continuous
Action Spaces to Solve Discrete Problems". Proceedings of the
International Joint Conference on Neural Networks, IJCNN 2009,
Atlanta, GA, USA, 2009.
[2] Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing
71(7-9):1180–1190
[3] Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural
actor-critic algorithms. Automatica 45(11):2471–2482

Best,
Hado

Shashank Bhatia

Jun 18, 2011, 4:56:53 AM
to rl-...@googlegroups.com
Thanks everyone for the insights.

@Hado: I would really appreciate it if you could share the draft document you are currently writing.

@Brunocs, Damien: Thank you for pointing me to the Fourier Basis and Fitted Q-Iteration.


The question that remains for me is which method works best for the kind of problem I have.

I wonder whether anyone has already tried different function approximators (FAs) on a problem in a continuous domain.
Even if someone has, the applicability of these FAs would presumably depend on the benchmark problem chosen for evaluation.

I don't think there are guidelines of the form: for a state space of type sa1, FA f1 works, and for sa2, f2 works.

I guess these are questions that everyone faces at some point. If so, how are they answered?

Is it trial and error that helps? Or are there analytical methods?


As of now, this is what I am going to do:

1. Simplify my state space.
2. Use CMAC with overlapping tilings.
3. Use the Fourier Basis.
4. Use the methods Hado referenced (actor-critic, Cacla, etc.).

These methods will work with most classic Q-learning variants.

Lastly, I will look into the methods presented in this book: http://www.dcsc.tudelft.nl/rlbook/
  

Thanks, everyone, for your time!

Regards,

Shashank

Vassilis Vassiliades

Jun 18, 2011, 10:21:11 AM
to rl-...@googlegroups.com
Hi Hado,

I would be very interested to read the survey you mentioned as well. Are you referring to your chapter in the upcoming book "Reinforcement Learning: State of the Art"?

Best,
Vassilis



George Konidaris

Jun 18, 2011, 6:29:31 PM
to rl-...@googlegroups.com
Hi Shashank,

The Fourier Basis paper has an empirical comparison between the FB and
a few linear function approximators (RBFs, the polynomial basis, and
PVFs, a learned basis) on some standard benchmarks.

I view RBFs as an improvement over CMACs because they are not piecewise
constant. A linear combination of piecewise constant basis functions
(BFs) is itself piecewise constant. Consider an RL agent representing a
VF and using a forward model to look one step ahead to select actions.
Such an agent essentially follows an increase in the value function, so
if it is on a value plateau (where the VF is piecewise constant), it
will select actions randomly until it leaves that plateau. Having a
function approximator that is guaranteed to produce this kind of
behavior is undesirable.

This difficulty is usually dealt with by using a very high resolution
tiling (so that the plateaus are small), but that sacrifices
generalization (and therefore damages the agent's "startup"
performance), so in many cases you end up hacking together a combination
of high- and low-resolution tiling that does the job, in the end, after
much engineering.

Function approximation is a very well studied topic in the applied
sciences, where it is classically accomplished using either a Fourier
Series approximation or a Wavelet approximation. The Fourier Series
approach has global support and is smooth, while the Wavelet approach
has local support and can be multiresolution.

RBFs form a fixed basis set similar to the Wavelet approach (Mahadevan
and Maggioni have Diffusion Wavelets as a learned basis set; similarly,
Mahadevan's PVFs are the learned generalization of the FB over graphs).
With RBFs, the local support of each basis function is a win because it
lets you represent discontinuities and fine local features in your value
function, but you must choose their positions in state space and their
radii. You can also choose to mix and match different sized RBFs, if you
like. Smaller ones give you less generalization and more resolution,
larger ones more generalization and less resolution. In my experience,
RBFs pretty consistently outperform CMACs.

The Fourier Basis is based on the Fourier Series approach. Every BF is
defined everywhere, but differs as to its frequency. Low-frequency BFs
contribute to smoother parts of the value function, and high frequency
BFs contribute to "spikier" parts. The only parameter you have to set is
your maximum frequency - which essentially describes how spiky a value
function you think you might need.

We've found the FB to generally work very reliably, across a fairly
large set of diverse projects. I used it exclusively in all of the work
in my thesis, for example.

As an aside, to represent a general continuous value function you need
to have infinitely many BFs. Therefore, all fixed linear BF schemes will
have at least one parameter that determines how to limit the BF set.

Additionally, all fixed linear function approximation schemes face a
combinatorial explosion in the number of state variables. If you have a
large number of state variables, and the number of BFs becomes far too
high, then you need to do one of three things: 1) assume that each state
variable contributes independently to the VF, and hope for the best; 2)
use a learned basis, like PVFs or Diffusion Wavelets, that is extracted
from sample data and scales with the sample data, not the number of
state variables; or 3) use a feature selection method. It doesn't sound
like you need to do any of these, though.

- Gdk.

--
George Konidaris
Postdoctoral Associate, MIT CSAIL
http://people.csail.mit.edu/gdk

Erik Schuitema

Jun 20, 2011, 9:25:06 AM
to rl-...@googlegroups.com
Hi George,

You're bringing up some interesting discussion points that I would like
to comment on, based on my experience with tile coding (which is close
to CMACs) for approximating high dimensional Q-functions of robotic tasks.

On 06/19/2011 12:29 AM, George Konidaris wrote:
> Hi Shashank,
>
> The Fourier Basis paper has an empirical comparison between the FB and
> a few linear function approximators: RBFs, the Polynomial Basis, PVFs
> (a learned basis) on some standard benchmarks.
>
> I view RBFs as an improvement over CMACs because they are not
> piecewise constant.

In high dimensional spaces, radial basis functions tend to have a
disadvantage over piecewise constant BFs such as CMACs because of the
relatively small volume they are influencing. Let's assume an RBF that
has decreasing values for increasing euclidean distance to its center,
such as a Gaussian bell shape. Compare this to a constant BF with the
shape of a hypercube with edge length R and dimension D. The hypercube
has volume R^D -- the volume that the constant BF is influencing. Now
consider a hypersphere that precisely fits in this hypercube, with
radius R/2. The hypersphere has volume (pi^(D/2) / Gamma(D/2 + 1)) *
(R/2)^D, a vanishing fraction of the cube's volume R^D as D grows
(nicely explained here, forgive me for not having a scientific
reference: http://spacemath.gsfc.nasa.gov/weekly/6Page89.pdf). This
means that by far the largest part of the hypercube's volume is in its
'corners' and not in its inner sphere, where the RBF has its main
influence. RBFs therefore tend to have a locally peaked influence
(even after normalization). At any
point in space, the approximated function tends to be dominated by a
single BF. Therefore, when traveling through the state space, your value
function approximated with RBFs might be smooth, but you're still
'jumping' from feature to feature in a sense. With CMACs, however, the
number of BFs influencing a function value is fixed and all BFs have
equal influence. Moving around in state space will change your function
value in discrete but gradual steps.
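A small numeric illustration of this point (a sketch; the D-ball volume formula pi^(D/2) r^D / Gamma(D/2 + 1) is standard):

```python
import math

def sphere_fraction(d):
    """Volume of the inscribed unit-diameter ball over the unit cube."""
    return math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)

for d in (2, 3, 5, 10):
    print(d, sphere_fraction(d))
# D=2: ~0.785, D=3: ~0.524, D=5: ~0.164, D=10: ~0.0025
```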

> A linear combination of piecewise constant basis functions (BFs) is
> itself piecewise constant. Consider an RL agent representing a VF and
> using a forward model to look one step ahead to select actions. Such
> an agent essentially follows an increase in the value function, so if
> it is on a value plateau (where the VF is piecewise constant), it will
> select actions randomly until it leaves that plateau. Having a
> function approximator that is guaranteed to produce this kind of
> feature is undesirable.
> This difficulty is usually dealt with by using a very high resolution
> tiling (so that the plateaus are small), but that sacrifices
> generalization (and therefore damages the agent's "startup"
> performance), so in many cases you end up hacking together a
> combination of high- and low-resolution tiling that does the job, in
> the end, after much engineering.

The role of these value plateaus and their size is very interesting. If
the goal is to learn a good policy, and not the (action-)value function
per se, I agree that it is essential for the agent to leave the plateau
in one time step in order to make a good decision. This suggests that an
appropriate 'plateau size' depends on (at least) two things: the actions
that you are considering, and the length of the time step. With discrete
actions, it seems sufficient that different actions lead to different
'plateaus' as to be able to differentiate between their values.
Similarly, larger time steps (assuming continuous system dynamics)
usually lead to larger jumps in state space, making the difference
between actions more noticeable; this suggests that increasing the time
step allows you to use larger features. Although I have used these ideas
while manually tuning my tile coding function approximators with
success, I unfortunately haven't been able to formalize these ideas
during my PhD, nor have I found papers that use these notions in, e.g.,
automatically adjusting the tiling resolution.

An additional comment: with tile coding (CMACs), generalization does
not need to be sacrificed for resolution. You can use large tiles and
many overlapping tilings to achieve both high resolution and wide
generalization.
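A sketch of that combination (illustrative names and parameter choices; a toy version, not production tile-coding code): several tilings, each offset slightly, with the value of a state being the sum of one weight per tiling.

```python
import numpy as np
from collections import defaultdict

N_TILINGS, TILES_PER_DIM = 8, 4
weights = defaultdict(float)  # sparse table: one weight per (tiling, tile)

def tile_indices(x):
    """Active tile (one per tiling) for a state x scaled into [0, 1]^d."""
    x = np.asarray(x, dtype=float)
    active = []
    for t in range(N_TILINGS):
        offset = t / (N_TILINGS * TILES_PER_DIM)  # shift each tiling a bit
        coords = tuple(np.floor((x + offset) * TILES_PER_DIM).astype(int))
        active.append((t, coords))
    return active

def value(x):
    return sum(weights[idx] for idx in tile_indices(x))

def update(x, target, alpha=0.1):
    """Move the approximated value at x towards target."""
    error = target - value(x)
    for idx in tile_indices(x):
        weights[idx] += alpha * error / N_TILINGS
```

Large tiles (small TILES_PER_DIM) give wide generalization, while many offset tilings make the effective resolution N_TILINGS times finer than a single tiling.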

Best regards,
Erik

Dietterich, Thomas G.

Jun 20, 2011, 3:20:37 PM
to rl-...@googlegroups.com
It is true that these hyperrectangles (in tile coding) have lots of
mass in their (exponentially many) corners. Do you have evidence that
this is desirable? It usually leads to poor performance (e.g., in
decision tree classifiers).

--Tom

--
Thomas G. Dietterich, Professor Voice: 541-737-5559
School of Electrical Engineering FAX: 541-737-1300
and Computer Science URL: http://eecs.oregonstate.edu/~tgd
US Mail: 1148 Kelley Engineering Center
Office: 2067 Kelley Engineering Center
Oregon State Univ., Corvallis, OR 97331-5501


Erik Schuitema - 3ME

Jun 20, 2011, 4:18:07 PM
to rl-...@googlegroups.com
The use of non-binary features has been described in the CMAC documentation here (page 7): http://www.ece.unh.edu/robots/unh_cmac.ps
Their conclusion was (quoting): "Paradoxically, learning system performance is usually worse when using these higher order sensitivity functions for input dimensions greater than two."

I have done experiments in simulation myself, where I replaced the binary features in tile coding with RBFs (zero-valued outside the hyperrectangle, just like in tile coding) and came to the same conclusion. In two dimensions, non-binary features would beat the binary ones, but in higher-dimensional learning problems, performance was (much) worse. Normalizing the BFs and using the infinity-norm distance (instead of Euclidean distance) improved my results. I did not run tests with RBFs that had non-zero values outside their hyperrectangle, since this would be computationally prohibitive for my target application.

Apart from the above empirical 'evidence', I would argue that more 'corner mass' means more generalization and thus faster learning. I do not have a theoretical argument about which method can best represent an arbitrary function (with the smallest integral of squared errors, for example) given an infinite number of learning samples (not that you would typically have infinitely many samples in high-dimensional spaces...).

Best regards,
Erik
